I'm currently benchmarking the Opus 4.7 reasoning curve against real-world tasks, and I've found that reasoning effort does not monotonically improve results (at least on the slice I'm looking at). I've been puzzling over this, but the fact that Claude Code has adaptive thinking may explain some of it: even at medium reasoning effort, the model can use more thinking tokens when it needs them to solve a complex problem.
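In case anyone wants to poke at the same curve, here's a minimal sketch of the sweep loop. The real benchmark ran through Claude Code, but the sketch below goes through the raw Anthropic Python SDK instead, approximating effort levels with extended-thinking budget_tokens caps; the model ID, the budget mapping, and the prompt are all placeholders, not what I actually used.

    import time
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    MODEL = "claude-opus-4-7"  # placeholder ID, not a real released model name
    EFFORT_BUDGETS = {         # assumed mapping from effort label to thinking-token cap
        "low": 4_000,
        "medium": 16_000,
        "high": 32_000,
    }
    PROMPT = "Refactor this resolver to avoid the N+1 query."  # placeholder task

    def run_task(prompt: str, effort: str) -> dict:
        budget = EFFORT_BUDGETS[effort]
        start = time.monotonic()
        resp = client.messages.create(
            model=MODEL,
            max_tokens=budget + 8_000,  # max_tokens must exceed the thinking budget
            thinking={"type": "enabled", "budget_tokens": budget},
            messages=[{"role": "user", "content": prompt}],
        )
        return {
            "effort": effort,
            "seconds": round(time.monotonic() - start, 1),
            "input_tokens": resp.usage.input_tokens,
            "output_tokens": resp.usage.output_tokens,
            "answer": "".join(b.text for b in resp.content if b.type == "text"),
        }

    for effort in EFFORT_BUDGETS:
        print(run_task(PROMPT, effort))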
Snapshot of the results (sorry for the busted formatting, I can't seem to format a good table in the comments; ask your LLM for dataviz):
Opus 4.7 on GraphQL-go-tools:
Low: 23/29 pass, 10/29 equivalent, 5/29 review-pass, custom avg 2.598, $2.50/task, 384s/task
Medium: 28/29 pass, 14/29 equivalent, 10/29 review-pass, custom avg 2.759, $3.15/task, 451s/task
High: 26/29 pass, 12/29 equivalent, 7/29 review-pass, custom avg 2.670, $5.01/task, 716s/task
Xhigh: 25/29 pass, 11/29 equivalent, 4/29 review-pass, custom avg 2.669, $6.51/task, 804s/task
Max: 27/29 pass, 13/29 equivalent, 8/29 review-pass, custom avg 2.690, $8.84/task, 997s/task
(custom avg is the mean over a set of LLM-as-a-judge rubrics, each graded out of 4)
Practically, the results indicate that medium gives better outcomes, or at least equivalent outcomes once you account for variance, than the higher reasoning efforts, at a much lower cost and runtime.
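To put a number on that tradeoff, here's a quick script over the figures above (pass-per-dollar and pass-per-minute are just my own efficiency metrics, not part of the benchmark):

    # Compute pass rate and efficiency metrics from the table above.
    results = {
        # effort: (passes, total, custom_avg, usd_per_task, sec_per_task)
        "low":    (23, 29, 2.598, 2.50, 384),
        "medium": (28, 29, 2.759, 3.15, 451),
        "high":   (26, 29, 2.670, 5.01, 716),
        "xhigh":  (25, 29, 2.669, 6.51, 804),
        "max":    (27, 29, 2.690, 8.84, 997),
    }

    for effort, (passed, total, avg, cost, secs) in results.items():
        rate = passed / total
        print(f"{effort:>6}: pass={rate:.0%}  custom avg={avg:.3f}  "
              f"pass/$={rate / cost:.3f}  pass/min={rate / (secs / 60):.3f}")

On these numbers, medium is the best absolute performer and essentially ties low on pass-per-dollar, while everything above it costs more for less.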
I think it is. We've been using it at my day job, and we regularly choose Sonnet 4.6 for well-scoped things. Opus 4.6 was good, but the 4.7 Opus model burns so many tokens and dollars that it's just not worth it given the incremental improvement in results.
They also changed how they count tokens, so you could end up with less reasoning while paying for more tokens. Anthropic's profit margin is definitely higher on 4.7 than it was on 4.6. I'm pretty sure that was the main driver behind this update.