I'm currently benchmarking the Opus 4.7 reasoning curve against real-world tasks, and I've found that reasoning effort does not monotonically improve results (at least on the slice I'm looking at). I've been puzzling over this, but the fact that Claude Code has adaptive thinking may explain some of it: even at medium reasoning effort, the model can use more thinking tokens when it needs them to solve a complex problem.
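In case anyone wants to poke at the same curve, here's a minimal sketch of the sweep loop. The real benchmark ran through Claude Code, but the sketch below goes through the raw Anthropic Python SDK instead, approximating effort levels with extended-thinking budget_tokens caps; the model ID, the budget mapping, and the prompt are all placeholders, not what I actually used.

    import time
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    MODEL = "claude-opus-4-7"  # placeholder ID, not a real released model name
    EFFORT_BUDGETS = {         # assumed mapping from effort label to thinking-token cap
        "low": 4_000,
        "medium": 16_000,
        "high": 32_000,
    }
    PROMPT = "Refactor this resolver to avoid the N+1 query."  # placeholder task

    def run_task(prompt: str, effort: str) -> dict:
        budget = EFFORT_BUDGETS[effort]
        start = time.monotonic()
        resp = client.messages.create(
            model=MODEL,
            max_tokens=budget + 8_000,  # max_tokens must exceed the thinking budget
            thinking={"type": "enabled", "budget_tokens": budget},
            messages=[{"role": "user", "content": prompt}],
        )
        return {
            "effort": effort,
            "seconds": round(time.monotonic() - start, 1),
            "input_tokens": resp.usage.input_tokens,
            "output_tokens": resp.usage.output_tokens,
            "answer": "".join(b.text for b in resp.content if b.type == "text"),
        }

    for effort in EFFORT_BUDGETS:
        print(run_task(PROMPT, effort))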
Snapshot of the results (sorry for the busted formatting, I can't seem to format a good table in the comments; ask your LLM for dataviz):
Opus 4.7 on GraphQL-go-tools:
Low: 23/29 pass, 10/29 equivalent, 5/29 review-pass, custom avg 2.598, $2.50/task, 384s/task
Medium: 28/29 pass, 14/29 equivalent, 10/29 review-pass, custom avg 2.759, $3.15/task, 451s/task
High: 26/29 pass, 12/29 equivalent, 7/29 review-pass, custom avg 2.670, $5.01/task, 716s/task
Xhigh: 25/29 pass, 11/29 equivalent, 4/29 review-pass, custom avg 2.669, $6.51/task, 804s/task
Max: 27/29 pass, 13/29 equivalent, 8/29 review-pass, custom avg 2.690, $8.84/task, 997s/task
(custom avg is the mean over a set of LLM-as-a-judge rubrics, each graded out of 4)
Practically, the results indicate that medium gives better outcomes, or at least equivalent outcomes once you account for variance, than the higher reasoning efforts, at a much lower cost and runtime.
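To put a number on that tradeoff, here's a quick script over the figures above (pass-per-dollar and pass-per-minute are just my own efficiency metrics, not part of the benchmark):

    # Compute pass rate and efficiency metrics from the table above.
    results = {
        # effort: (passes, total, custom_avg, usd_per_task, sec_per_task)
        "low":    (23, 29, 2.598, 2.50, 384),
        "medium": (28, 29, 2.759, 3.15, 451),
        "high":   (26, 29, 2.670, 5.01, 716),
        "xhigh":  (25, 29, 2.669, 6.51, 804),
        "max":    (27, 29, 2.690, 8.84, 997),
    }

    for effort, (passed, total, avg, cost, secs) in results.items():
        rate = passed / total
        print(f"{effort:>6}: pass={rate:.0%}  custom avg={avg:.3f}  "
              f"pass/$={rate / cost:.3f}  pass/min={rate / (secs / 60):.3f}")

On these numbers, medium is the best absolute performer and essentially ties low on pass-per-dollar, while everything above it costs more for less.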
I think it is. We've been using it at my day job, and we regularly choose Sonnet 4.6 for well-scoped things. Opus 4.6 was good, but the 4.7 Opus model burns so many tokens and dollars that it's just not worth it given the incremental improvement in results.
They also changed how they count tokens, so you could end up with less reasoning while paying for more tokens. Anthropic's profit margin is definitely higher on 4.7 than it was on 4.6. I'm pretty sure that was the main driver behind this update.