Yeah, I did notice a drop in quality. At first I dismissed it as a side effect of saving the hard tickets for last, but I keep finding myself having to explain how to do things better.
Often I ask it for approaches to doing something, and it gives me, say, three options. With 4.6, two of those options were usually good. With 4.7, the options were all bad, and I'd have to point out that maybe it hadn't considered such-and-such option.
I never really understood these benchmarks. Unless there's an increase of 2x or so in a benchmark, it never actually reflects real-world performance.
Anthropic provides details on the differences between Opus 4.7 and 4.6, including that Opus 4.7 doesn't call tools as frequently as 4.6 because it is more capable. Depending on the task at hand, that could be a good thing or a not-so-good thing [1].
For example, regarding instruction following:
Claude Opus 4.7 interprets prompts more literally and explicitly than Claude Opus 4.6, particularly at lower effort levels. It will not silently generalize an instruction from one item to another, and it will not infer requests you didn't make.
[1]: https://platform.claude.com/docs/en/build-with-claude/prompt...
The one-shot rate doesn't factor in context size directly; it just tracks whether an edit succeeded without retries. That said, a detailed CLAUDE.md probably helps both models equally, since the context is the same either way. It would be interesting to isolate that, though.
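To make the metric concrete, here is a minimal sketch of how a one-shot rate like this could be computed. The record shape, the field names, and the sample numbers are all assumptions made up for illustration; this is not the actual tracking code or real data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EditAttempt:
    """One logical edit, as it might appear in a hypothetical session log."""
    model: str    # e.g. "4.6" or "4.7"
    edit_id: str  # hypothetical identifier for the edit
    retries: int  # how many retries were needed before the edit succeeded


def one_shot_rate(attempts: List[EditAttempt], model: str) -> Optional[float]:
    """Fraction of a model's edits that succeeded with zero retries.

    Returns None when the log holds no edits for that model.
    """
    own = [a for a in attempts if a.model == model]
    if not own:
        return None
    first_try = sum(1 for a in own if a.retries == 0)
    return first_try / len(own)


# Made-up sample log, purely illustrative (not measured results).
log = [
    EditAttempt("4.6", "a", 0),
    EditAttempt("4.6", "b", 0),
    EditAttempt("4.6", "c", 1),
    EditAttempt("4.6", "d", 0),
    EditAttempt("4.7", "e", 0),
    EditAttempt("4.7", "f", 2),
    EditAttempt("4.7", "g", 1),
    EditAttempt("4.7", "h", 0),
]

for name in ("4.6", "4.7"):
    print(name, one_shot_rate(log, name))
```

Note that nothing in the record captures context size, which is why the rate cannot separate the effect of a detailed CLAUDE.md from the model itself. Isolating it would need an extra field (for example a context token count) and a comparison within buckets of that field.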
I have started rolling back to 4.6 for some important tasks, since I have been working with it for a long time, but I am still using 4.7 for some fresh tasks.
On the fewer tools per turn, yeah, I think that lines up with what the other reply mentioned about 4.7 being more "in its head." I have not specifically tracked hallucinated project structure, but the higher retry rate suggests it is getting things wrong more often when it skips the read step.