We dug into 17 billion tokens of behavioral data across 413K AI agent trajectories (CoderForge-Preview) attempting real GitHub issues. Instead of just looking at final SWE-bench scores, we compared successful runs against failing runs on the exact same problem to filter out task-difficulty confounds.
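For the curious, the matched pass-vs-fail comparison works roughly like this sketch. The trajectory schema here (`task_id`, `success`, metric keys) is invented for illustration; it is not the actual CoderForge log format.

```python
from collections import defaultdict
from statistics import mean

def matched_comparison(trajectories, metric):
    """Compare a per-run metric between passing and failing runs of the
    *same* task, which controls for task difficulty.
    `trajectories` is assumed to be a list of dicts with 'task_id',
    'success' (bool), and the metric key (hypothetical schema)."""
    by_task = defaultdict(lambda: {"pass": [], "fail": []})
    for t in trajectories:
        bucket = "pass" if t["success"] else "fail"
        by_task[t["task_id"]][bucket].append(t[metric])
    # Keep only tasks that have both outcomes, then average the
    # within-task difference (pass minus fail).
    diffs = [
        mean(g["pass"]) - mean(g["fail"])
        for g in by_task.values()
        if g["pass"] and g["fail"]
    ]
    return mean(diffs) if diffs else None
```

Tasks with only one outcome drop out entirely, which is the point: any metric difference that survives is about behavior, not about which problems happened to be easy.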
The biggest surprise? Agents are not junior developers, and prompting them to act like humans actively hurts their success rate.
Here is what the data actually shows:
Human exploration rituals predict failure: "View-before-edit" and "grep-before-edit" are negatively correlated with success. Humans do this to build mental models. Agents already have the codebase in their context window; if they are heavily grepping, they aren't learning, they're flailing.
TDD is the ultimate predictor of success: The single strongest behavioral signal of a passing agent is the fraction of early bash commands dedicated exclusively to running the test suite.
The Single Responsibility Principle is law: Agents that scatter edits across 3 or more files in the first 30% of their run see their success rate plummet. Successful agents fix one targeted thing at a time.
Perseverance is a myth: If an agent runs the exact same bash command twice early on, it’s a massive failure signal. They don't adapt; they just get stuck in a loop.
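The four signals above can be computed from a trajectory log along these lines. The action schema, tool names, and test-command patterns are all guesses for illustration; the real log format isn't public.

```python
import re

# Illustrative patterns for "running the test suite"; the actual
# commands agents use in the dataset are unknown.
TEST_CMD = re.compile(r"\b(pytest|tox|make test)\b")

def behavioral_signals(trajectory, early_frac=0.3):
    """Compute the four signals for one run. `trajectory` is assumed
    to be an ordered list of action dicts like
    {"tool": "bash", "cmd": "pytest"} or {"tool": "edit", "path": "a.py"}
    (hypothetical schema)."""
    early = trajectory[:max(1, int(len(trajectory) * early_frac))]
    bash = [a["cmd"] for a in early if a["tool"] == "bash"]

    # 1. TDD signal: share of early bash commands that run tests.
    tdd = sum(bool(TEST_CMD.search(c)) for c in bash) / len(bash) if bash else 0.0
    # 2. Scatter signal: distinct files edited in the early window.
    scatter = len({a["path"] for a in early if a["tool"] == "edit"})
    # 3. Loop signal: any verbatim-repeated early bash command.
    looped = len(set(bash)) < len(bash)
    # 4. View/grep-before-edit: fraction of edits preceded by a view
    #    or grep of the same file anywhere earlier in the run.
    seen, edits, preceded = set(), 0, 0
    for a in trajectory:
        if a["tool"] in ("view", "grep"):
            seen.add(a["path"])
        elif a["tool"] == "edit":
            edits += 1
            preceded += a["path"] in seen
    vbe = preceded / edits if edits else 0.0

    return {"tdd_fraction": tdd, "early_files_edited": scatter,
            "early_repeat": looped, "view_before_edit": vbe}
```

Per the findings: you want `tdd_fraction` high, and `early_files_edited`, `early_repeat`, and `view_before_edit` low.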
"This post is the first in a series. We are extending this analysis to more realistic workloads beyond artificial SWE benchmarks. Follow the account and stay tuned.---"
Hey HN,
Check out the article for the full analysis!
Did something get cut off at the end?
Actually no, I think the "---" was just typed by mistake XD
Well, these days all eyes are on dashes... You commonly see "--" when humans want an em dash, but "---" is unusual.