We dug into 17 billion tokens of behavioral data across 413K AI agent trajectories (CoderForge-Preview) attempting real GitHub issues. Instead of just looking at final SWE-bench scores, we compared successful runs against failing runs on the exact same problem to filter out task-difficulty confounds.
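For the curious, the matched pass-vs-fail comparison works roughly like this sketch. The trajectory schema here (`task_id`, `success`, metric keys) is invented for illustration; it is not the actual CoderForge log format.

```python
from collections import defaultdict
from statistics import mean

def matched_comparison(trajectories, metric):
    """Compare a per-run metric between passing and failing runs of the
    *same* task, which controls for task difficulty.
    `trajectories` is assumed to be a list of dicts with 'task_id',
    'success' (bool), and the metric key (hypothetical schema)."""
    by_task = defaultdict(lambda: {"pass": [], "fail": []})
    for t in trajectories:
        bucket = "pass" if t["success"] else "fail"
        by_task[t["task_id"]][bucket].append(t[metric])
    # Keep only tasks that have both outcomes, then average the
    # within-task difference (pass minus fail).
    diffs = [
        mean(g["pass"]) - mean(g["fail"])
        for g in by_task.values()
        if g["pass"] and g["fail"]
    ]
    return mean(diffs) if diffs else None
```

Tasks with only one outcome drop out entirely, which is the point: any metric difference that survives is about behavior, not about which problems happened to be easy.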
The biggest surprise? Agents are not junior developers, and prompting them to act like humans actively hurts their success rate.
Here is what the data actually shows:
Human exploration rituals predict failure: "View-before-edit" and "grep-before-edit" are negatively correlated with success. Humans do this to build mental models. Agents already have the codebase in their context window; if they are heavily grepping, they aren't learning, they're flailing.
TDD is the ultimate predictor of success: The single strongest behavioral signal of a passing agent is the fraction of early bash commands dedicated exclusively to running the test suite.
The Single Responsibility Principle is law: Agents that scatter edits across 3 or more files in the first 30% of their run see their success rate plummet. Successful agents fix one targeted thing at a time.
Perseverance is a myth: If an agent runs the exact same bash command twice early on, it’s a massive failure signal. They don't adapt; they just get stuck in a loop.
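The four signals above can be computed from a trajectory log along these lines. The action schema, tool names, and test-command patterns are all guesses for illustration; the real log format isn't public.

```python
import re

# Illustrative patterns for "running the test suite"; the actual
# commands agents use in the dataset are unknown.
TEST_CMD = re.compile(r"\b(pytest|tox|make test)\b")

def behavioral_signals(trajectory, early_frac=0.3):
    """Compute the four signals for one run. `trajectory` is assumed
    to be an ordered list of action dicts like
    {"tool": "bash", "cmd": "pytest"} or {"tool": "edit", "path": "a.py"}
    (hypothetical schema)."""
    early = trajectory[:max(1, int(len(trajectory) * early_frac))]
    bash = [a["cmd"] for a in early if a["tool"] == "bash"]

    # 1. TDD signal: share of early bash commands that run tests.
    tdd = sum(bool(TEST_CMD.search(c)) for c in bash) / len(bash) if bash else 0.0
    # 2. Scatter signal: distinct files edited in the early window.
    scatter = len({a["path"] for a in early if a["tool"] == "edit"})
    # 3. Loop signal: any verbatim-repeated early bash command.
    looped = len(set(bash)) < len(bash)
    # 4. View/grep-before-edit: fraction of edits preceded by a view
    #    or grep of the same file anywhere earlier in the run.
    seen, edits, preceded = set(), 0, 0
    for a in trajectory:
        if a["tool"] in ("view", "grep"):
            seen.add(a["path"])
        elif a["tool"] == "edit":
            edits += 1
            preceded += a["path"] in seen
    vbe = preceded / edits if edits else 0.0

    return {"tdd_fraction": tdd, "early_files_edited": scatter,
            "early_repeat": looped, "view_before_edit": vbe}
```

Per the findings: you want `tdd_fraction` high, and `early_files_edited`, `early_repeat`, and `view_before_edit` low.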
"This post is the first in a series. We are extending this analysis to more realistic workloads beyond artificial SWE benchmarks. Follow the account and stay tuned.---"
Hey HN,
Check out the article for the full analysis!
Did something get cut off at the end?
Actually no, I think the "---" was just typed by mistake XD
Well, these days all eyes are on dashes... You commonly see "--" when humans want an em dash, but "---" is unusual.