Really interesting note. That echoes thoughts I’ve had about how much automated benchmark scores really reflect production‑ready code.
For me the big takeaway is that passing doesn't automatically mean the code is maintainable, follows established patterns and conventions, or is free of the unexpected side effects that real reviewers care about.
Do these benchmarks make any sense? I tried a few local models that seem to be scoring well in SWE but the results were pure rubbish. (For instance MiniMax-M2.5 at 128GB from unslothed - completely unusable).
SWE-bench rewards the narrow task of making tests pass, which means models get good at exactly that. Real codebases have style constraints, architecture choices, and maintainability concerns that don't show up in any test suite. Not surprised at all that the PRs wouldn't get merged; you'd expect that from an eval that can't measure what reviewers actually care about.
Which quant? I find folks running lower quants complaining when they should be running a higher quant. Qwen3CoderNext is great, even at Q6. I mistakenly had it loaded for an agentic workflow and was surprised at how well it did.
What is "lower quant"? What is "higher quant"? I mean, I know what they are, but the very people you intend to reach don't know the difference between Q4_K_M and Q6_K and blog posts like [1] have nuggets like "For tests of the type ran here, there appear to be major diminishing returns past Q4".
This makes sense to me based on personal experience. LLMs will do anything to pass tests and get a working result, and they will do very weird things in order to get there. For fun I've tried to get them to do stuff while being purposely ambiguous about the implementation details, and sometimes the stuff they do makes me literally laugh out loud. They can write some very strange code.
But hey, the tests pass!
If I force it to use plan mode for everything and babysit it, it can work really well, but it's really just acting as a faster typer for me, which is great. But it requires an experienced dev steering it.
Yeah this matches what we've seen too. The biggest gains we got weren't from switching models; they were from investing in better context: giving the agent a well-structured spec, relevant code samples from the repo, and explicit constraints upfront. Without that, even the best models will happily produce working but unmaintainable code. Feels like the whole SWE-bench framing misses this: passing tests is the easy part, and fitting into an existing codebase's patterns and conventions is where it actually gets hard.
They might have tried, but this would be pretty hard to achieve for real - especially for the older/worse models. For changes that do more than alter a couple of lines, LLM output can be very obvious. Stripping all comments from the changeset might go a long way to making it more blind, but then you're missing context that you kinda need to review the code properly.
I feel like I don't have the context for this conversation. If slop is obvious as slop, I feel like we should block it.
If you look at the comment it says what the code following the comment does. It doesn't matter whether it is a human or a machine that wrote it. It is useless. It is actually worse than useless because if someone needs to change the code, now they need to change two things. So in that sense, you just made twice the work for anyone who touches the code after you and for what benefit?
The point is that AI models do these kinds of things all the time. They're not really all that smart or intelligent, they just replicate patterns or boilerplate and then iterate until it sort of appears to work properly.
Anecdote time! I had Codex GPT 5.4 xhigh generate a Rust proc macro. It's pretty straightforward: use sqlparser to parse a SQL statement and extract the column names of any row-producing queries.
It generated an implementation that worked well, but I hated the ~480 lines of code. The structure and flow were just... weird. It was hard to follow and I was seriously bugged by it.
So I asked it to reimplement it with some simplifications I gave it. It dutifully executed, producing a result >600 lines long. The flow was simpler and easier to follow, but still seemed excessive for the task at hand.
So I rolled up my sleeves and started deleting code and making changes manually. A little bit later, I had it down to <230 lines with a flow that was extremely easy to read and understand.
So yeah, I can totally see many SWE-bench-passing PRs being functionally correct but still terrible code that I would not accept.
If you've got some time, I highly recommend going through the exercise of trying to change the prompt in a way that would produce code similar to what you've achieved manually. Doing a similar exercise really helps to improve agent prompting skills, as it shows how changing parts of the prompt influences the result.
I haven’t had any luck prompting LLMs to “have taste.” They seem to over-fixate on instructions (e.g. golfing when asked for concise code) or require specifying so many details and qualifications that the results no longer generalize well to other problems.
Do you have any examples or resources that worked well for you?
Yeah I had a similar experience on a smaller scale, reducing a function from 125 lines to 25.
This matches what I've seen in practice. The failure mode isn't wrong code - it's code that solves the problem in a way no human would choose. Unnecessary abstractions, ignoring existing patterns in the codebase, fixing the symptom instead of the root cause. Tests pass but the PR makes the codebase worse.
SWE-bench measures "does the patch work" but the actual bar for merging is "does this look like something a team member wrote who understands the project."
> it's code that solves the problem in a way no human would choose
but is it better than the way a human would choose? And does it matter?
A compiler may write assembly in a way that no humans would choose either. And in the early days of compilers, where most programmers would still hand-weave assembly, they would scoff at these generated assemblies as being bad.
Not to mention that in games like Go, the "AI" choosing moves that no humans would choose meant it surpassed humans!
In other words, solving a problem "in a way humans would choose to" is distinct from just solving a problem, and imho, not always required at all.
Is this, along with the comments by the other green usernames on this post, an AI-generated comment? Apologies if it isn't, AIs are trained on human writing and all that, but they're jumping out at me.
Edit: I see another green comment was flagged for AI, might be indicative of something, but why so many green comments on this thread specifically?
Green username just means new user (under 1 month iirc)
* Dashes
* Triplets
* X isn't Y, it's Z
* X but Y
* Wording that looks good at first pass, but when you read closely actually makes no sense in the context of the discussion: "fixing the symptom instead of the root cause"
Flagged.
> makes no sense in the context of the discussion: "fixing the symptom instead of the root cause"
What's wrong with that?
Wait, are regular dashes (not em-dashes) now considered a sign of AI slop? I've been using dashes since forever.
~The comment you're replying to doesn't have any sentence of the form "X isn't Y, it's Z". It has "It's not X - it's Y".~ I see it now - it does have one "X isn't Y, it's Z" but that's hardly conclusive IMO.
While the comment does have "X but Y", it has a consistent mistake in punctuation - "X, but Y" would be the correct form, wouldn't it? If an LLM produced this, I wouldn't expect the missing punctuation.
How does "fixing the symptom instead of the root cause" not make sense in the context of this discussion, which is about coding agents producing marginal PRs?
You’re being too paranoid
I'm very much an AI bear but I do think one interesting outcome is going to be that LLMs will stumble upon some weird ways of doing things that no human would have chosen that turn out to be better (Duff's device-level stuff) and they will end up entering the human programming lexicon.
These are the same kinds of issues often seen with human junior engineer work.
Lints, beautifiers, better tests?
Eh, but if you're in an organization you tune your AGENTS.md, CLAUDE.md, AI code reviews, etc. to have your human-driven or automated AI-generated code fit the standards of your organization. I don't need models to be smart enough to aggressively try to divine what the organization wants them to do; the users will indeed make that happen. So this post is maybe a little bit over the top.
I am literally right now tuning my PR, Claude instructions, and PR instructions to match our standards.
Funnily enough, I'm having the opposite problem where Claude is lowering its rating of my PR because my testing, documentation, and error handling are better than the other code in the repository, so it doesn't match and therefore gets a worse grade.
I don't need it to try any harder without explicit instructions.
This aligns with something I've been noticing in practice: passing tests and being mergeable are fundamentally different quality bars. Tests verify behavior, but code review evaluates maintainability, readability, and whether the solution fits the broader architecture.
The SWE-bench metric essentially measures "can the AI produce a patch that makes tests pass" — which is closer to "junior developer who got the ticket done" than "experienced engineer who shipped clean code." The gap between those two is exactly where most code review friction lives.
What concerns me more is that as teams start using these benchmarks to evaluate AI coding tools, they might optimize for the wrong thing. A tool that produces mergeable PRs 40% of the time is arguably more valuable than one that passes tests 80% of the time but generates code that requires significant rework. We need benchmarks that capture the full review cycle, not just the CI pipeline.
Would be interesting to see alternative scoring besides “tests pass”, e.g. diff size, abstraction depth added/removed, or whether the solution introduces new modules/dependencies. That would let us see if “unmergeable” PRs correlate with simple structural signals.
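For what it's worth, a couple of the structural signals mentioned above (diff size, new files) are cheap to pull out of a patch. Here is a minimal Python sketch, assuming git-style unified diffs. `DiffStats` and `diff_stats` are hypothetical names made up for illustration, and abstraction depth or new dependencies would need real parsing on top of this:

```python
from dataclasses import dataclass


@dataclass
class DiffStats:
    added: int = 0          # lines added inside hunks
    removed: int = 0        # lines removed inside hunks
    files_touched: int = 0  # number of "diff --git" sections
    new_files: int = 0      # sections marked "new file mode"


def diff_stats(unified_diff: str) -> DiffStats:
    """Collect crude structural signals from a git-style unified diff."""
    stats = DiffStats()
    in_hunk = False  # True once we are past a file's headers
    for line in unified_diff.splitlines():
        if line.startswith("diff --git "):
            stats.files_touched += 1
            in_hunk = False
        elif line.startswith("@@"):
            in_hunk = True
        elif not in_hunk:
            # Header area: "---"/"+++" lines here are file names, not content.
            if line.startswith("new file mode"):
                stats.new_files += 1
        elif line.startswith("+"):
            stats.added += 1
        elif line.startswith("-"):
            stats.removed += 1
    return stats
```

One could then compare these numbers between the agent's patch and the original human PR for the same task and see whether the rejected patches stand out.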
makes sense! we wrote something yesterday about the weaknesses of test-based evals like swe-bench [1]
they are definitely useful but they miss the things that are hard to encode in tests, like spec/intent alignment, scope creep, adherence to codebase patterns, team preferences (risk tolerance, etc)
and those factors are really important. which means that test-evals should be relied upon more as weak/directional priors than as definitive measures of real-world usefulness
[1] https://voratiq.com/blog/test-evals-are-not-enough/
I've been working on building out "evals for your repo" based on the theory that commonly used benchmarks like SWE-bench are broken as they are not testing the right / valuable things, and are baked into the training data (see OpenAI's research on this here https://openai.com/index/why-we-no-longer-evaluate-swe-bench...)
Interestingly, I had a similar finding where, on the 3 open-source repos I ran evals on, the models (5.1-codex-mini, 5.3-codex, 5.4) all had relatively similar test scores, but when looking at other metrics, such as code quality, or equivalence to the original PR the task was based on, they had massive differences. posted results here if anyone is curious https://www.stet.sh/leaderboard
This sounds amazing. In particular, I like comps to existing PRs. But I’m also not sure that I want existing PRs to be a template for most things reasonable or best practice.
I’ve been building out internal linters that enforce design patterns I want and raise common code smells (also note tools like eslint allow custom rules which are easy to write with something like opus 4.6). The use case is a total refactor of React and FastAPI apps. We are suffering from “everything’s a snowflake” syndrome and just want the same pattern employed across features.
This works pretty well when the linter has a companion agents.md file which explains the architecture and way about the world.
I can get the agent (Claude code opus 4.6 currently) to nail the directory structure and design primitives, and to limit some doofus behavior, but I still haven’t cracked how to make literally each line of code simple and sensible. And I haven’t figured out how to prevent agents from going out of bounds and doing weird things unless I catch it in review and add another rule.
This is a relatively new endeavor, but my gut is that it’s not much more time (linter rules and perhaps “evals” or a beefy agent review cycle) before I have bespoke linters in place that force what I want from our architecture.
Note that a huge bottleneck to all of this is that the codebase our current team inherited has no tests. It’s too easy to accidentally nuke a screen’s subtle details. It’s also really hard to write good tests without knowing what all of the functionality is. It feels like a lot of large-swath agentic changes are blocked on having a test strategy or solution first, and only then a rigid push for rearchitecture or new design.
Nice, I really like your idea. First I've heard of something like that
Working on that too. Lmk if you’re up for a chat?
yea I'm down - feel free to send me an email ben@benr.build
> mid-2024 agents
Is this a post about AI archeology?
It's more about the test than the AI.
For the most part, I think the tests AIs have been given have been appropriately designed. At release, many AIs do poorly at them, then the models rapidly catch up until the point where a new test is needed.
They should be measuring close to the limits of ability like that.
There will be some that try and steal headlines by targeting the specific nature of the test, but that is not a long-term winning solution, because the tests keep getting harder. If they make a model good at every test it has seen without regression, then with enough tests, that too ceases to be a problem.
Perhaps there should be an aggregate AI test score that evaluates all of the tests released in a given year. If a model passes the latest test really well but does worse at TestSet2024 than the models before, it would perhaps indicate the model being trained to pass the latest cool test.
There is a problem with people interpreting an AI that passes a test of X, Y, or Z as indicating that the AI has the abilities of a human who passes X, Y, or Z. You should tell people who say that: Kasparov makes a nice coffee.
LLM-written code passed SWE-bench even back then. This may just say that SWE-bench is an inadequate test, and should not be used for serious evaluation.
There needs to be a measure (or measures) of the entropy of a codebase that provides a signal of complexity. When you're paying for every token, you want code patterns that convey a lot of immediate information to the agent so that it can either repeat the pattern, or extend it in a way that makes sense. This is probably the next wave of assisted coding (imo), because we're at the stage where writing code works, the quality is mostly decent, but it can be needlessly complex given the context of the existing repo.
There's a way to measure "entropy" of a codebase. Take something like the binary lambda calculus or the triage calculus, convert your program (including libraries, programming language constructs, operating system) into it, and measure the size of the program in it in bits.
You can also measure the cross-entropy, which is essentially the whole-program entropy above minus the entropy of the programming language and functions from standard libraries (i.e. abstractions that you assume are generally known). This is useful to evaluate the conformance to "standard" abstractions.
There is also a way to measure a "maximum entropy" using types, by counting the number of states a data type can represent. The maximum entropy of a function is a cross-entropy between inputs and outputs (treating the function like a communication channel).
The "difference" (I am not sure how to make them convertible) between "maximum entropy" and "function entropy" (size in bits) then shows how good your understanding of the function is (compared to the specification expressed in the type signature).
I have been advocating for some time that we use entropy measures (and information theory) in SW engineering to do estimation of complexity (and thus time required for a change).
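To make the type-counting part concrete, here is a minimal Python sketch, assuming "maximum entropy" of a type just means log2 of the number of values it can hold. The helper names (`bits`, `product_states`, `sum_states`) are hypothetical, and this covers only the state counting, not the program-size or cross-entropy measures described above:

```python
import math


def bits(states: int) -> float:
    """Bits needed to distinguish `states` equally likely values."""
    return math.log2(states)


def product_states(*field_states: int) -> int:
    """A struct/tuple can hold every combination of its fields' states."""
    return math.prod(field_states)


def sum_states(*variant_states: int) -> int:
    """An enum/tagged union holds exactly one variant at a time."""
    return sum(variant_states)


BOOL = 2   # a boolean has two states
U8 = 256   # an unsigned byte has 256 states

print(bits(BOOL))                      # 1.0
print(bits(product_states(BOOL, U8)))  # 9.0
print(bits(sum_states(1, U8)))         # an optional byte: slightly more than 8 bits
```

Under this reading, a function from (bool, u8) to bool has at most 9 bits going in and 1 bit coming out, which is the kind of bound the type signature alone gives you.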
Maybe cyclomatic complexity would be a good proxy. It can obviously be gamed but it's obvious when it is
There was a measure used during the Toyota Unintended Acceleration case called McCabe Cyclomatic Complexity. I wonder if anyone is using it alongside AI-assisted code.
It is roughly equivalent to diff size: https://entropicthoughts.com/lines-of-code
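As a rough illustration of the cyclomatic complexity idea raised above, here is a Python sketch that approximates McCabe's number by counting branch points in the AST. It is an approximation under stated assumptions (1 plus the number of decision points; `match` statements and some other constructs are ignored), and `cyclomatic_complexity` is a hypothetical helper, not any existing tool's API:

```python
import ast

# Node types that each add one independent path (an assumption of this sketch).
_BRANCH_NODES = (
    ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler, ast.IfExp,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity of a piece of Python source."""
    complexity = 1  # a straight-line program has exactly one path
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` short-circuits at two points
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # the implicit loop, plus one per `if` filter
            complexity += 1 + len(node.ifs)
    return complexity


example = """
def f(x):
    if x > 0 and x < 10:
        return 1
    for i in range(x):
        if i % 2:
            return i
    return 0
"""
print(cyclomatic_complexity(example))  # 5
```

Running something like this on both the agent's version and the hand-trimmed version of the same function would at least show whether the number moves in the direction reviewers feel it should.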
I mean, it's ultimately a string, and the measurement of the entropy of a string is well-studied. The LLM might start gaming that with variable names so you'd need to do the AST instead. I may actually try something like that; cool idea.
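A minimal sketch of the "do the AST instead" idea, in Python: compute the Shannon entropy of the sequence of AST node types, so renaming variables cannot change the score. This is only one possible reading of the suggestion, the function names are hypothetical, and the result is bits per node rather than a total:

```python
import ast
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(symbols: Iterable[Hashable]) -> float:
    """Shannon entropy, in bits per symbol, of a sequence of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ast_node_entropy(source: str) -> float:
    """Entropy of AST node type names; identifiers are ignored."""
    tree = ast.parse(source)
    return shannon_entropy(type(node).__name__ for node in ast.walk(tree))


short = "a = 1 + 2"
renamed = "some_very_long_variable_name = 1 + 2"
print(ast_node_entropy(short) == ast_node_entropy(renamed))  # True
```

It still ignores ordering and nesting, so it is a crude signal at best. Compressing a serialized AST would be another option to try.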
This paper doesn’t really tell us much. The cutoff was September of 2025. The models have improved so much that I just don’t know what you can take away from this experiment.
The test is supposed to be a proxy.
I was totally aligned until I saw the refusal for a comment in the code. When the refusals are pedantic like that, it just weakens the overall findings significantly.
Yeah, why be such a tryhard? Keeping PR friction down is what matters. Just let the codebase slowly deteriorate. It'll be fine.
This seems like an important caveat to SWE-bench, but the trend is still clearly AI becoming more and more capable.
I think a far greater problem is the human psychological and prejudice factor itself. When we hear that a PR had AI assistance, we usually jump straight to thinking "oh my god, is it another piece of LLM slop" (for example: https://github.com/jneem/imbl/pull/149#pullrequestreview-370...). I do use AI, but I review the code before I push it; most people don't. Once there is a trend, it is easy to form a prejudice and hard to go back, unless there is a substantial improvement in both quality and quantity.
Also, some people will say outright that they reject any AI code, but most maintainers use the silent treatment. And then when you demand them to review, they either close it or use "I'm too busy" as an argument. I would call this one of the biggest dick move, because it hurts the most, yet you can't find anything wrong with them until they reveal their motives.
> I would call this one of the biggest dick move
I don’t think that’s a fair characterization. You don’t know if the maintainer/reviewer is overloaded. No one is obligated to accept/review PRs and there is no question that the amount of noise has gone up. You are not the main character in that story, so to speak.
>And still, I really hate writing those PR descriptions. Yet you can't just leave it empty.
If you can't write a description in your own words explaining why you're doing it, why should they take the time reviewing it (which they did on the same day you posted it, btw, even if one of them wasn't pleased)? It makes it seem much less likely that you read the code yourself.
> And then when you demand them to review
You might want to think carefully about why you chose to use the word "demand" there.
(Personally, if I'm rejecting AI slop, I'm not going to do it silently. But there are any number of valid reasons to not jump on someone's PR to review it.)
Really interesting note. That echoes thoughts I’ve had about how much automated benchmark scores really reflect production‑ready code.
For me the big takeaway is that passing doesn't automatically mean the code is maintainable, follows established patterns and conventions, or is free of the unexpected side effects that real reviewers care about.
Do these benchmarks make any sense? I tried a few local models that seem to score well on SWE-bench, but the results were pure rubbish. (For instance, MiniMax-M2.5 at 128GB from Unsloth: completely unusable.)
SWE-bench only scores the narrow task of making tests pass, which means models get good at exactly that. Real codebases have style constraints, architecture choices, and maintainability concerns that don't show up in any test suite. Not surprised at all that the PRs wouldn't get merged; you'd expect that from an eval that can't measure what reviewers actually care about.
Which quant? I see folks running lower quants complaining when they should be running a higher quant. Qwen3CoderNext is great, even at Q6. I mistakenly had it loaded for an agentic workflow and was surprised at how well it did.
What is "lower quant"? What is "higher quant"? I mean, I know what they are, but the very people you intend to reach don't know the difference between Q4_K_M and Q6_K and blog posts like [1] have nuggets like "For tests of the type ran here, there appear to be major diminishing returns past Q4".
[1] https://big-stupid-jellyfish.github.io/GFMath/pages/llm-quan...
This makes sense to me based on personal experience. LLMs will do anything to pass tests and get a working result, and they will do very weird things to get there. For fun I've tried to get one to do stuff while being purposely ambiguous about the implementation details, and sometimes the stuff it does makes me literally laugh out loud. It can write some very strange code.
But hey, the tests pass!
If I force it to use plan mode for everything and babysit it, it can work really well, but it's really just acting as a faster typist for me, which is great. But it requires an experienced dev steering it.
Yeah, this matches what we've seen too. The biggest gains we got weren't from switching models. They came from investing in better context: giving the agent a well-structured spec, relevant code samples from the repo, and explicit constraints upfront. Without that, even the best models will happily produce working but unmaintainable code. Feels like the whole SWE-bench framing misses this. Passing tests is the easy part; fitting into an existing codebase's patterns and conventions is where it actually gets hard.
Edit: Nevermind
Well, no: one of the first things it says is that reviewers were blind to human vs. AI.
They might have tried, but this would be pretty hard to achieve for real, especially for the older/worse models. For changes that do more than alter a couple of lines, LLM output can be very obvious. Stripping all comments from the changeset might go a long way toward making it more blind, but then you're missing context that you kinda need to review the code properly.
The comment you're replying to is talking about a hypothetical scenario.
In any case, the blinding didn't stop Reviewer #2 from calling out obvious AI slop. (Figure 5)
I feel like I don't have the context for this conversation. If slop is obvious as slop, I feel like we should block it.
If you look at the comment, it says what the code following it does. It doesn't matter whether a human or a machine wrote it. It is useless. It is actually worse than useless, because if someone needs to change the code, they now need to change two things. So in that sense, you just made twice the work for anyone who touches the code after you, and for what benefit?
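To illustrate the difference, here is a made-up Python example (not the comment from the paper; both function names are hypothetical). The first comment restates the expression and has to be edited whenever the code changes. The second records intent that the code cannot express on its own.

```python
def next_retry_delay_redundant(attempt: int) -> int:
    # Multiply 100 by 2 to the power of attempt and return the minimum
    # of that and 30000.
    return min(100 * 2 ** attempt, 30_000)


def next_retry_delay(attempt: int) -> int:
    # Exponential backoff in milliseconds, capped at 30 s so a long
    # outage does not push retries hours apart.
    return min(100 * 2 ** attempt, 30_000)


print(next_retry_delay(0), next_retry_delay(3), next_retry_delay(20))  # 100 800 30000
```

Changing the cap in the first version means updating the number in two places; in the second, only the rationale needs a second look.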
The point is that AI models do these kinds of things all the time. They're not really all that smart or intelligent, they just replicate patterns or boilerplate and then iterate until it sort of appears to work properly.
> appears to work
That "appears" is doing a lot of heavy lifting.
The code working isn't what's being selected for.
The code looking convincing IS what is being selected for.
That distinction is massive.