I recently downloaded the source code for Chocolate Doom [0], and even though a ton of human labor has been put into making it cross-platform and easy to build (and that work definitely deserves to be commended!), I still couldn't build it immediately on my M1 MacBook.
Asking Claude Code to build it - literally prompting it "fix whatever needs to be fixed until you get the binary to run" - and waiting ~20 minutes was the best investment of non-time I could make... It definitely felt magical. Claude would tweak headers, `make` it, try to run it, and apply more fixes based on the errors it got back.
Now that I think of it, I regret not opening an issue/PR with its findings...!
(((I then went on to make more vibe-changes to the Doom code and made a video out of those which went semi-viral, which I will now unashamedly plug [1])))
[0] https://github.com/chocolate-doom/chocolate-doom
[1] https://www.youtube.com/watch?v=LcnBXtttF28
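The tweak, `make`, run, fix cycle described above can be sketched as a small retry loop. This is only an illustration under stated assumptions: `build_until_green` and the `apply_fix` callback are hypothetical names, and the callback stands in for whoever (a human, a script, an agent) reads the errors and edits the tree.

```python
import subprocess
from typing import Callable, Sequence


def build_until_green(
    build_cmd: Sequence[str],
    apply_fix: Callable[[str], bool],
    max_rounds: int = 10,
) -> bool:
    """Run build_cmd repeatedly, handing each failure's output to apply_fix.

    apply_fix is a hypothetical callback: it receives the combined
    stdout/stderr of the failed build and returns True if it changed
    something worth retrying, False to give up.
    """
    for _ in range(max_rounds):
        result = subprocess.run(build_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # the build succeeded
        if not apply_fix(result.stdout + result.stderr):
            return False  # nothing more to try
    return False  # ran out of rounds
```

Something like `build_until_green(["make"], my_fixer)` would be the intended use, where `my_fixer` is whatever does the actual editing. The `max_rounds` cap is there so a fixer that never converges cannot loop forever.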
So essentially, you are redundant now and celebrate it.
I celebrate that I did not have to spend cycles dealing with a non-interesting, non-intellectually-challenging issue aka figuring out the incantations to make a build system happy.
I'm also celebrating (although I forgot to do this - my bad!) that this automated discovery (i.e. of how to fix the build system for machines such as mine) could have been brought back to the Chocolate Doom community, and made the software better for everyone.
And finally, I'm also celebrating that this allowed my (if I may speak so boldly) creativity to express itself by helping me quickly bring a funny idea to life and share it, hopefully entertaining the world/making at least one person laugh/chuckle.
I don't see how any of this makes me redundant though. Efficient? Lazy? Both? Neither? But not redundant. I think! :-)
Naturally, the primary source of purpose in life: Making Chocolate Doom compile.
Unless you are selling a service to compile things for people I'm not sure who is being made redundant here.
Precisely
It's like saying the chisel made the carpenter redundant. AI still needs an operator and then more people to actually make the output production ready.
> Our toughest challenges include cross-compiling to Windows or ARM64 and resurrecting 22-year-old source code from 2003 on modern systems. Some agents needed 135 commands and 15 minutes just to produce a single working binary.
I found that "just" there to be so funny in terms of how far the goal posts moved over these last few years (as TFA does mention). I personally am certain that it would have taken me significantly longer than that to do it myself.
Given the amount of time I've spent wrestling toolchain unpleasantness, particularly for old or embedded systems, I will happily go take a fifteen-minute coffee break while the bot does it for me.
Of course, I will probably do this with OpenAI's option, not $20 of Anthropic API credits.
15 minutes?
And here's me, after 4 straight days of wrangling an obscure cross-compilation toolchain to resurrect some ill-fated piece of software from year 2011 in a modern embedded environment.
Letting an agent figure out how to compile old projects is magical. What used to be multiple days of slog is now “compile this, make changes and download tools as needed” with 10 mins of git review to make sure it didn’t do anything stupid.
man I dunno, I was expecting some magic but the tasks seem to boil down to untar, configure with some flags, make install
it does seem the machine is faster than me since I would have to spend a minute to copy each of the --disable-whatever flags for curl
it's somewhat cool to see a computer can do the same half-assed process I do of seeing what linker failures happen and googling for the missing lib flag
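For what it's worth, the "see which linker failures happen, then look up the missing lib flag" routine can be sketched roughly as below. It assumes GNU ld style messages (``undefined reference to `symbol'``), and the prefix-to-flag table is a small hand-written illustration, not an authoritative mapping.

```python
import re

# Matches GNU ld style messages such as: undefined reference to `SSL_new'
UNDEFINED_REF = re.compile(r"undefined reference to [`'](?P<symbol>[^`']+)'")

# Illustrative, hand-written hints only; a real lookup would need far more.
PREFIX_HINTS = {
    "SSL_": "-lssl",
    "EVP_": "-lcrypto",
    "inflate": "-lz",
    "deflate": "-lz",
    "pthread_": "-lpthread",
}


def missing_symbols(linker_output: str) -> list[str]:
    """Return the undefined symbols in first-seen order, without repeats."""
    seen: list[str] = []
    for match in UNDEFINED_REF.finditer(linker_output):
        symbol = match.group("symbol")
        if symbol not in seen:
            seen.append(symbol)
    return seen


def suggest_flags(symbols: list[str]) -> list[str]:
    """Guess linker flags from symbol prefixes using PREFIX_HINTS."""
    flags: list[str] = []
    for symbol in symbols:
        for prefix, flag in PREFIX_HINTS.items():
            if symbol.startswith(prefix) and flag not in flags:
                flags.append(flag)
    return flags
```

Symbols that match no prefix are simply not suggested for, which is where the googling comes back in.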
Author here.
So far in this benchmark we based the tasks on a couple of open-source projects (like curl, jq, GNU Coreutils).
Even on those "simple" projects we managed to make the tasks difficult - Claude Opus 4.1 was the only one to correctly cross-compile curl for arm64 (+ make it statically-linked) [1].
In the future we'd like to test it with projects like FFmpeg or chromium - those should be much more difficult.
[1] https://www.compilebench.com/curl-ssl-arm64-static/
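As a rough illustration of what "arm64 and statically linked" can mean for an output binary, here is a minimal sketch that reads the ELF header directly. It is not the benchmark's actual checker. It assumes a 64-bit little-endian ELF file and treats "no `PT_INTERP` segment" as a heuristic for static linking; `inspect_elf` is a hypothetical helper name.

```python
import struct

EM_AARCH64 = 183  # e_machine value for 64-bit ARM
PT_INTERP = 3     # program header type that names a dynamic loader

# ELF64 file header layout, little-endian, 64 bytes in total.
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def inspect_elf(data: bytes) -> dict:
    """Report whether an ELF image targets arm64 and requests an interpreter."""
    if len(data) < ELF64_HEADER.size or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    fields = ELF64_HEADER.unpack_from(data)
    e_machine, e_phoff, e_phentsize, e_phnum = (
        fields[2], fields[5], fields[9], fields[10],
    )
    # p_type is the first 4 bytes of every program header entry.
    segment_types = [
        struct.unpack_from("<I", data, e_phoff + i * e_phentsize)[0]
        for i in range(e_phnum)
    ]
    return {
        "is_arm64": e_machine == EM_AARCH64,
        "has_interpreter": PT_INTERP in segment_types,
    }
```

A binary passing both checks would report `is_arm64` as true and `has_interpreter` as false. A stricter check would also try to run it, for example under an emulator.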
A long time ago, I did a project where I downloaded a year's worth of nightly builds for Thunderbird so that I could collect nightly code coverage information. Over the course of doing so, I discovered that there was one dependency (pango, I think?) such that no version could support the entire year's worth of source--the newer version didn't work with the older builds, and the older version didn't work with the newer builds.
Come to think of it, in terms of trying to get old code building, the CVS days of Firefox should be interesting... because the first command in that build step is "download the source code" and that CVS server isn't running anymore. And some of the components are downloaded from a CVS tag rather than trunk, and the converted CVS repositories I'm aware of all only converted the trunk and none of the branches or tags.
You didn't make the tasks difficult, you made them easier.
The entire coreutils is reduced to one utility (sha1sum) and the test doesn't even try to feed a real file to it (just a stdin string)[0]. Same goes for the jq task: there isn't even a JSON file fed to it, and what's being verified[1] is barely a calculator.
These projects ship with "make check"; please tell the AI to use it.
[0] https://github.com/QuesmaOrg/CompileBench/blob/86d9aeda88a16...
[1] https://github.com/QuesmaOrg/CompileBench/blob/86d9aeda88a16...
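A check that feeds a real file to the built utility, as suggested above, could look something like this sketch. It assumes the tool prints the digest as the first whitespace-separated token, as `sha1sum` normally does. `sha1sum_matches_hashlib` is a hypothetical name, and the command passed in is whatever path the build actually produced.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path
from typing import Sequence


def sha1sum_matches_hashlib(command: Sequence[str], payload: bytes) -> bool:
    """Write payload to a temp file, hash it with command, compare to hashlib.

    command is the built tool's argv without the file name, which is
    appended here. The digest is assumed to be the first output token.
    """
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.bin"
        sample.write_bytes(payload)
        result = subprocess.run(
            [*command, str(sample)],
            capture_output=True,
            text=True,
            check=True,
        )
    reported = result.stdout.split()[0]
    return reported == hashlib.sha1(payload).hexdigest()
```

Running it over a few payloads (empty, small, several megabytes) would exercise more of the binary than a single stdin string, though it is still no substitute for the project's own test suite.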
For the _reviving 20 year old code_ type tasks, are the tested outcomes things we'd expect to be in the public domain? For example, in the way the 'SWEBenchVerified' tests are poisoned tests, because the LLMs are able to look up bug fixes in the project git repository.
> because the LLMs are able to look up bug fixes in the project git repository
That's not the (only) problem: Even if you take the internet away, we know/assume that all LLMs are heavily trained on public GitHub repositories. Therefore, they know/remember details of the code and organization in a way they can't for your private (or new, past knowledge cut-off date) code.
Excellent benchmark. May I suggest an extension: "port any pre-uv Python ML codebase to uv so that it can actually be reliably reproduced"?
Here the tricky part is making tests that check it works correctly.
I did this upgrade a few times, and it works like a charm for simple stuff (e.g. removing requirements.txt and adding a proper pyproject.toml).
Even in Claude Code, it takes some prompting and a CLAUDE.md so that it consistently runs uv, rather than sometimes using `uv run python` and other times `python3 -m`, and then being surprised that some dependency is not available.
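For the simple case mentioned above (dropping requirements.txt in favour of a pyproject.toml), the mechanical part can be sketched as follows. This is a minimal illustration, not a full converter: it assumes plain requirement lines, skips pip option lines such as `-r other.txt`, and the project name and version are placeholders the caller has to supply.

```python
import json


def requirements_to_pyproject(
    requirements_text: str, name: str, version: str = "0.1.0"
) -> str:
    """Turn requirements.txt content into a minimal [project] table."""
    deps = []
    for raw in requirements_text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # blank lines, comment lines, pip options
        deps.append(line)
    lines = [
        "[project]",
        f"name = {json.dumps(name)}",
        f"version = {json.dumps(version)}",
        "dependencies = [",
    ]
    # json.dumps escapes quotes and backslashes the same way TOML basic strings do.
    lines += [f"    {json.dumps(dep)}," for dep in deps]
    lines.append("]")
    return "\n".join(lines) + "\n"
```

The hard part remains what the comment says: confirming that the ported environment actually reproduces the original behaviour.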
I am doing this now. What are your instructions in CLAUDE.md? thx
This is a really good benchmark. So much time is spent on these messy types of tasks and no one really likes doing it.
Now if it could fix React Native builds after package upgrades I'd be impressed...
Though this is more "LLM uses a variety of open source tools and compilers to compile source," I do wonder about whether there will eventually be a role for transformers in compiling code.
I've mentioned this before, but "sufficiently smart compiler" would be the dream here. Start with high level code or pseudo code, end up with something optimized.
There's been a decent chunk of research in this direction over the years. Michael O'Boyle is pretty active as a researcher in the space, if you're looking for stuff to read: https://www.dcs.ed.ac.uk/home/mob/
For C projects, the task should be passing the full test suite with at least address-sanitizer enabled. Amusing how some would discourage fellow humans from using a programming language because of its unsafeness or undefined behavior, yet AI doing unaudited source modification in the same language is encouraged.
The libs in the bench don’t really have any external deps. It will be much more interesting to see the results with ffmpeg, Qt, etc. The original source releases from any repo here would also be great candidates: https://github.com/id-software
I hadn’t thought of that use case. Say for example you find 1990’s Clipper code and want to give it a try on a modern Linux. Thanks
I’ve been doing this a lot! AI seems to really excel at setting up compiler boilerplate/minor modifications for new arch. I made a simple cpu information utility work on HP PA-RISC and Sparc64 :)
Curious about the ultimate benchmark - can AI compile Doom on an arbitrary device?
that, & how well does it cope with Perl?
Claude is good enough at Perl with lots of hand-holding and reiterations, in my experience.
I have tried to get Claude to compile arbitrary C++ projects with Emscripten, and its track record is about as good as mine.
If you asked me to do this I would want clarification on "cross-compile", "arm64" and "statically".
LGTM! I'm sure it comes with a correctness proof, too!
The newer blog posts appear to scan forums like this one for objections ("AI" does not work for legacy code bases) and then create custom "benchmarks" for their sales people to point to if they encounter these objections.