I think the main thing a lot of these articles miss is that it's not just your AGENTS.md that can give you a model upgrade, or the inverse.
Everything your harness looks at can do this. The skills in your code base, the commands you've added, the memories that were auto-created: they all work towards improving or completely destroying your productivity.
And most of it is hidden. You hear people talk about this all the time where they'll be like, Oh, I use GSD or I use Superpowers and my results have gotten worse.
Your results might have gotten worse precisely because you use them (along with your memories and other skills).
Yes, I agree. I got myself a Strix Halo system and a GLM coding plan in order to explore this. The opaqueness of the projects out there makes it hard to know what is helping and what isn't.
Clearly, the harness, together with LLMs has utility. But yet, I can
I'd guess the same has always been true for READMEs / human dev docs. Of course it doesn't transfer directly but still feels incredible to be in an age where we can measure such (previously) theoretical things with synthetic programmers.
Yeah, isn't this obvious? Bad docs create triple work: you do it wrong (1); you figure out it's not working because the doc is wrong (2); you do it the right way (3). Between 2 and 3 is figuring out what the right way is, which a good doc ideally shortcuts.
But obviously if you tell somebody "make a boiled egg. To boil an egg you have to crack it into the pan first", that's a lot worse than "make a boiled egg." Especially when you have an infinitely trusting, zero-common-sense executor like an agentic model.
the harness (skills, context, memory, state and past decisions, implementation history, etc.) should live in your repo so that you can freely switch IDEs/CLIs and models. full portability. don't let OpenAI or Anthropic own your work. https://recursive-mode.dev/introduction
Most of my projects are without an AGENTS.md/CLAUDE.md at the moment. I've found that if the project itself is in good shape - clear docs, comprehensive tests - you don't need to tell the coding agent much in order for it to be productive.
I start a whole lot of my sessions with "Run tests with 'uv run pytest'" and once they've done that they get the idea that they should write tests in a style that fits the existing ones.
That's wild. I couldn't live without my AGENTS to make sure it keeps to the coding styles I prefer. Especially needed on greenfield projects.
A lot of my projects are built with platform versions from the last 12 months, which had little or no presence in the LLM's core training data, so agents tend to avoid the latest language features unless you prescribe them in AGENTS.
Most of my projects start from a template that has just enough details (like a tests/ folder and a pyproject.toml adding pytest as a dev dependency) for my preferences to start being picked up.
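For context, the pytest part of that kind of template is tiny; a sketch (illustrative names, not the actual template, and assuming a recent uv) might look like:
```toml
# pyproject.toml (sketch)
[project]
name = "my-project"
version = "0.1.0"

[dependency-groups]
dev = ["pytest"]   # `uv run pytest` then works out of the box, and agents copy the pattern
```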
Wouldn't an AGENTS.md containing the line "When you make changes, they should be tested. Run tests with `uv run pytest`" have basically the same effect and save you some typing? I've never used AGENTS.md myself, but I'd like to look into it because I find my agent rediscovering the same things through a bunch of file reads very frequently in my current project.
It would, but then I'd have to copy that file into 100+ repos.
I don't want it in a single global config because I like to stay with the defaults to avoid confusing myself, especially when I'm writing about how coding agents work for other people.
That’s like 30 seconds of work to build a script to do this.
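For what it's worth, a rough sketch of that script (it assumes the repos are all checked out under one parent directory and that the canonical file lives in a dotfiles checkout; both are assumptions on my part):
```python
#!/usr/bin/env python3
# Copy a shared AGENTS.md into every git checkout under ~/repos (hypothetical layout).
import shutil
from pathlib import Path

SOURCE = Path.home() / "dotfiles" / "AGENTS.md"   # assumed location of the canonical file
REPOS_DIR = Path.home() / "repos"                 # assumed parent directory of the checkouts

for repo in sorted(REPOS_DIR.iterdir()):
    if (repo / ".git").is_dir():                  # only touch actual git checkouts
        shutil.copy(SOURCE, repo / "AGENTS.md")
        print(f"updated {repo / 'AGENTS.md'}")
```
Committing and pushing 100+ repos is the part that takes longer than 30 seconds, though.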
Simon, I really enjoy your live coding sessions. If you do another one, would you mind showing this part as well? It would be extremely educative.
I haven't been able to do without an `.MD`: no agent (CC, Codex, OpenHands) was smart enough to figure out my layout unguided. So much so that, a few weeks ago, I had Claude write the guideline below to document the way I like to lay out my tests and modules. I make extensive use of uv workspaces and don't ship tests to production deployments:
```
- uv Workspace Architecture (`uv` v0.11.8+, `packages/` members):
  - **Build tool:** Exclusively `uv_build`. Never `hatchling` or any other build backend.
    Pin as `uv_build>=0.6` in every `[build-system]` block.
  - **Naming convention — flat, distinct package names (NOT a shared namespace):**
    Each workspace member uses a *flat* Python package name that is unique across the workspace.
    The `uv_build` backend auto-discovers the module by converting the project name (hyphens → underscores):
      `base-constants` → `src/base_constants/__init__.py`
      `base-domain` → `src/base_domain/__init__.py`
      `base-geometry` → `src/base_geometry/__init__.py`
      etc.
    No `[tool.uv.build-backend] module-name` override is needed because the project name already maps directly.
  - **Why NOT a `base.*` namespace package:**
    `uv_build` cannot support PEP 420-style namespace packages across workspace members.
    It maps each project name to exactly one module root; only one member can own `base/__init__.py`.
    Attempting `module-name = "base.constants"` treats the dotted name as a nested directory,
    not a namespace — it looks for `src/base/constants/__init__.py`. Confirmed by binary string
    inspection of the `uv` binary. NEVER attempt namespace packages with this build backend.
  - **Import style (locked, never change):**
    GOOD: `from base_constants import CONSTANT_A`
    BAD:  `from base.constants import CONSTANT_A` (namespace layout — abandoned)
  - **Tests member:** `package = false` in `[tool.uv]`, no `[build-system]` block at all.
    Tests are never shipped in production; the member exists solely to isolate test dependencies.
  - **Microservice split story:** When a member needs to become a standalone repository,
    only the `[tool.uv.sources]` entry in the consuming `pyproject.toml` changes
    (workspace source → PyPI or VCS source). The package code itself is unchanged.
  - *Future-phase features: stub, NEVER implement.* When a feature is explicitly scoped to a later phase (e.g., "Phase 4"), write a one-line stub that raises `NotImplementedError` plus a docstring describing the Phase 4 contract. A full implementation spends tokens on untested code that may never ship in its current form. Exception: if the full implementation is ≤ 5 trivial lines and directly validates the current phase's math, implement it outright.
```
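To make that concrete, here is roughly what one member's `pyproject.toml` ends up looking like under that convention (a sketch; the `base-constants` name comes from the guideline above, the version is made up). The `tests` member differs only in having `package = false` under `[tool.uv]` and no `[build-system]` table at all:
```toml
# packages/base-constants/pyproject.toml (sketch)
[project]
name = "base-constants"        # uv_build auto-discovers src/base_constants/__init__.py
version = "0.1.0"

[build-system]
requires = ["uv_build>=0.6"]   # pinned exactly as the guideline demands
build-backend = "uv_build"
```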
Similarly, I find it annoying that every agent uses f-strings inside logging calls. Since I added this, that hasn't been a problem:
```
- NEVER use f-strings or .format() inside logging calls. This forces the string to be
  interpolated immediately, even if the log level (like DEBUG) is currently disabled.
  If you notice this in existing code, FLAG IT immediately!
  By passing the string and the variables separately, you allow the logging library to
  perform lazy interpolation only when the message is actually being written to the logs.
  F-strings also increase the cardinality of structured logging, rendering observability useless!
```
BAD:
```python
# The f-string is evaluated BEFORE the logging level is checked.
# This:
# - wastes CPU cycles if the log level is higher than INFO
# - increases the cardinality of structured logging, rendering observability useless
log.info(f"denominator {denominator} is negative!")
```
GOOD:
```python
# The ONLY right way: the logging module only merges the variable into the string if
# the INFO level is actually enabled.
log.info("denominator %s is negative!", denominator)
```
Note: Using this "GOOD" pattern also helps with structured logging. Tools like Sentry or ELK can group logs by the template string ("denominator %s is negative!") rather than seeing every unique f-string as a completely different error type.
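If you want to see the difference for yourself, here's a tiny self-contained demo (names are made up) showing that the f-string is formatted even when the logger throws the message away, while the `%s` form never formats it:
```python
import logging

class Noisy:
    """Prints whenever it is converted to a string, so we can see when formatting happens."""
    def __str__(self):
        print("  __str__ was called")
        return "denominator"

logging.basicConfig(level=logging.WARNING)  # INFO messages will be discarded
log = logging.getLogger("demo")
value = Noisy()

print("f-string call:")
log.info(f"{value} is negative!")           # __str__ runs even though nothing is logged

print("lazy %s call:")
log.info("%s is negative!", value)          # __str__ never runs; the record is dropped first
```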
I suspect the harness (of which AGENTS.md, skills, and similar things are part) should be abstracted away for better overall performance. This article doesn't really go into detail about model preferences, but some other benchmarks show that different models have different preferences for how to use certain tools (probably related to their post-training material), and it should really be managed invisibly to me as the end user.
Also curious how well LLMs can self-reflect in a loop, in terms of, here's how the previous iteration went, here's what didn't go well, here's feedback from the human, how do I modify the docs I use in a way that I know I'll do better next time.
I know you can somewhat hillclimb via DSPy but that's hard to generalize.
Claude self-reflects and updates based on feedback pretty well these days, but seems to lean on memory more than updating CLAUDE.md. I don't know how well it adheres to memory, but it seems to work sometimes. I don't like how the memory is stored outside of the project directory though.
Hmm, I would hope that's for better quality (if there are somehow model-specific optimizations) or for search/retrieval methods down the line. But I can't help but feel like the labs/providers might try to lock in customers by making things non-portable/opaque.
Oh yeah, it definitely feels like a scramble to add lock-in features.
It's cool that they did some measurements, but unfortunately there's not much to learn from the article unless you're using really outdated files that you wrote by hand. The agent should know how to write a good file.
For existing files, the agent will carry on a bad structure unless you specifically ask it to refactor and think about what's actually helpful.
In general, it should be a lean file that tells the agent how to work with the project (short description, table of commands, index of key docs, supporting infra, handful of high-level rules and conventions that apply to everything). Occasionally ask the agent to review and optimize the file, particularly after model upgrades.
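For reference, the kind of skeleton I mean (purely illustrative contents; adapt the commands and docs to your project):
```markdown
# my-service

One-paragraph description of what the service does and how the repo is laid out.

## Commands
| Task      | Command               |
|-----------|-----------------------|
| Run tests | `uv run pytest`       |
| Lint      | `uv run ruff check .` |

## Key docs
- docs/architecture.md: how the pieces fit together
- docs/deploy.md: supporting infra and deployment notes

## Conventions
- Every change ships with tests; match the style of existing tests.
- Keep this file lean; put task-specific detail into skills or the docs above.
```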
Every time I've asked a model to write its AGENTS/CLAUDE file it's been pretty bad, actually. Are you sure writing these files is actually in distribution right now?
I don't have a ton of experience with this, but every attempt I've made to quickly get an LLM to one-shot an AGENTS file has been too verbose in all the wrong areas. I'm not convinced LLMs are actually good at summarizing anything complex. Maybe some "blessed" prompts will bubble up in time that change my mind.
LLMs don't one-shot anything very well IMHO, but if you make several passes and work through each section it should end up ok. The key word I use is "optimize", but I also press it on what's actually effective. The goal is a small file, so just ruthlessly cut anything that doesn't have high value across the entire project.
Again, the goal is to let the agent know how to work with the project at a high level, not much else. Skills and docs cover the rest.
Interesting that they had a 100% read rate of AGENTS.md. In my test repo, AGENTS.md files lower down in the tree were occasionally missed by VS Code Copilot. That put me off putting too much effort into nesting AGENTS.md files within the repo, and I've been focusing on agent skills instead.
This is more of a harness thing, right? It's the harness signaling the presence of AGENTS/CLAUDE.md, or forcing a read of it.
Yes it is; the main feature that differentiates AGENTS.md from other files is that it's usually loaded into the context automatically.
I often start a new session with tagging AGENTS.md in the prompt just to make sure because I've had the same issue happen a couple of times.
The 100% read rate is very harness/CLI dependent. The "original" idea for AGENTS.md was: the AGENTS.md file will be included as-is in the system prompt by the harness, so the agent doesn't have any choice in whether it gets read. For example, this is a shortened form of what opencode sends as a system prompt for a new session when interacting with a provider (displayed as YAML and trimmed for readability):
```yaml
model: foo-model
max_tokens: 32000
top_p: 1
messages:
  - role: system
    content: |
      You are opencode, an interactive CLI tool that helps users with software engineering tasks.
      Use the instructions below and the tools available to you
      # ... snip ...
      Here is some useful information about the environment you are running in:
      <env>
        Working directory: /home/user/dir
        Workspace root folder: /
        Is directory a git repo: no
        Platform: linux
        Today's date: Tue Apr 28 2026
      </env>
      Skills provide specialized instructions and workflows for specific tasks.
      Use the skill tool to load a skill when a task matches its description.
      No skills are currently available.
      Instructions from: /home/user/dir/AGENTS.md
      # Overview
      This directory holds the entirety of the code for the <dayjob> company. All code lives in Github
      under the `<dayjob>` organization, and beneath that Organization is a wide-and-flat set of all
      the Git repositories of all source code at <dayjob>. That Github repo structure is replicated in
      this directory via `ghorg`.
```
My AGENTS.md file contents start at the "# Overview" line. Notice that the harness just unceremoniously dumps the AGENTS.md file into the exact same text stream as the system prompt, barely contextualizing that, starting now, this text is from AGENTS.md and not from the harness.
If you want AGENTS.md to work (likewise if you want skills or anything else to work), you have to know how the harness is handling them and feeding them to the LLM, because no LLM will reliably go looking for them on its own.
I made an attempt at a solution: https://ktext.dev.
Basically a structured context file, that can be used to generate AGENTS.md, and also can be validated and scored.
I think it could help with this problem.
IME, multiple (good) AGENTS.md files are even better. I mostly see them only at the root of a repository, but I spread more of them out into important subdirectories. They act as a table of contents and spark notes. Putting more focused AGENTS.md files in important places has been even more helpful.
Bonus points if you can force them into context without needing the agent to make a tool call, based on touching the files or systems near them. (my homegrown agent has this feature)
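The mechanism can be pretty simple: walk up from the touched file to the repo root and inject every AGENTS.md found along the way, most specific first. A rough sketch of the idea (not the actual implementation; the `context.prepend` call at the end is hypothetical):
```python
from pathlib import Path

def collect_agents_md(touched_file: str, repo_root: str) -> list[str]:
    """Gather every AGENTS.md between a touched file and the repo root, nearest first,
    so the harness can inject them into context without the agent making a tool call."""
    root = Path(repo_root).resolve()
    directory = Path(touched_file).resolve().parent
    found = []
    while True:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            found.append(candidate.read_text())
        if directory == root or directory == directory.parent:
            break
        directory = directory.parent
    return found

# e.g. when the agent edits src/payments/api.py:
# for block in collect_agents_md("src/payments/api.py", "."):
#     context.prepend(block)   # hypothetical harness call
```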
Will people ever get tired of writing AI how-to slop?
The models are so terrible you have to think ahead of them so they don't make mistakes. This is not an upgrade. This is coping behavior.
That's like saying "the programmers are so terrible you have to think ahead of them so they don't make mistakes".
eh, good programmers are goal-oriented; today's SOTA models still mostly need step-by-step guidance, so there's a gap still.
the AGENTS.md pieces that pin specific tool-call shapes or force chain-of-thought before action are coping that ages out, same lifecycle as the retry-with-different-prompt loops or chain-of-thought prompts most stacks shipped in 2024 to compensate for brittle instruction-following.
not quite there yet, but it's nice to see them getting shorter and shorter with each model release, until all the basics are peeled away by the march of progress and one day only the invariants are left.
No, it's not actually anything like that whatsoever. Programmers are objectively, infinitely more capable than LLMs. Stop anthropomorphizing algorithms.
I would be very curious which programmers you have in mind when comparing to LLMs. Like the median programmer, or like the top 10%?
I feel like we've passed the point where an average-effort Claude Code / Cursor / Codex initialized (like basic docs, skills) project would produce a better product (not just code) than if you hired a median programmer to work on that project.
lol no. LLMs are infinitely more capable than programmers.
People really do think too highly of themselves.
This is like saying programmers are so terrible that you have to think ahead of them and document your code/project so devs don't make mistakes, and that anyone who thinks README files are a good thing is coping.