Author of this post here.
For context, I'm a solo developer building UserJot. I've been recently looking deeper into integrating AI into the product but I've been wanting to go a lot deeper than just wrapping a single API call and calling it a day.
So this blog post is mostly my experience trying to reverse engineer other AI agents and experimenting with different approaches for a bit.
Happy to answer any questions.
When you discuss caching, are you talking about caching the LLM response on your side (what I presume) or actual prompt caching (using the provider cache[0])? Curious why you'd invalidate static content?
[0]: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...
I think I need to make this a bit more clear. I was mostly referring to caching the results of tools (sub-agents) when they are pure functions. But that may be a bit too specific for the sake of this post.
i.e. you have a query that reads data that doesn't change often, so you can cache the result.
It seems very doubtful to me that every query would be literally the same (e.g. same hash), if these are plain text descriptions of the subtask.
I mean that depends on how you define the "input" for the tool. Some inputs can be very constrained, like an enum, boolean, number, etc.
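For illustration, a minimal sketch of what caching a pure-function tool could look like, keyed on its serialized input. The names here are hypothetical, not from the post, and it assumes the input objects have a stable key order:

```typescript
// Hypothetical in-memory cache for a pure-function tool: same input, same output.
const toolCache = new Map<string, unknown>();

async function cachedToolCall<T>(
  toolName: string,
  // Constrained inputs (enums, booleans, numbers) serialize deterministically.
  input: Record<string, string | number | boolean>,
  run: () => Promise<T>,
): Promise<T> {
  const key = `${toolName}:${JSON.stringify(input)}`; // assumes stable key order
  if (toolCache.has(key)) return toolCache.get(key) as T;
  const result = await run();
  toolCache.set(key, result);
  return result;
}
```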
Also, regarding your agents (primary and sub):
- Did you build your own or are you farming out to say Opencode?
- If you built your own, did you roll from scratch or use a framework? Any comments either way on this?
- How "agentic" (or constrained as the case may be) are your agents in terms of the tools you've provided them?
Not sure if I understand the question, but I'll do my best to answer.
I guess "agent"/"agentic" is too broad of a term. All of this is really an LLM that has a set of tools that may or may not be other LLMs. You don't really need a framework as long as you can make HTTP calls to OpenRouter or some other provider and handle tool calling.
I'm using the AI SDK as it plays very nicely with TypeScript and gives you a lot of interesting features like handling server-side/client-side tool calling and synchronization.
My current setup has a mix of tools, some of which are pure functions (e.g. database queries), some of which handle server-side mutations (e.g. scheduling a changelog), and some of which are supposed to run locally on the client (e.g. updating the TipTap editor).
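To give a concrete picture, here is a minimal sketch of what one such read-only tool could look like with the AI SDK's tool() helper. The schema, tool name, and data-layer call are made up for illustration, and the exact option names vary between SDK versions:

```typescript
import { tool } from "ai";
import { z } from "zod";

// Placeholder for the app's data layer (hypothetical).
declare const db: {
  tickets: { findClosed(args: { projectId: string; limit: number }): Promise<unknown[]> };
};

const getRecentTickets = tool({
  description: "Fetch recently closed tickets for the current project",
  // In AI SDK v4 this field is `parameters`; newer versions rename it.
  parameters: z.object({
    projectId: z.string(),
    limit: z.number().int().min(1).max(50).default(10),
  }),
  execute: async ({ projectId, limit }) => {
    // Pure read: no shared memory, no conversation history, just task in, result out.
    return db.tickets.findClosed({ projectId, limit });
  },
});
```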
Again, hopefully this somewhat answers the question, but happy to provide more details if needed.
Nice post! Can you share a bit more about what variety of tasks you've used agents for? Agents can mean so many different things depending on who you're talking to. A lot of the examples seem like read-only/analysis tasks. Did you also work on tasks where agent took actions and changed state? If yes, did you find any differences in the patterns that worked for those agents?
Sure! So there are both read-only and write-only agents that I'm working on. Basically there's a main agent (main LLM) that is responsible for the overall flow (currently testing GPT-5 Mini for this) and then there are the sub-agents, like I mentioned, that are defined as tools.
Hopefully this isn't against the terms here, but I posted a screenshot here of how I'm trying to build this into the changelog editor to allow users to basically go:
https://x.com/ImSh4yy/status/1951012330487079342
1. What tickets did we recently close?
2. Nice, write a changelog entry for that.
3. Add me as author, tags, and title.
4. Schedule this changelog for Monday morning.
Of course, this sounds very trivial on the surface, but it starts to get more complex when you think about how to do find and replace in the text, how to fetch tickets and analyze them, how to write the changelog entry, etc.
Hope this helps.
Neat idea!
How tightly scaffolded/harnessed/constrained is your primary agent for a given task? Are you telling it what reasoning strategy to use?
When you describe subagents, are those single-tool agents, or are they multi-tool agents with their own ability to reflect and iterate? (i.e. how many actual LLM calls does a subagent make?)
So I have a main agent that is responsible for steering the overall flow, and then there are the sub-agents that, as I mentioned, are stateless functions called by the main agent.
Now these could be anything really: API calls, pure computation, or even LLM calls.
"The “Smart Agent” Trap: I tried making agents that could “figure out” what to do. They couldn’t. Be explicit."
So what about this solution is actually agentic?
Overall, it sounds like you sat down and did a proper business process analysis and automated it.
Your subagents for sure have no autonomy and are just execution steps in a classic workflow except you happen to be calling an LLM.
Does the orchestrating agent adapt the process between invocations depending on the data and does it do so in any way more complex than a simple if then branch?
Provide a tool schema that requires deep analysis to fill out correctly. Citations and scores for everything. Examples of high quality citations. Tools that fail or produce low quality results should return instructions about how to recover or interpret the result.
Have agents with different research tools try to corroborate and peer review output from competing agents. This is just one of many collaborative or competitive patterns you can model.
Yeah, it can get quite a bit more dynamic than an if statement if you apply some creativity and clarity and conviction.
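To make the "tool schema that requires deep analysis" suggestion concrete, here is a hypothetical zod sketch; the field names are invented for illustration, not taken from the article:

```typescript
import { z } from "zod";

// Hypothetical output schema that forces the model to justify its claims.
const researchFinding = z.object({
  claim: z.string().describe("One specific, falsifiable statement"),
  citations: z
    .array(
      z.object({
        source: z.string().describe("URL or document ID the claim comes from"),
        quote: z.string().describe("Verbatim supporting excerpt"),
      }),
    )
    .min(1),
  confidence: z.number().min(0).max(1).describe("Self-assessed confidence in the claim"),
  recovery: z
    .string()
    .optional()
    .describe("If the evidence is weak or a tool failed, how should the caller recover?"),
});
```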
You're right that this isn't the "autonomous agent" fantasy that keeps getting hyped.
The agentic part here is more modest but real. The primary agent does make runtime decisions about task decomposition based on the data and calls the subagents (tools) to do the actual work.
So yeah, it's closer to "intelligent workflow orchestration." That's probably a more honest description.
I recently posted here how I’m seeing success with sub agent-based autonomous dev (not “vibe coding” as I actually review every line before I commit, but the same general idea). Different application, but I can confirm every one of the best practices described in this article, as I came to the same conclusions myself.
https://news.ycombinator.com/item?id=44893025
I agreed with much of this, but I started looking into the enterprise AI systems that large companies are making, and they use agent control via software.
So I tried it. It's much better.
Software is just better at handling discrete tasks that you understand, like mapping agent pathing. There's no point giving that to an AI to do.
The coordinating "main agent" should just call the software to manage the agents.
It works really well in comparison.
You can have the software call Claude Code via the command line, sending the prompt in. You have it create fully detailed logs of what it's doing, what it has done, and what it has created.
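A minimal sketch of what that orchestration step could look like, assuming the claude CLI's non-interactive -p (print) mode; the function and log format are hypothetical:

```typescript
import { execFile } from "node:child_process";
import { appendFile } from "node:fs/promises";
import { promisify } from "node:util";

const run = promisify(execFile);

// Hypothetical orchestration step: the coordinating software decides the prompt,
// shells out to Claude Code non-interactively, and logs prompt and result.
async function runAgentTask(prompt: string, logPath: string): Promise<string> {
  const { stdout } = await run("claude", ["-p", prompt]); // -p = print mode, no interactive session
  await appendFile(logPath, `PROMPT:\n${prompt}\nRESULT:\n${stdout}\n---\n`);
  return stdout;
}
```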
Maybe I'll change my mind, everything is moving so fast, and we're all in the dark searching around for ideas, but so far it's been working pretty well.
You do lose the in-the-middle visibility that would let you stop it.
I also have it evaluating the outputs to determine how well the agents followed their instructions. That seems key to understanding if more context adds value when comparing agents.
My favorite post in a long time. Super straightforward, confirms my own early experiences but the author has gone further than I have and I can already see how his hard-won insight is going to save me time and money. One change I’m going to make immediately is to use cheaper/faster/simpler models for 3/4 of my tasks. This will also set things up nicely for having some tasks run on local models in the future.
Am I the only one who cannot stand this terrible AI generated writing style?
These awful three sentence abominations:
"Each subagent runs in complete isolation. The primary agent handles all the orchestration. Simple." "No conversation. No “remember what we talked about.” Just task in, result out." "No ambiguity. No interpretation. Just data."
AI is good at some things, but copywriting certainly isn't one of them. Whatever the author put into the model to get this output would have been better than what the AI shat out.
These subagents look like tools
Yes they are tools.
"Subagent orchestration" is also a really quick win in Claude. You can just say "spawn a subagent to do each task in X, give it Y context".
This lets you a) run things in parallel if you want, but also b) keep the main agent's context clean, and therefore run much larger, longer running tasks without the "race against the context clock" issue.
I assume you're talking about Claude Code, right? If so, I very much agree with this. A lot of this was actually inspired by how easy it was to do in Claude Code.
I first experimented with allowing the main agent to have a "conversation" with sub-agents. For example, I created a database of messages between the main agent and the sub-agents, and allowed both to append to it. This kinda worked for a few messages but kept getting stuck on mid-tier models, such as GPT-5 Mini.
But from my understanding, their implementation is also similar to the stateless functions I described (happy to be proven wrong). Sub-agents don't communicate back much aside from the final result, and they don't have a conversation history.
The live updates you see are mostly the application layer updating the UI, which initially confused me.
I am doing similar experimentation with Claude Code. I believe you are correct. The primary agent only sees the generated report, nothing more.
Love how you experimented, you are a creative thinker.
As someone totally outside of this space, how do I build an agent? Are there any languages or libraries that are sort of the de facto standard?
Trivial with Claude Code: it's an .md file in the agents directory. It has a format; you follow it and have the AI create your first agent from your ideas of what the agent should do and concern itself with.
You write it in plain English text and put it in .claude/agents. That is all.
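For a rough idea, a Claude Code subagent file looks something like the sketch below (the name, description, tool list, and instructions are all illustrative; check the current docs for the exact frontmatter fields):

```markdown
---
name: changelog-writer
description: Drafts changelog entries from a list of closed tickets. Use proactively when the user asks for a changelog.
tools: Read, Grep
---
You write concise, user-facing changelog entries.
Only describe shipped changes; never invent features.
Return the entry as markdown with a title, tags, and a short summary per ticket.
```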
Do you believe that creating sub agents is a violation of the bitter lesson or is simply a way to add more context?
Sub agents are about providing less context. Not more.
Sub agents are about providing targeted, specific information for the agent's task, instead of having context around a billion other irrelevant topics.
The Database agent does not care at all about the instructions for your Agent Creator Agent. That is called negative context, or poison context, or whatever you want to call it.
So it's about targeted, specific, narrow instructions for a set of tasks.
What bitter lesson?
When you say "same output" in
> Every subagent call should be like calling a pure function. Same input, same output. No shared memory. No conversation history. No state.
How are you setting temperature, top k, top p, etc?
So far I've been hardcoding these into the API calls.
Sure, but to clarify: you are probably setting temperature close to 0 in order to get as consistent output as possible for a given input? Have you made any changes to top k and/or top p that you have found make agent output more consistent/deterministic?
Yes, temp is close to 0 for most models. For top k and top p, I've been using the default values set in OpenRouter.
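For example, a minimal sketch of what this can look like with the AI SDK pointed at OpenRouter; the model slug and parameter values are illustrative, and top-k is not exposed by every provider:

```typescript
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

// OpenAI-compatible client pointed at OpenRouter (API key assumed in env).
const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("openai/gpt-5-mini"), // hypothetical model slug
  temperature: 0, // push toward repeatable output for the same input
  topP: 1,        // leave nucleus sampling at the provider default
  prompt: "Summarize the closed tickets listed below...",
});
```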
These are the same categories of coordination I've been looking at all day, trying to find the sweet spot in how complex the orchestration can be. I tried adding some context, but the agents got into a cycle of editing the code that another was testing, and things like that.
I know my next step should be to give the agents a good git pattern, but I'm having so much fun just finding the team organization that delivers better stuff. I have them consult with each other on tech choices, and they have picked what I would have picked.
The consensus protocol for choices is one I really liked, and that will maybe do more self-correction.
I've been asking them to illustrate their flow of work and explain their decisions; I need to go back and see if that's the case. It would probably be easier once I get my git experiment flow down.
The future is tooling for these. If we can team them up enough that consensus approaches something 'safe', we can give them tools to run in dry/live mode, have a human validate the process for a time, and then, once there's enough feedback, move on to the next thing needing fixing.
I have a lot of apps with Cobra CLI tooling that resides next to the server code. Being able to pump our docs into an MCP server for tool execution is giving me so many ideas.
I found this post very helpful to getting started with agentic systems, what other posts do others recommend?
OG post - https://www.anthropic.com/engineering/building-effective-age...
Structured generation being the magic that makes agents good is true. The fact that the author puts this front and center implies to me that they actually do build working AI agents.
Super practical, no-bullshit write up clearly coming from the trenches. Worth the read.
"no-bullshit write up" about Agentic AI ... LOL
I've never seen so many different names at once for "LLM chat completion API call"