If this approach turns out to be valuable, it's unlikely that it has anything to do with having multiple actual agents, but rather that it's valuable to have two configurations (system prompt, model, temperature, context pruning, toolset, etc.) inside the same agent being swapped back and forth.
I’m curious whether anyone has measured this systematically. Right now most of the evidence for multi-agent setups still feels anecdotal.
And expensive, exactly the way a pay-per-use product would push its customers…
“It’s not working well enough!” We tell them. They respond with “Have you tried using it more?”
Completely with you on this! But then we need to define the criteria for comparison. Might not be that easy, unfortunately.
The PLAN.md question is the one worth pulling on. Once the plan lives in git or the PR it's already downstream of intent and whoever defined what to build has already handed off. The harder problem is giving agents access to the original intent, not just the implementation plan derived from it. When there's drift between what was planned and what got built, a git-resident PLAN.md makes it hard to trace back to why the decision was made in the first place.
You can also create a skill for reviewing (which calls gemini/codex as a command line tool) and set instructions on how and when to use. Very flexible.
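A minimal sketch of what such a skill might shell out to. The `codex exec` and `gemini -p` entry points are assumptions here; check the non-interactive invocation of whichever CLI versions you have installed.

```python
import subprocess

def build_review_command(diff_path: str, tool: str = "codex") -> list[str]:
    """Build the CLI invocation a review skill might run.
    The subcommands/flags are assumed -- verify against your installed CLIs."""
    prompt = f"Review the diff in {diff_path} for bugs, missing tests, and edge cases."
    if tool == "codex":
        return ["codex", "exec", prompt]
    return ["gemini", "-p", prompt]

def run_review(diff_path: str, tool: str = "codex") -> str:
    # Shell out to the reviewer CLI and return its stdout verbatim.
    result = subprocess.run(
        build_review_command(diff_path, tool),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

The skill's instructions then just say when to call it (e.g. after each "task complete" claim), which is where the flexibility comes from.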
Nice - I do something similar in a semi manual way.
I do find Codex very good at reviewing work marked as completed by Claude, especially when I get Claude to write up its work with a why, where & how doc.
It’s very rare Claude has fully completed the task successfully and Codex doesn’t find issues.
I created the first version of loop after getting tired of doing this manually!
I’m going to take a look today!
Claude is also good at that. I made a habit of asking "are you sure?" after a complex task. It usually says it overlooked something.
I find both to be true. I use Claude for most of the implementation, and Codex always catches mistakes. Always. But both of them benefit from being asked if they’re sure they did everything.
The vibes are great. But there’s a need for more science on this multi agent thing.
I agree! Right now it is leveraging the Codex App Server, which is open-source and very well implemented, but using Claude Code Channels is probably a bit hacky.
The good thing is that it establishes a direct connection so it's already much better than having one agent spawn the other and wait for its output, or read/write to a shared .md file -- but it would be cool to make it work for all agent harnesses.
Open to ideas! The repo is open-source.
This one: https://github.com/openai/codex/tree/main/codex-rs/app-serve...
there’s a need for more science on this multi agent thing
I have been trying a similar setup since last week using https://rjcorwin.github.io/cook/
Oh, that's cool!
I think the A2A space is wide open. Great to see this approach using App Server and Channels. I tried building something similar (at a high level) for a more B2C use case for OpenClaw users: https://github.com/agentlink-dev/agentlink Currently I think the major Agents have not fully owned the "wake the Agent" use case. Regardless, this is a very cool approach. All the best.
I prefer claude for generation / creativity, codex for bull-headed, accurate complaining and audit. Very rarely claude just doesn't "get it" and it makes sense to have codex direct edit. But generally I think it's happiest and best used complaining.
This is interesting for code, but I'm curious about agent-to-agent coordination for ops tasks — like one agent detecting a database anomaly and another auto-remediating it
I think a lot of people/companies are integrating workflows like that, it's just separate from the point of agent pair coding.
The interesting thing here is agents working together to be better at a single task, not agents integrated in a workflow. There's a lot of opportunity in "if this then that" scenarios that have nothing to do with two agents communicating on one single element of a problem; it's just Agent detect -> agent solve (-> Agent review? Agent deploy? Etc.)
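The detect -> solve -> review chaining above is just sequential hand-off, as this tiny sketch shows. The stage names are illustrative; a real setup would wire each one to an actual agent or service.

```python
from typing import Callable

# One pipeline stage: takes the previous agent's output, returns its own.
Stage = Callable[[str], str]

def run_pipeline(event: str, stages: list[Stage]) -> str:
    # Each agent consumes the previous agent's output, rather than two
    # agents negotiating back and forth over one shared artifact.
    for stage in stages:
        event = stage(event)
    return event
```

That one-directional flow is what distinguishes workflow integration from the pair-coding setup in the article.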
JDS wrote about this https://jdsemrau.substack.com/p/pair-programming-superbill-w...
Multi-turn review of code written by CC and reviewed by Codex works pretty well. It's been one of the only ways to deliver larger-scoped features without constant bugs. I've seen them do 10-15 rounds of fix and review until complete.
Also implemented this as a GH Action; works well for the Sentry -> GH -> auto-triage -> fix-PR flow.
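The fix/review loop can be driven by a small script like the sketch below. The headless entry points (`claude -p`, `codex exec`) and the APPROVED stop-word convention are assumptions; verify the flags against your installed CLI versions.

```python
import subprocess

MAX_ROUNDS = 15  # the parent comment reports 10-15 rounds in practice

def is_approved(verdict: str) -> bool:
    # APPROVED is a convention we impose via the reviewer prompt below,
    # so the loop has a machine-checkable stop signal.
    return verdict.strip().startswith("APPROVED")

def run_agent(cmd: list[str], prompt: str) -> str:
    # Headless invocation; 'claude -p' and 'codex exec' are assumed
    # non-interactive entry points -- check your installed versions.
    out = subprocess.run(cmd + [prompt], capture_output=True, text=True, check=True)
    return out.stdout

def fix_review_loop(task: str) -> int:
    """Alternate implementer (Claude) and reviewer (Codex) until the
    reviewer approves or the round budget runs out. Returns rounds used."""
    run_agent(["claude", "-p"], f"Implement: {task}")
    for round_no in range(1, MAX_ROUNDS + 1):
        verdict = run_agent(
            ["codex", "exec"],
            "Review the latest changes. Start your reply with APPROVED if "
            "there are no issues; otherwise list concrete fixes.",
        )
        if is_approved(verdict):
            return round_no
        run_agent(["claude", "-p"], f"Apply these review fixes:\n{verdict}")
    return MAX_ROUNDS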
How do you do this? Are you just switching between clis? Or is there a tool that uses the models in that way?
Yes I’ve had a lot of success with this too. I found with prompt tightening I seldom do more than 5 rounds now, but it also does an explicit plan step with plan review.
Currently I’m authoring with codex and reviewing with opus.
Good reminder: don't forget the plan review!
I systematically use reviewer agents in Swival: https://swival.dev/pages/reviews.html
Even with the same model (--self-review), that makes a huge difference, and immediately highlights how bad the first iterations of an LLM output can be.
The circle of slop.