Please show your benchmarks and evals to prove that your template actually makes any sense and doesn't waste the credits/tokens/requests/etc.
I don't have any benchmarks available right now, and honestly I found it pretty hard to produce them, considering that the workflow I have set up is not fully automated: there is a lot of human intervention in the pre-coding phases.
I feel the problem of token waste a lot, and it was actually the first reason I introduced a hierarchy of instructions and the artifact indexes: to avoid waste. Then I realized that these approaches also help keep a lean context, which helps the AI agent deliver better results.
Consider that in the initial phase token consumption is very limited: it is in the implementation phase that tokens are consumed fast and that the project can proceed with minimal human intervention. You can try just the first requirement-collection phase to try out the approach; the implementation phase is pretty boring and not innovative.
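To give a concrete idea of what I mean by an artifact index, here is a minimal sketch in Python; the file names, the JSON layout, and the build_context helper are all hypothetical and only illustrate the idea of loading just the artifacts relevant to the current task instead of the whole project history:

```python
# Hypothetical sketch of an artifact index: a small JSON file maps topics to
# artifact files, and only the artifacts matching the current task are loaded
# into the agent's context, keeping it lean and cheap in tokens.
import json
from pathlib import Path

def build_context(index_path: str, task_topics: list[str]) -> str:
    """Return a context string containing only the artifacts relevant to the task."""
    index = json.loads(Path(index_path).read_text())  # e.g. {"auth": ["docs/adr-003.md"], ...}
    parts: list[str] = []
    for topic in task_topics:
        for artifact in index.get(topic, []):
            parts.append(Path(artifact).read_text())
    return "\n\n".join(parts)

# Example: only auth-related artifacts are injected into the prompt for an
# auth-related task, instead of the full set of requirements and designs.
context = build_context("artifacts/index.json", ["auth"])
```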
I am playing around with building something similar of my own and am faced with the same question you pose.
How can you tell if your prompt process works? I feel like the outputs of the SDLC process are at a much higher level than what evals typically measure, but I am no eval expert.
How would you benchmark this?
For sure the proposed approach consumes more tokens than just asking, at a high level, for the final outcome of the project and letting an AI agent decide everything and deliver the code. That can be acceptable for small personal projects, but if you want to deliver production-ready code, you need to be able to control all the intermediate decisions, or at least to save and store them. They are needed because otherwise any high-level change you request later will not produce focused and coherent code changes: previously forgotten decisions get modified, and the code change causes lots of side effects.
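To make the "save and store the intermediate decisions" part concrete, here is a rough sketch of the kind of decision record I have in mind; the fields and the helper are illustrative assumptions, not something prescribed by the template:

```python
# Illustrative sketch of a stored intermediate decision. Keeping such records
# lets a later high-level change be checked against earlier decisions instead
# of silently overriding them and causing side effects.
from dataclasses import dataclass, field

@dataclass
class Decision:
    id: str                                   # e.g. "DEC-012"
    summary: str                              # what was decided
    rationale: str                            # why, in one or two sentences
    affected_artifacts: list[str] = field(default_factory=list)

def decisions_touching(decisions: list[Decision], changed_files: list[str]) -> list[Decision]:
    """Before applying a high-level change, surface every stored decision whose
    artifacts overlap the files the change is about to touch."""
    changed = set(changed_files)
    return [d for d in decisions if changed & set(d.affected_artifacts)]
```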
Figma would make this even more amazing, but great work!