What I'm struggling with is that when you ask AI to do something, its answer is nondeterministic: it comes out at least slightly different every time.
If I start out with a "spec" that tells AI what I want, it can create working software for me. Seems great. But let's say some weeks, months, or even years later I realize I need to change my spec a bit. I would like to give the new spec to the AI and have it produce an improved version of "my" software. But there seems to be no way to then evaluate how much, and where, the solution has changed or improved because of the changed spec. Because AI's outputs are nondeterministic, the new solution might be totally different from the previous one. So AI would not seem to support "iterative development" in this sense, does it?
My question then really is, why can't there be an LLM that would always give the exact same output for the exact same input? I could then still explore multiple answers by changing my input incrementally. It just seems to me that a small change in inputs/specs should only produce a small change in outputs. Does any current LLM support this way of working?
> why can't there be an LLM that would always give the exact same output for the exact same input
LLMs are inherently deterministic: for a given input, the model computes the same probability distribution over next tokens. The randomness comes from how providers sample from that distribution, which is controlled by “temperature” and a random seed.
Without the random seed and variable randomness (temperature setting), LLMs will in principle always produce the same output for the same input.
Of course, in a production system the context you pass to the LLM also has to be identical for the output to be identical.
In theory, a detailed enough spec leaves so little room for variation that the LLM would produce effectively the same output, regardless of temp/seed.
Side note: A neat trick to force more “random” output for prompts (when temperature isn’t variable enough) is to add some “noise” data to the input (i.e. off-topic data that the LLM “ignores” in its response).
This is absolutely possible, but probably not enough customers want it for current LLM inference providers to offer it. You can get closer by lowering the temperature parameter, typically a floating-point number in the range 0-1 or 0-2. The lower this number, the less noise in responses, but even 0 does not guarantee identical responses because of other sources of variability, such as floating-point arithmetic and request batching on the serving hardware.
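To make the temperature point concrete, here is a toy sketch of how sampling might work for a single token. It is not any provider's actual implementation: `sample_token`, `sample_run`, and the scores in `logits` are made up for illustration. It only shows that temperature 0 collapses to "always pick the top token", and that a fixed seed makes a sampled run repeatable.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw scores (logits) at a given temperature."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Scale the scores by temperature, then turn them into softmax weights.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one index with probability proportional to its weight.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_run(logits, temperature, seed, n=20):
    """Sample n tokens in a row with one random generator (seed=None is unseeded)."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(n)]


logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

print(sample_run(logits, 0, None))   # greedy: the same token every time
print(sample_run(logits, 1.0, 42))   # sampled, but repeatable with the same seed
print(sample_run(logits, 1.0, None)) # sampled and unseeded: varies between runs
```

A real serving stack has more moving parts than this, which is why temperature 0 on a hosted API can still give slightly different answers.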
In response to the idea of iterative development, it is still possible, actually! You run something more akin to integration tests and measure the output against deterministic processes, or have an LLM judge its own output. These are called evals and in my experience are a pretty hard requirement for trusting deployed AI.
So, you would perhaps ask AI to write a set of unit-tests, then to create the implementation, and then ask the AI to evaluate that implementation against the unit-tests it wrote. Right? But then again, the unit-tests now might be completely different from the previous unit-tests? Right?
Or would it help if a different LLM wrote the unit-tests than the one writing the implementation? Or, should the unit-tests perhaps be in an .md file?
I also have a question about using .md files with AI: Why .md, why not .txt?
Not quite unit tests. Evals should be created by humans, as they are measuring quality of the solution.
Let's take the example of the GitHub PR Slack bot from the blog post. I would expect 2-3 evals out of that.
Starting at the core, the first eval could be that, given a list of Slack messages, it correctly identifies the PRs and calls the correct tool to look up the status of said PR. None of this has to be real and the tool doesn't have to be called, but we can write a test, much like a unit test, that confirms that the AI is responding correctly in that instance.
Next, we can set up another scenario using mocked history that covers Slack messages with open PRs, Slack messages with merged PRs, and messages with no PR links, and again check whether the AI tries to add the correct reaction in each case.
These are both deterministic or code-based evals that you could use to iterate on your solutions.
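A code-based eval along these lines could be sketched roughly as follows. Everything here is hypothetical: `stub_agent` stands in for the real LLM call so the example runs on its own, the repository URLs and the `"merged"` reaction name are made up, and the PR status lookup is mocked per case rather than hitting GitHub.

```python
import re

PR_URL = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")

# Mocked history: message text, mocked PR status, and the reaction we expect.
CASES = [
    {"text": "please review https://github.com/acme/app/pull/12",
     "status": "open", "expected": None},
    {"text": "shipped https://github.com/acme/app/pull/7",
     "status": "merged", "expected": "merged"},
    {"text": "lunch anyone?", "status": None, "expected": None},
]


def stub_agent(text, lookup_status):
    """Stand-in for the real LLM call; returns the reaction to add, or None."""
    match = PR_URL.search(text)
    if match is None:
        return None
    return "merged" if lookup_status(match.group(0)) == "merged" else None


def run_evals(agent, cases):
    """Return the fraction of cases where the agent's reaction matches."""
    passed = 0
    for case in cases:
        # The status tool is mocked: it just returns this case's canned status.
        reaction = agent(case["text"], lambda url, c=case: c["status"])
        passed += reaction == case["expected"]
    return passed / len(cases)


print(run_evals(stub_agent, CASES))
```

With a real agent you would swap `stub_agent` for the function that calls your model, and track the pass rate as you change the prompt.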
The use for an LLM-as-a-Judge eval is more nuanced and is usually there to measure subjective results. Things like: did the LLM make assumptions not present in the context window (hallucinate), or did it respond with something completely out of context? These should be simple yes or no questions that would be easy for a human to answer but hard to code up as a deterministic test case.
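For the judge variant, one possible shape is a fixed yes/no prompt plus a strict parser, so a rambling judge reply fails loudly instead of being silently counted. This is only a sketch: the prompt wording, `build_judge_prompt`, `parse_verdict`, and `hallucinated` are my own invention, and `call_judge` is a placeholder for whatever function sends a prompt to your judge model and returns its text.

```python
def build_judge_prompt(context, answer):
    """Assemble a single yes/no question for the judge model."""
    return (
        "Context:\n" + context + "\n\n"
        "Answer:\n" + answer + "\n\n"
        "Does the answer state anything that is not supported by the context? "
        "Reply with exactly YES or NO."
    )


def parse_verdict(reply):
    """Map the judge's reply to True/False; anything else is an error."""
    word = reply.strip().upper().rstrip(".")
    if word == "YES":
        return True
    if word == "NO":
        return False
    raise ValueError(f"judge gave a non yes/no reply: {reply!r}")


def hallucinated(context, answer, call_judge):
    """Ask the judge whether the answer goes beyond the given context."""
    return parse_verdict(call_judge(build_judge_prompt(context, answer)))


# A canned judge keeps the example runnable without any API access.
print(hallucinated("PR 12 is open.", "PR 12 was merged.", lambda prompt: "YES"))
```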
Once you have your evals defined, you can run them regularly, and you reach a point where you can iterate on your prompts with more confidence than vibes alone.
Edit: I did want to share that if you can make something deterministic, you probably should. The Slack PR example is something I'd just write as a simple script that runs on a cron schedule, but it was an easy example to pull on.
Other concerns:
1) How many bits and bobs of like, GPLed or proprietary code are finding their way into the LLM's output? Without careful training, this is impossible to eliminate, just like you can't prevent insect parts from finding their way into grain processing.
2) Prompt injection is a doddle to implement: malicious HTML, PDF, and JPEG with "ignore all previous instructions" type input can pop many current models. It's also very difficult to defend against. With agents running higgledy-piggledy on people's dev stations (container discipline is NOT being practiced at many shops), who knows what kind of IDs and credentials are being lifted?
Nice analogy, insect parts. I think that is the elephant in the room. I read that Microsoft said something like 30% of their code output is AI-generated. Do they know what the training set was for the AI they use? Should they be transparent about that? Or, if/since it is legal to do your AI training "in the dark", does that solve the problem for them, so that they cannot be held responsible for the outputs of the AI they use?
> We still start all workflows using the LLM, which works for many cases. When we do rewrite, Claude Code can almost always rewrite the prompt into the code workflow in one-shot.
Why always start with an LLM to solve problems? Using an LLM adds a judgment call, and (at least for now) those judgment calls are not reliable. For something like the motivating example in this article of "is this PR approved" it seems straightforward to get the deterministic right answer using the github API without muddying the waters with an LLM.
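As a rough illustration of the deterministic route, the approval check itself can be a small pure function. This sketch assumes the reviews have already been fetched and look like the entries GitHub's REST API returns when listing reviews for a pull request (a `user.login` and a `state` such as `APPROVED` or `CHANGES_REQUESTED`), ordered oldest to newest. The rule in `is_approved` is my own assumption about what "approved" should mean; a real repo's branch protection settings may define it differently.

```python
def is_approved(reviews):
    """Decide approval from review dicts ordered oldest to newest."""
    latest = {}
    for review in reviews:
        state = review["state"]
        # Plain comments do not change a reviewer's standing verdict.
        if state in ("APPROVED", "CHANGES_REQUESTED", "DISMISSED"):
            latest[review["user"]["login"]] = state
    states = set(latest.values())
    # Approved if someone approves and nobody still requests changes.
    return "APPROVED" in states and "CHANGES_REQUESTED" not in states


reviews = [
    {"user": {"login": "alice"}, "state": "CHANGES_REQUESTED"},
    {"user": {"login": "alice"}, "state": "APPROVED"},
    {"user": {"login": "bob"}, "state": "COMMENTED"},
]
print(is_approved(reviews))
```

Same input, same answer, every time, and no judgment call in the loop.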
Likely because it's just easier to see if the LLM solution works. When it doesn't, then it makes more sense to move to deterministic workflows (which, to be honest, aren't all that hard to build with Claude Code).
It's the old principle of avoiding premature optimization.
It’s sort of difficult to understand why this is even a question - LLM-based / judgment dependent workflows vs script-based / deterministic workflows.
In mapping out the problems that need to be solved with internal workflows, it’s wise to clarify upfront where probabilistic judgments are helpful / required vs. not. If the process is fixed and requires determinism, why not just write scripts (code-gen’ed, of course)?
This bothered me at first but I think it's about ease of implementation. If you've built a good harness with access to lots of tools, it's very easy to plug in a request like "if the linked PR is approved, please react to the slack message with :checkmark:". For a lot of things I can see how it'd actually be harder to generate a script that uses the APIs correctly than to rely on the LLM to figure it out, and maybe that lets you figure out if it's worth spending an hour automating properly.
Of course the specific example in the post seems like it could be one-shotted pretty easily, so it's a strange motivating example.
It seems easier, but in my experience building an internal agent it’s not actually easier long term, just slow and error prone, and you will find yourself trying to solve prompt and context problems for something that should be both reliable and instantaneous.
These days I do everything I can to do straightforward automation and only get the agent involved when it’s impossible to move forward without it.
There is a third option, letting AI write workflow code:
https://youtu.be/zzkSC26fPPE
You get the benefit of AI CodeGen along with the determinism of conventional logic.
hit this with support ticket filtering. llm kept missing weird edge cases. wrote some janky regex instead, works fine
This is the basic idea we built Tasklet.ai on. LLMs are great at problem solving but less great at cost and reliability — but they are great at writing code that is!
So we gave the Tasklet agent a filesystem, shell, code runtime, general purpose triggering system, etc so that it could build the automation system it needed.
it's just a form of structured output. you still need an env to run the code. secure it. maintain it. upgrade it. it's some work. easier to build a rule based workflow for simple stuff like this.