I've been able to avoid this kind of markdown library architecture with very chatty tool feedback. Interaction with a responsive environment is much better than static chunks of "skill" text. For example, imagine a domain constraint:
"You must use tool ABC before calling tool XYZ"
This can either be in some static prompt scheme somewhere, or it can be the live result of a tool call.
If you make everything tool calling and environmental, you effectively have a lazily evaluated & dynamic prompt scheme.
I like to think of this as context for the context. The better you map the environment and descriptions of it to the agent, the less top-down prompting is required.
If you set up the harness correctly, you can run circles around a lot of what passes as AI innovation with powershell in a while loop. Adding static markdown document soup on top of this would only reduce performance in the general case.
Yup! I feel pretty strongly that every little nit pick and instruction you pass into your model is murdering your output. Having a hook that executes on tool calls is significantly better than telling your agent to follow your repos specific format/lint/style/test constraints
I read this post thinking "Finally! Finally someone will explain to me what I've been missing because 'skills' just seem to be re-usable text that help make prompting faster."
Agree on article frustrations. Perhaps a better explanation, skills are just disk-cached prompts conditioned on verified success. The conditioned on verified success part might seem inconsequential, but it’s the whole thing that gives skills their value. Also the fact that their loading can be scoped to a certain calling context.
Tldr: you're doing it wrong but I will not show you how to do it right. I also did not run the bench using my approach but it definitely “vibes better” to me, and I reject your actual research paper.
Come on, show us some actual skills.
That one you use all the time looks a hell of a lot like “I wont a deterministic shell script for something a skill saying ‘run the shell script’”
Is that what you do? How much time do you spend on them? How do you stop the agent from making a bunch of very similar skills? How do you deal with the explosion of the total number of skills impacting your token use? Do you use skills from github, or is that bad practice? Why?
So many unanswered questions; so little content. :/
Skills for repitition are totally valid. Having a version control skill that explains that I use gitea works great. My point is that asking for a skill that tells us if our program will get stuck before taking on a halting problem won't get you any further than just starting the task with xhigh thinking
TL;DR don't have your agent write skills using only its latent knowledge, otherwise you may as well not use a skill in the first place and let it summon that latent knowledge on the fly.
Not sure if this take is correct though. I suspect self-generated skills help the agent avoid having to "decompress" its latent knowledge, which might save tokens? idk, I am not an expert
It seems so obvious: How would it know better than it already does?
Yet I’ve seen people succeed with „write me a prompt“ prompts. The model makes something up, often it makes sense.
They are like plans in that way: It’s not exactly novel knowledge, but it at least encodes it somewhere to make the process verifiable beforehand and a bit more repeatable.
I wouldn’t be surprised if it improves performance a little, just like thinking blocks do (every model reasons now).
Autogenerated content is good scaffolding, but then I have a rule where if I mark heading with "(by-human)" the section shouldn't be changed by LLM without permission.
Having agents create their own skills works well if you also give it a layer of verifiability.
Eg. Ask the agent to write a skill then get it to prompt a subagent to use the skill, then iterate until it verifies the task was completed correctly
Yes, I wrote a forge skill to do this via a/b testing and third agent to judge the result.
https://github.com/bjcoombs/ai-native-toolkit/blob/main/skil...
It hardens a skill through judge-panel refinement rounds, it’s a quality gate that runs after authoring, not an authoring tool.
This is a pretty neat, I suspect that eventually every skill will have some sort of validation/verification loop like this
I've been able to avoid this kind of markdown library architecture with very chatty tool feedback. Interaction with a responsive environment is much better than static chunks of "skill" text. For example, imagine a domain constraint:
"You must use tool ABC before calling tool XYZ"
This can either be in some static prompt scheme somewhere, or it can be the live result of a tool call.
If you make everything tool calling and environmental, you effectively have a lazily evaluated & dynamic prompt scheme.
I like to think of this as context for the context. The better you map the environment and descriptions of it to the agent, the less top-down prompting is required.
If you set up the harness correctly, you can run circles around a lot of what passes as AI innovation with powershell in a while loop. Adding static markdown document soup on top of this would only reduce performance in the general case.
Can you go into more detail about your setup and use cases?
Yup! I feel pretty strongly that every little nit pick and instruction you pass into your model is murdering your output. Having a hook that executes on tool calls is significantly better than telling your agent to follow your repos specific format/lint/style/test constraints
Could you publish your gitlab skill to give an example?
I read this post thinking "Finally! Finally someone will explain to me what I've been missing because 'skills' just seem to be re-usable text that help make prompting faster."
Nope. Still the same.
Agree on article frustrations. Perhaps a better explanation, skills are just disk-cached prompts conditioned on verified success. The conditioned on verified success part might seem inconsequential, but it’s the whole thing that gives skills their value. Also the fact that their loading can be scoped to a certain calling context.
> conditioned on verified success
Thank you! That made it clear to me why it's an useful caching technique.
Agree; posts like this frustrate me.
Tldr: you're doing it wrong but I will not show you how to do it right. I also did not run the bench using my approach but it definitely “vibes better” to me, and I reject your actual research paper.
Come on, show us some actual skills.
That one you use all the time looks a hell of a lot like “I wont a deterministic shell script for something a skill saying ‘run the shell script’”
Is that what you do? How much time do you spend on them? How do you stop the agent from making a bunch of very similar skills? How do you deal with the explosion of the total number of skills impacting your token use? Do you use skills from github, or is that bad practice? Why?
So many unanswered questions; so little content. :/
You're probably using adverbs wrongly.
What if I want a way to open up a latent space prompt without having to type it all out everytime?
Skills for repitition are totally valid. Having a version control skill that explains that I use gitea works great. My point is that asking for a skill that tells us if our program will get stuck before taking on a halting problem won't get you any further than just starting the task with xhigh thinking
TL;DR don't have your agent write skills using only its latent knowledge, otherwise you may as well not use a skill in the first place and let it summon that latent knowledge on the fly.
Not sure if this take is correct though. I suspect self-generated skills help the agent avoid having to "decompress" its latent knowledge, which might save tokens? idk, I am not an expert
It seems so obvious: How would it know better than it already does?
Yet I’ve seen people succeed with „write me a prompt“ prompts. The model makes something up, often it makes sense.
They are like plans in that way: It’s not exactly novel knowledge, but it at least encodes it somewhere to make the process verifiable beforehand and a bit more repeatable.
I wouldn’t be surprised if it improves performance a little, just like thinking blocks do (every model reasons now).
I now have rules to not let agent write any docs or processes. Pretty much anything LLM auto-generated are of zero reuse value.
Autogenerated content is good scaffolding, but then I have a rule where if I mark heading with "(by-human)" the section shouldn't be changed by LLM without permission.
Skills can transfer one session's latent knowledge to all other sessions.