We have been experimenting with routing inference across LLMs, and the path has been full of wrong turns.
Our first attempt was to use a large LLM itself as the router. It was too costly, and its routing decisions were unreliable.
Next we fine-tuned a small LLM as a router. It was cheaper, but its decisions were poor and hard to trust.
Then we wrote heuristics to map prompt types to model IDs. That worked for a while, but it was brittle. Every API change or workload shift broke it.
Eventually we shifted to thinking in terms of model criteria instead of hardcoded model IDs. We benchmarked models across task types, domains, and complexity levels, and made routing decisions based on those profiles.
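To make the idea concrete, here is a minimal sketch of what a criteria-based profile could look like. The class, the field names, and every number below are illustrative placeholders, not our actual benchmark results.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    """Benchmarked capabilities of one candidate model (illustrative values only)."""
    name: str
    task_scores: dict[str, float]  # quality per task type from our eval harness, 0.0-1.0
    max_complexity: float          # highest prompt complexity handled reliably in benchmarks
    cost: float                    # relative cost, used as a tie-breaker

# Hypothetical profiles; every number here is a placeholder, not a real benchmark result.
PROFILES = [
    ModelProfile(
        name="claude-opus-4-1",
        task_scores={"code_generation": 0.95, "qa": 0.93, "summarization": 0.92},
        max_complexity=0.95,
        cost=15.0,
    ),
    ModelProfile(
        name="gpt-5-mini",
        task_scores={"code_generation": 0.80, "qa": 0.85, "summarization": 0.88},
        max_complexity=0.55,
        cost=0.25,
    ),
]
```

The point of the structure is that nothing downstream refers to a model ID directly; swapping a model in or out only means re-benchmarking and updating its profile.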
To estimate task type and complexity, we used NVIDIA’s Prompt Task and Complexity Classifier. It classifies prompts into categories like QA, summarization, code generation, and more. It also scores prompts along six dimensions: creativity, reasoning, domain knowledge, contextual knowledge, constraints, and number of few-shot examples. From these it produces a weighted overall complexity score.
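As a rough sketch of how the per-dimension scores roll up into one number, the snippet below computes a weighted overall complexity value. The dictionary keys mirror the six dimensions; the weights and the example scores are illustrative stand-ins, not the classifier’s actual weighting or output.

```python
# Illustrative weights only; the classifier defines its own weighting internally.
DIMENSION_WEIGHTS = {
    "creativity": 0.2,
    "reasoning": 0.3,
    "domain_knowledge": 0.2,
    "contextual_knowledge": 0.1,
    "constraints": 0.1,
    "few_shots": 0.1,
}

def overall_complexity(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of the six per-dimension scores, normalized to 0.0-1.0."""
    total = sum(DIMENSION_WEIGHTS[d] * dimension_scores.get(d, 0.0) for d in DIMENSION_WEIGHTS)
    return total / sum(DIMENSION_WEIGHTS.values())

# Example: a code-generation prompt with made-up per-dimension scores.
scores = {"creativity": 0.1, "reasoning": 0.7, "domain_knowledge": 0.6,
          "contextual_knowledge": 0.3, "constraints": 0.5, "few_shots": 0.0}
print(overall_complexity(scores))  # 0.43 with the weights above
```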
This gave us a structured way to decide when a prompt justified a premium model like Claude Opus 4.1 and when a smaller model like GPT-5-mini would perform just as well.
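Putting the pieces together, the routing decision becomes a lookup over the profiles: pick the cheapest model that clears a quality bar for the predicted task type and has handled the prompt’s complexity level in our benchmarks. This sketch reuses the hypothetical ModelProfile and PROFILES objects from above, with a made-up quality threshold.

```python
def route(task_type: str, complexity: float, profiles: list[ModelProfile],
          min_quality: float = 0.85) -> ModelProfile:
    """Pick the cheapest profile that meets the quality bar for this task type
    and has handled this complexity level in benchmarks."""
    candidates = [
        p for p in profiles
        if p.task_scores.get(task_type, 0.0) >= min_quality
        and p.max_complexity >= complexity
    ]
    if not candidates:
        # Fall back to the most capable model when nothing cheaper qualifies.
        return max(profiles, key=lambda p: p.max_complexity)
    return min(candidates, key=lambda p: p.cost)

# A low-complexity QA prompt stays on the small model; a hard code-generation
# prompt escalates to the premium model.
print(route("qa", 0.3, PROFILES).name)               # gpt-5-mini
print(route("code_generation", 0.8, PROFILES).name)  # claude-opus-4-1
```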
Now we are working on integrating this with Google’s UniRoute (https://arxiv.org/abs/2502.08773).