> Agents are cardinality-hungry. They want the high-cardinality data you'd normally drop: individual trace IDs, per-request attributes, full tag sets. They are very patient. They will sift through it.
The agents themselves are unlikely to be the ones running the high-cardinality queries, or they will keel over: they have limited memory buffers, they will take many seconds to return results, and they will likely be limited in QPS.
From the blog:
> Apache Iceberg, with data stored as Parquet on S3, and most of the system implemented in Go
You have just ensured that queries will have a p99 >1 second. This is kind of antithetical to having an agent be fast.
You couldn't run any sort of real-time service on this, where hundreds of thousands to millions of events occur per second and you need to adjust in milliseconds.
The terms "p99" and "QPS" do not occur anywhere in the article, which leaves the question of scalability to the reader's imagination.
I applaud the direction. I am looking for objective evidence.
Author here - you are right that this architecture is not going to deliver super fast queries. But that's a tradeoff we're making: Agents don't need super fast queries to triage your software issues. In fact, the agents are extremely good at triage by fanning out to explore hypotheses against the telemetry. What they need is a datastore that allows them to run a _ton_ of queries in parallel, on the cheap. Data Lake architectures like ours provide exactly this.
At the end of the day, we're less focused on traditional database query metrics. We're optimizing for higher level outcomes, think mean time to remediation and such.
> The agents themselves are not likely going to be doing the high cardinality queries or they will keel over. They have limited memory buffers. They will take many seconds to return results. They are likely going to be limited in terms of QPS.
You're right that if you simply give an LLM a tool to query a massive high-cardinality dataset, it's going to blow itself (its context window) up. That's not what we're doing: instead we harness the LLM with purpose-built tools + prompts + context + other engineering to ensure the agent can explore the data and make progress, even if it does run a dumb query on occasion.