Why Agent Eval Should Start With Retrieval, Not the LLM

Your agent confidently tells a customer that a discontinued plan still includes phone support. The model did everything right. It reasoned cleanly, cited a source, and produced fluent prose. The problem was upstream: the retrieval layer handed it a two-year-old policy document, and no amount of prompt engineering or model swapping was going to fix that. Teams keep grading the LLM when the failure was in what the LLM was fed.

This is the most common, and most expensive, mistake in agent evaluation. Eval budgets pour into model benchmarks, temperature sweeps, and prompt rewrites, while the retrieval step that determines whether the right facts even reach the context window goes unmeasured. Sanity Context is the AI Content Operating System's retrieval layer for agents, an intelligent backend that grounds them in real, structured, current content so you can evaluate what they were given before you blame what they did with it.

This article reframes agent eval from the bottom of the stack up. We will argue that retrieval quality sets the ceiling on every downstream metric, show you how to measure it directly, and connect that discipline to the content infrastructure that makes good retrieval repeatable rather than accidental.

The failure is usually in the context, not the completion

When an agent gets an answer wrong, the instinct is to interrogate the model. Was the prompt clear? Was the temperature too high? Should we upgrade to a larger model? These questions assume the model had what it needed and fumbled the reasoning. In production RAG systems, that assumption is wrong far more often than it is right.

Consider the structure of a retrieval-augmented answer. The agent receives a user query, a retrieval step pulls candidate passages from a knowledge store, those passages are assembled into the context window, and only then does the model generate. Three of those four stages happen before the LLM does anything. If the retrieval step returns a stale policy, an irrelevant product page, or nothing at all, the model is reasoning over garbage, and even a perfect model will produce a confident, well-cited, completely wrong answer. That is the worst possible failure mode, because it looks like success.

This is why retrieval should be the first thing you evaluate, not the last. The model's ceiling is fixed by what reaches its context window. You can spend a quarter fine-tuning prompts and swapping foundation models and move your end-to-end accuracy by a few points, or you can fix the retrieval step that was silently capping every one of those experiments. The teams that ship reliable agents measure the inputs before they argue about the outputs. They treat the context window as the unit of analysis, not the completion.

Retrieval quality sets the ceiling on every downstream metric

There is a clean mental model here: end-to-end agent accuracy is bounded above by retrieval recall. If the correct fact is not in the retrieved set, the probability that the model produces the correct answer drops to whatever it can guess from parametric memory, which for your proprietary product, pricing, and policy data is effectively zero. No prompt rescues an answer whose evidence was never retrieved.

This bound has a practical consequence for how you allocate eval effort. Suppose your agent answers 72 percent of support questions correctly. Before you touch the model, decompose that number. Run your eval set through the retrieval step alone and ask a narrower question: for what fraction of queries did the correct passage appear in the top k results? If retrieval recall is 80 percent, your model is already converting 90 percent of what it gets (72 of every 80 queries that had the right evidence), and further model work has a hard ceiling 8 points away. If retrieval recall is 95 percent but accuracy is 72, the model is converting only about three quarters of what it receives, and you have a genuine generation problem worth chasing. Without this decomposition you are guessing.
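
The decomposition above can be sketched in a few lines of Python. This is an illustration under a simplifying assumption: an answer only counts as correct when its evidence was retrieved, so lucky guesses from parametric memory are ignored. The function name `decompose` and the keys it returns are hypothetical, chosen for this example.

```python
def decompose(accuracy: float, retrieval_recall: float) -> dict:
    """Split end-to-end accuracy into a retrieval part and a generation part.

    Assumes (simplification) that the model can only be right when the
    correct passage was retrieved, so accuracy <= retrieval_recall.
    """
    if not 0.0 < retrieval_recall <= 1.0:
        raise ValueError("retrieval_recall must be in (0, 1]")
    if not 0.0 <= accuracy <= retrieval_recall:
        raise ValueError("accuracy must be between 0 and retrieval_recall")
    return {
        # Share of well-grounded queries the model answers correctly.
        "conversion": accuracy / retrieval_recall,
        # Most that model or prompt work alone could still add.
        "generation_headroom": retrieval_recall - accuracy,
        # What only retrieval work can unlock.
        "retrieval_headroom": 1.0 - retrieval_recall,
    }


# The two scenarios from the text: same accuracy, different retrieval recall.
low_recall = decompose(accuracy=0.72, retrieval_recall=0.80)
high_recall = decompose(accuracy=0.72, retrieval_recall=0.95)
```

With recall at 0.80 the conversion rate comes out at 0.9 and generation headroom at 8 points; with recall at 0.95 the generation headroom grows to 23 points, which is the case where model work is worth the effort.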

The corollary is that retrieval is also where the cheapest wins live. Improving how documents are chunked, how queries are expanded, or how lexical and semantic signals are blended often moves end-to-end accuracy more than a model upgrade, at a fraction of the cost and with no change to your inference bill. Retrieval is tunable, measurable, and yours to control in ways the model's weights are not.

How to actually measure retrieval in isolation

Evaluating retrieval as its own component requires a labeled set and a few unglamorous metrics. Start by assembling a set of representative queries paired with the document or passage that genuinely answers each one. This is the hard, human part, and it is non-negotiable; without ground-truth relevance judgments you cannot say whether retrieval is working.

With that set in hand, the core metrics are recall at k, the fraction of queries where the right passage lands in the top k results, and a rank-sensitive measure such as mean reciprocal rank or normalized discounted cumulative gain, which rewards putting the right passage near the top rather than merely somewhere in the window. Recall at k tells you whether the evidence is present at all; the rank metrics tell you whether it arrives early enough to survive context-window truncation and to dominate the model's attention.
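
A minimal sketch of these metrics in Python, assuming each query has exactly one relevant passage and binary relevance (under which nDCG for a query reduces to 1 / log2(rank + 1)). The function names and the representation of results as lists of passage IDs are illustrative assumptions, not the API of any particular eval library.

```python
import math


def _check(ranked: list[list[str]], gold: list[str]) -> None:
    if not gold or len(ranked) != len(gold):
        raise ValueError("need one ranked result list per labeled query")


def recall_at_k(ranked: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of queries whose gold passage appears in the top k results."""
    _check(ranked, gold)
    hits = sum(1 for results, g in zip(ranked, gold) if g in results[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the gold passage; a miss contributes 0."""
    _check(ranked, gold)
    total = 0.0
    for results, g in zip(ranked, gold):
        if g in results:
            total += 1.0 / (results.index(g) + 1)
    return total / len(gold)


def ndcg_at_k(ranked: list[list[str]], gold: list[str], k: int) -> float:
    """nDCG@k with one relevant passage per query (ideal DCG is 1)."""
    _check(ranked, gold)
    total = 0.0
    for results, g in zip(ranked, gold):
        top = results[:k]
        if g in top:
            # Rank r (1-based) earns a discounted gain of 1 / log2(r + 1).
            total += 1.0 / math.log2(top.index(g) + 2)
    return total / len(gold)
```

For three queries where the gold passage ranks first, third, and not at all, this gives recall at 1 of 1/3, recall at 3 of 2/3, an MRR of 4/9, and nDCG at 3 of 0.5. The gap between recall at 1 and recall at 3 is the kind of signal the rank metrics are there to expose.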

The diagnostic that surprises teams most is the failure taxonomy. Bucket your retrieval misses: was the answer simply absent from the knowledge store, present but stale, present but chunked so the relevant sentence was split across boundaries, or present and correctly chunked but out-ranked by a near-duplicate? Each bucket points to a different fix, and none of them is a model fix. A missing answer is a content-coverage problem. A stale answer is a freshness problem. A split answer is a chunking problem. An out-ranked answer is a scoring problem. Eval that stops at one accuracy number tells you that you have a problem; eval that buckets retrieval failures tells you what to do on Monday.
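
One way to make the taxonomy operational is to record a few yes/no judgments for every retrieval miss and bucket them in a fixed order. The sketch below assumes a reviewer (or a script) has already filled in those judgments; the `RetrievalMiss` fields and the `MissCause` names are hypothetical and only mirror the four buckets described above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class MissCause(Enum):
    COVERAGE = "answer absent from the knowledge store"
    FRESHNESS = "answer present but stale"
    CHUNKING = "answer split across chunk boundaries"
    SCORING = "answer intact but out-ranked"


@dataclass
class RetrievalMiss:
    query: str
    answer_in_store: bool         # does any document contain the answer?
    answer_is_current: bool       # is that document up to date?
    answer_in_single_chunk: bool  # does one chunk hold the full answer?


def classify(miss: RetrievalMiss) -> MissCause:
    """Assign a miss to the first bucket that applies, earliest cause first."""
    if not miss.answer_in_store:
        return MissCause.COVERAGE
    if not miss.answer_is_current:
        return MissCause.FRESHNESS
    if not miss.answer_in_single_chunk:
        return MissCause.CHUNKING
    # Present, current, and intact: something else simply ranked higher.
    return MissCause.SCORING


def bucket(misses: list[RetrievalMiss]) -> Counter:
    """Count misses per cause so the largest bucket can be fixed first."""
    return Counter(classify(m) for m in misses)
```

Calling `bucket(misses).most_common()` on a week of misses yields a ranked list of causes, which is the "what to do on Monday" view.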

Why hybrid retrieval is the difference between recall and noise

Once you start measuring retrieval failures, a pattern emerges fast. Pure semantic search, vectors and nothing else, misses exact matches: a product SKU, an error code, a version number, a proper noun the embedding model never learned. Pure keyword search misses paraphrase: the user asks about "cancelling" and the document says "terminating your subscription." Each approach has a failure class the other covers. Hybrid retrieval, blending lexical and semantic signals, is how you raise recall without drowning the model in noise.

This is where the retrieval layer's architecture stops being incidental. In Sanity, hybrid retrieval is native to the Content Lake rather than something you assemble from separate systems. A single GROQ query blends `text::semanticSimilarity()` for meaning with a BM25-style `match()` for exact terms, combined with `score()` and `boost()` so you tune how the two signals trade off, all in one query against your content. You are not running a vector database alongside a search engine alongside your CMS and reconciling three sets of results in glue code; the blend happens where the content already lives.

That architecture matters for eval specifically because it collapses the surface area you have to measure. When retrieval is one query against one store, your failure taxonomy maps to knobs you can actually turn: adjust the boost, refine the chunking, expand the query. When retrieval is three stitched systems, a recall miss could originate in any of them, and isolating the cause becomes its own investigation. Native hybrid retrieval makes the eval loop tight enough to run every day.
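
To make the trade-off between the two signals concrete, here is a generic Python sketch of score blending: min-max normalize the lexical and semantic scores, then take a weighted sum. This is an illustration of the general idea only. It is not how GROQ's `score()` and `boost()` compute their results, and the function names, the `lexical_weight` parameter, and the document IDs are assumptions made for the example.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    lexical: dict[str, float],
    semantic: dict[str, float],
    lexical_weight: float = 0.5,
) -> list[str]:
    """Rank documents by a weighted blend of lexical and semantic scores.

    A document missing from one signal contributes 0 for that signal.
    """
    lex, sem = normalize(lexical), normalize(semantic)
    blended = {
        doc: lexical_weight * lex.get(doc, 0.0)
        + (1.0 - lexical_weight) * sem.get(doc, 0.0)
        for doc in set(lex) | set(sem)
    }
    # Highest blended score first; ties broken by ID for determinism.
    return sorted(blended, key=lambda doc: (-blended[doc], doc))


# Hypothetical scores: the keyword signal finds the exact SKU page,
# while the semantic signal prefers the paraphrased cancellation policy.
lexical_scores = {"sku-page": 12.0, "faq": 3.0}
semantic_scores = {"cancel-policy": 0.91, "faq": 0.60, "sku-page": 0.20}

exact_match_first = hybrid_rank(lexical_scores, semantic_scores, lexical_weight=0.7)
paraphrase_first = hybrid_rank(lexical_scores, semantic_scores, lexical_weight=0.2)
```

Sweeping `lexical_weight` over the labeled eval set and re-computing recall at k for each value is a simple way to tune the blend against evidence rather than intuition.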

Stale content is a retrieval failure your dashboard won't catch

The most insidious retrieval failures pass every test that ignores time. Your eval set was labeled in March. The pricing page changed in June. Now the correct passage in your ground truth is itself wrong, and your retrieval metrics look fine while your agent confidently quotes a price that no longer exists. Freshness is a dimension of retrieval quality that conventional eval harnesses, which treat the corpus as static, systematically miss.
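
A cheap guard against this is to compare when each ground-truth label was written against when its document last changed. The Python sketch below assumes you can obtain a last-updated timestamp per document from your content store; the `Label` fields and the `doc_updated_at` mapping are hypothetical names for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Label:
    query: str
    doc_id: str            # the passage judged correct for this query
    labeled_at: datetime   # when a human made that judgment


def stale_labels(
    labels: list[Label],
    doc_updated_at: dict[str, datetime],
) -> list[Label]:
    """Return labels whose document changed (or vanished) after labeling."""
    stale = []
    for label in labels:
        updated = doc_updated_at.get(label.doc_id)
        # A missing document is flagged too: the label points at content
        # that is no longer in the store.
        if updated is None or updated > label.labeled_at:
            stale.append(label)
    return stale
```

Labels returned by `stale_labels` are candidates for re-review, not automatic failures: the edit may have left the relevant passage untouched. Running the check before every eval run keeps the metrics from silently drifting away from the current corpus.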

In most stacks, freshness is a pipeline problem. Content lives in a CMS, embeddings live in a separate vector store, and a sync job, often nightly, often flaky, copies changes from one to the other and re-embeds them. Every hour between a content edit and a re-embed is an hour your agent can retrieve the old version with full confidence. The gap is invisible because nothing errors; the system simply serves yesterday's truth.

This is a structural advantage of tying embeddings to content rather than maintaining them as a separate artifact. With dataset embeddings in Sanity, embeddings are bound to the content itself, so when an editor publishes a change the update propagates within minutes, with no separate vector pipeline to fall behind or break. Editors govern what agents can retrieve from the Studio, and they can stage changes with Content Releases, evaluating how the agent will behave against new content before it goes live, the same way they stage a website. Freshness stops being an operational race against a cron job and becomes a property of the content layer.

Building retrieval eval into the content operating system

Treating retrieval as the first-class object of evaluation changes where eval lives. It stops being a notebook a single ML engineer runs before a release and becomes a continuous discipline owned across the team, because the levers that move retrieval (coverage, freshness, structure, governance) are content levers, not model levers.

This is the argument for grounding agents in a Content Operating System rather than a pile of disconnected infrastructure. Sanity Context turns your datasets, websites, PDFs, and support databases into agent-readable Knowledge Bases that share one retrieval path, the same hybrid GROQ query against the Content Lake. Production agents connect through the Sanity Context MCP endpoint, so the thing you evaluate in your harness is the thing your agents actually query in production. There is no drift between your eval environment and your serving environment because they read from the same store.

That shared foundation is what makes retrieval eval repeatable instead of heroic. When content is structured and queryable, your ground-truth set can reference the same documents your agent retrieves. When embeddings track content, your freshness checks reflect reality. When editors govern agent instructions in the Studio and stage them with Content Releases, the people closest to the content can catch a coverage gap before it becomes a hallucination in front of a customer. Legacy stacks force you to scale headcount to keep retrieval honest across silos; an integrated content layer scales the output instead. Start your agent eval at retrieval, and put retrieval somewhere you can actually measure, govern, and keep current.

Where retrieval eval gets measured: native content layer vs. assembled stacks

Feature	Sanity	Pinecone	Contentful	pgvector / Neon
Hybrid retrieval	Native: text::semanticSimilarity() + match() blended with score() and boost() in one GROQ query against the Content Lake.	Sparse-dense vectors support hybrid, but lexical and content joins are assembled in app code outside the index.	No native hybrid search; teams wire an external search service via the App Framework and reconcile results downstream.	Vector similarity plus full-text tsvector are both available, but blending and scoring them is hand-rolled SQL you own and tune.
Embedding freshness	Dataset embeddings are tied to content, so an editor's publish propagates within minutes with no separate vector pipeline to lag.	Freshness depends on your own sync job re-embedding changed content; the index has no view of source-content edits.	Content changes are live in the CMS, but any embeddings live in an external store fed by a pipeline you maintain.	Re-embedding on content change is a trigger or job you build; the database does not track source freshness for you.
Eval / serving parity	Agents query through the Sanity Context MCP endpoint, the same store and path your eval harness reads, so no environment drift.	Eval and serving can both hit the index, but the surrounding retrieval glue must be replicated to match production.	Eval must reproduce the external search and assembly layer, which lives outside Contentful, to match what agents see.	Parity is achievable since it is one database, though the retrieval logic around it must be kept identical in both paths.
Failure-taxonomy isolation	One query against one store means recall misses map to specific knobs: boost, chunking, or coverage, not cross-system forensics.	A miss could be in embedding, the index, or the lexical layer you bolted on, so isolating the cause spans systems.	Misses can originate in the CMS query or the separate search service, requiring tracing across both to localize.	Single database narrows the surface, but vector and full-text paths are distinct queries to diagnose separately.
Editorial governance of retrieval	Editors govern what agents retrieve in the Studio and stage changes with Content Releases before they reach production.	Vector database is infrastructure with no editorial surface; content owners cannot inspect or stage what agents see.	Editors manage content, but the AI retrieval configuration lives in external services outside their workflow.	No editorial layer; retrieval behavior is governed entirely in code and database administration, not by content owners.
Unstructured source ingestion	Knowledge Bases turn datasets, websites, PDFs, and support databases into agent-readable docs on the same retrieval path.	Accepts any vectors you produce, but parsing and chunking PDFs or sites is a separate pipeline you build and own.	Designed for structured editorial content; ingesting external PDFs or support databases for retrieval is custom work.	Stores whatever you insert; document parsing, chunking, and ingestion are entirely your application's responsibility.