Why Most Agent Eval Frameworks Miss the Retrieval Failure Mode

An agent answers a question about your refund policy with confidence, fluency, and the wrong number. The eval suite that was supposed to catch this gives the response a passing score, because the answer reads well, matches the expected tone, and contains no obvious contradiction. Three weeks later a support ticket reveals the policy it cited was deprecated in a release the agent never saw. Nothing in the eval flagged it, because nothing in the eval looked at what the agent retrieved before it generated.

Tools like Sanity Context expose the retrieval path directly, through GROQ queries and schema reads via Context MCP, which makes the gap more visible: most eval frameworks never inspect that path at all.

This is the blind spot at the center of most agent evaluation frameworks. They score the output, the final string the model produced, and treat retrieval as an upstream black box that either worked or did not. The failure modes that actually break production agents (stale context, wrong-chunk retrieval, and confidently worded answers grounded in the wrong source) slip through because the metrics were never designed to see them.

This article argues that retrieval failure is a distinct, measurable failure mode, that output-centric evals systematically under-detect it, and that fixing it starts by treating the retrieval path and the content behind it as first-class evaluation targets rather than infrastructure you assume is correct.

The output-scoring trap

Most eval frameworks inherited their shape from chatbot evaluation, where the unit of analysis is the final response. You collect a set of questions, define expected answers or rubrics, run the agent, and grade what comes back with an LLM judge or a similarity metric. This works well for testing whether a model can reason, format, or follow instructions. It works poorly for testing whether a model was given the right facts, because a graceful answer built on the wrong context looks identical to a graceful answer built on the right one.

The trap is that fluency masks grounding failure. An LLM judge asked "is this answer helpful and well-structured?" will happily approve a response that confidently states last quarter's pricing, because the answer is helpful and well-structured. The judge has no view of the source documents the agent actually pulled, so it cannot know the pricing is twelve weeks stale. The score says pass; the customer gets a wrong number.

This is not a tuning problem you solve with a stricter rubric. It is a structural gap. When the evaluation surface is the output string alone, an entire class of defects becomes invisible by construction: retrieving the wrong chunk, retrieving a correct-but-outdated chunk, or retrieving nothing and hallucinating to fill the gap. You cannot grade a failure your harness never observes. The fix is to make retrieval itself a thing you score, with its own ground truth, separate from whether the prose around it reads cleanly. Until the eval can answer "what did the agent retrieve, and was it the right and current source?", it is measuring the wrong layer of the system.

Why retrieval is its own failure mode

Treating retrieval as an upstream black box assumes the box is correct. In production it usually is not, and the ways it fails do not look like the ways generation fails. A generation failure produces a sentence that is malformed, off-topic, or internally contradictory, things an output judge can catch. A retrieval failure produces a sentence that is fluent, on-topic, and internally consistent, and wrong only because the facts feeding it were wrong. The model did its job; the context betrayed it.

There are at least three sub-modes worth separating. First, wrong-chunk retrieval: the search returns a passage that is topically near the query but semantically off, for example pulling the enterprise SLA when the user asked about the free tier. Second, stale retrieval: the right document is returned, but it reflects a version of the world that no longer holds because the policy changed, the price moved, or the feature shipped. Third, silent miss: nothing relevant is retrieved, and the model fills the vacuum with a plausible invention. Each of these is a distinct defect with a distinct fix, and an output-only metric collapses all three into a single undifferentiated "the answer was wrong."
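The three sub-modes can be told apart mechanically once the retrieved set is visible. The following is a minimal sketch, not a definitive implementation: the `RetrievedDoc` fields, the `min_score` relevance threshold, and the rule that an embedding older than the document's last edit counts as stale are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class RetrievalOutcome(Enum):
    OK = "ok"
    WRONG_CHUNK = "wrong_chunk"
    STALE = "stale"
    SILENT_MISS = "silent_miss"


@dataclass(frozen=True)
class RetrievedDoc:
    # Hypothetical record shape; adapt the fields to what your retriever returns.
    doc_id: str
    score: float                  # relevance score reported by the retriever
    content_updated_at: datetime  # last edit of the source document
    embedded_at: datetime         # when its embedding was computed


def classify_retrieval(
    retrieved: list[RetrievedDoc],
    expected_doc_id: str,
    min_score: float = 0.5,  # assumed cutoff below which a hit is treated as noise
) -> RetrievalOutcome:
    # Keep only hits the retriever considered relevant enough to use.
    relevant = [d for d in retrieved if d.score >= min_score]
    if not relevant:
        return RetrievalOutcome.SILENT_MISS
    hit = next((d for d in relevant if d.doc_id == expected_doc_id), None)
    if hit is None:
        return RetrievalOutcome.WRONG_CHUNK
    # Right document, but the embedding predates the latest edit.
    if hit.embedded_at < hit.content_updated_at:
        return RetrievalOutcome.STALE
    return RetrievalOutcome.OK
```

Reporting one of these outcomes per query, instead of a single "wrong" label, is what lets a failing run point at a specific fix.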

The practical consequence is that you cannot improve what you cannot localize. If your eval tells you the answer was wrong but not whether retrieval missed, retrieved stale, or retrieved fine while the model hallucinated anyway, you are debugging blind. Retrieval-aware evaluation means logging the retrieved set per query, scoring it against a known-correct source, and tracking freshness as a first-class signal. That requires a retrieval path you can actually inspect, and content underneath it that knows when it last changed.
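Logging the retrieved set per query can be as small as wrapping the agent's two steps and recording what passed between them. This is a sketch under stated assumptions: `retrieve` and `generate` are hypothetical callables standing in for your own retrieval and generation code, and the trace fields are illustrative.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class RetrievalTrace:
    query: str
    expected_id: str          # canonical source document from the labeled set
    retrieved_ids: list[str]  # what the agent actually saw, in rank order
    answer: str


def run_with_trace(
    query: str,
    expected_id: str,
    retrieve: Callable[[str], list[tuple[str, str]]],  # returns (doc_id, text) pairs
    generate: Callable[[str, list[str]], str],         # takes the query and passage texts
) -> RetrievalTrace:
    docs = retrieve(query)
    answer = generate(query, [text for _, text in docs])
    return RetrievalTrace(
        query=query,
        expected_id=expected_id,
        retrieved_ids=[doc_id for doc_id, _ in docs],
        answer=answer,
    )


def to_jsonl(trace: RetrievalTrace) -> str:
    # One JSON object per line keeps traces easy to append and to diff.
    return json.dumps(asdict(trace), sort_keys=True)
```

With a trace like this on disk, the retrieved set can be scored on its own, separately from the answer text.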

Stale context is the failure your eval will never see

Of the three sub-modes, staleness is the most insidious, because it is the one that passes every test you are likely to write. A wrong-chunk answer sometimes betrays itself by answering a slightly different question than the one asked, and a silent miss often produces hedging or invention an LLM judge can sometimes smell. A stale answer does neither. It comes from the correct document, retrieved correctly and generated cleanly, and it is simply out of date. Every signal your harness watches is green.

The root cause is architectural. In a typical retrieval-augmented stack, embeddings live in a separate vector store that was populated by a batch job at some point in the past. When the underlying content changes (the price updates, the policy is rewritten), the embeddings do not move until the next reindex run. The window between a content change and a reindex is a window where every retrieval is confidently stale, and no eval that runs against the vector store will detect it, because the vector store is internally consistent. It is consistent with yesterday.
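The size of that window is simple arithmetic once you know when content changed and when the reindex job ran. A minimal sketch, assuming a hypothetical list of reindex timestamps and a single edit time:

```python
import bisect
from datetime import datetime, timedelta
from typing import Optional


def stale_window(edited_at: datetime, reindex_runs: list[datetime]) -> Optional[timedelta]:
    """Time an edit stays invisible to retrieval, or None if no later reindex ran."""
    runs = sorted(reindex_runs)
    # Index of the first reindex run at or after the edit.
    i = bisect.bisect_left(runs, edited_at)
    if i == len(runs):
        return None  # still stale: nothing has reindexed since the edit
    return runs[i] - edited_at


# Example: a nightly 02:00 reindex and a policy edit at 09:00.
nightly = [datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 2, 2, 0)]
window = stale_window(datetime(2024, 1, 1, 9, 0), nightly)
print(window)  # 17:00:00
```

In this example every retrieval of that document is stale for seventeen hours, and an eval that only queries the vector store sees nothing wrong during any of them.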

This is where the architecture of your content backend stops being an infrastructure detail and becomes an evaluation property. When embeddings are tied to the content rather than maintained as a separate pipeline, an update to a document propagates to its embedding within minutes, and the drift window for staleness to hide in shrinks accordingly. Sanity Context takes this approach: dataset embeddings live with the content in the Content Lake, so a change in the Studio propagates without a separate vector pipeline to reindex. Staleness still deserves a freshness check, but the architecture narrows the window where it lives from the gap between reindex runs to a matter of minutes.

What a retrieval-aware eval actually measures

If output scoring is the wrong layer, what is the right one? A retrieval-aware eval instruments three things the output-only version ignores. The first is retrieval correctness: for each evaluation query, did the system return the passage you know to be the right source? This requires a labeled set that maps questions to canonical source documents, and it scores the retrieved set directly, independent of what the model later wrote. A perfect answer built on a wrong retrieval should fail this check even when the prose is flawless.
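Retrieval correctness reduces to a hit rate over the labeled set. The sketch below assumes a hypothetical `retrieve` callable that returns document IDs in rank order and a labeled mapping from question to canonical source ID; the cutoff `k` is an illustrative choice.

```python
from typing import Callable


def hit_at_k(retrieved_ids: list[str], expected_id: str, k: int) -> bool:
    # A hit means the canonical source appears among the top k results.
    return expected_id in retrieved_ids[:k]


def retrieval_hit_rate(
    labeled: dict[str, str],                # question -> canonical source doc ID
    retrieve: Callable[[str], list[str]],   # question -> ranked doc IDs
    k: int = 5,
) -> float:
    if not labeled:
        raise ValueError("labeled set must not be empty")
    hits = sum(hit_at_k(retrieve(q), expected, k) for q, expected in labeled.items())
    return hits / len(labeled)
```

Note that the generated answer never enters this function. A flawless answer built on the wrong document still counts as a miss here.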

The second is freshness: of the documents retrieved, how recently did each change relative to when its embedding was computed? A retrieval that returns the right document but an embedding older than the document's last edit is a staleness flag, surfaced before it reaches a customer. The third is retrieval coverage: when the answer is correct, can you trace each claim back to a retrieved passage, or is the model supplying facts from parametric memory that no source supports? Untraceable claims are hallucination risk even when they happen to be right today.
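Both checks can be sketched in a few lines. The `Passage` fields are assumptions about what your retrieval layer exposes, and the substring match in `untraceable_claims` is a deliberately crude stand-in for real claim attribution (which usually needs an entailment model or a judge); it is here only to show the shape of the check.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Passage:
    # Hypothetical shape of a retrieved passage plus its timestamps.
    doc_id: str
    text: str
    updated_at: datetime   # last edit of the source document
    embedded_at: datetime  # when its embedding was computed


def stale_ids(passages: list[Passage]) -> list[str]:
    # Freshness flag: the embedding is older than the document's last edit.
    return [p.doc_id for p in passages if p.embedded_at < p.updated_at]


def _normalize(text: str) -> str:
    return " ".join(text.lower().split())


def untraceable_claims(claims: list[str], passages: list[Passage]) -> list[str]:
    # Coverage check: a claim is traceable if some retrieved passage contains it.
    # Substring matching is an assumption made for brevity, not a recommendation.
    haystacks = [_normalize(p.text) for p in passages]
    return [c for c in claims if not any(_normalize(c) in h for h in haystacks)]
```

Any non-empty result from either function is a flag worth surfacing, even when the final answer happens to be correct.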

Measuring these requires a retrieval path you can query and inspect, not a black box. With GROQ you can express hybrid retrieval as a single query, blending `text::semanticSimilarity()` for meaning with a BM25 `match()` for exact terms, combined through `score()` and `boost()`, which means the same query language that powers retrieval in production is the one you evaluate against. Because retrieval is native inside the Content Lake rather than assembled across a separate search service, the evaluation harness reads the same store the agent reads, with the same freshness, so there is no second system to keep in sync and no drift between what you test and what ships.

Governing the content behind the retrieval

Retrieval-aware evaluation surfaces a deeper truth: most retrieval failures are content failures wearing a retrieval costume. A chunk is wrong because the source document was ambiguous. An answer is stale because no one owned the update. A silent miss happens because the relevant fact lives in a PDF no one ingested. You can tune your ranker forever and not fix any of these, because the defect is upstream of the embedding, in how the content is modeled, owned, and changed.

This reframes the agent reliability problem as a content operations problem. The question is not only "is retrieval accurate today?" but "who governs the source of truth the agent retrieves, how do changes to it get staged and reviewed, and can a wrong answer be traced to a specific document and a specific edit?" An eval that catches a stale answer is useful; an eval coupled to a workflow where the editor who owns that policy can stage the correction, review it, and ship it through the same path that powers the website is what actually closes the loop.

This is the institutional case for treating the agent's knowledge as a managed asset rather than a dump. Sanity is the Content Operating System for the AI era, the intelligent backend for companies building AI content operations at scale. That framing matters here because it means editors govern agent instructions and source content in the Studio, stage changes through Content Releases the way they stage a site launch, and turn datasets, PDFs, and support databases into agent-readable documents through Knowledge Bases that share the same retrieval path. The eval tells you the agent is wrong. The Content Operating System is where you make it right, and keep it right.

From passing evals to provable grounding

The endpoint of this argument is not a better LLM judge. It is a change in what you consider a passing run. A passing run is not one where the output reads well. It is one where the retrieved set was correct, current, and fully traceable to the claims the agent made, and where the path that served that retrieval is the same governed path that serves production. Output quality is necessary but it is the last thing to check, not the first.
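That ordering can be made explicit in the gate itself. A minimal sketch, assuming the three retrieval checks have already been computed upstream as booleans and that `output_score` comes from whatever judge you already run; the names and the 0.7 threshold are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunChecks:
    retrieval_correct: bool   # the canonical source was in the retrieved set
    retrieval_fresh: bool     # no retrieved embedding predates its document's last edit
    claims_traceable: bool    # every claim maps back to a retrieved passage
    output_score: float       # score from the output judge, in [0, 1]


def first_failure(checks: RunChecks, min_output_score: float = 0.7) -> Optional[str]:
    # Retrieval checks run first; output quality is evaluated last.
    if not checks.retrieval_correct:
        return "retrieval_correct"
    if not checks.retrieval_fresh:
        return "retrieval_fresh"
    if not checks.claims_traceable:
        return "claims_traceable"
    if checks.output_score < min_output_score:
        return "output_score"
    return None


def run_passes(checks: RunChecks, min_output_score: float = 0.7) -> bool:
    return first_failure(checks, min_output_score) is None
```

Returning the name of the first failed check, rather than a bare boolean, keeps the failure localized to a layer.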

In practice this means wiring your agents to a retrieval endpoint you can both serve and inspect. Production agents connect to Sanity Context through its MCP endpoint, which means the retrieval the agent performs at runtime and the retrieval your eval scores are the same operation against the same Content Lake. When a freshness flag fires or a coverage gap appears, the fix lives one step away, in the Studio where the content is owned, rather than three systems away in a reindex job no one scheduled.

The reframing is simple to state and hard to retrofit. Retrieval failure is a first-class failure mode, staleness is its quietest and most dangerous form, and no amount of output scoring will surface it because output scoring looks at the wrong layer. Frameworks that grade the string will keep passing confidently wrong answers. Frameworks that grade the retrieval, against fresh embeddings tied to governed content, will catch the failure that actually reaches your customers. The first step is to stop trusting the black box and start measuring what came out of it before the model ever opened its mouth.

Retrieval-aware evaluation across the AI content stack

Feature	Sanity	Pinecone	Contentful	pgvector / Neon
Hybrid retrieval in one query	Native: text::semanticSimilarity() and match() blended with score() and boost() in a single GROQ query.	Sparse-dense hybrid is supported, but query construction and weighting are assembled in your application code.	No native hybrid retrieval; you wire an external search or vector service through the App Framework.	Vector distance plus SQL full-text are both available, but blending and scoring are hand-built in SQL per query.
Embedding freshness on content change	Dataset embeddings are tied to content in the Content Lake, so an edit propagates within minutes, with no separate reindex.	Embeddings live in a separate index; freshness depends on a reindex job you schedule and operate yourself.	Content lives in Contentful, embeddings live elsewhere, so a content change requires a separate pipeline to re-embed.	Embeddings are rows you update; freshness is exactly as current as the job that writes them, which you own.
Eval reads the same store as production	Agents and the eval harness query the same Content Lake through GROQ, so tested retrieval matches shipped retrieval.	The vector index is its own store; keeping eval data in sync with the source content is your responsibility.	Source content and search index are separate systems, so eval and production can drift unless you reconcile them.	Eval and runtime can share one Postgres, though source-of-truth content often lives in a different system entirely.
Governed staging of source content	Editors stage and review changes through Content Releases in the Studio, the same path that ships the website.	A vector database stores vectors; content governance and review live in whatever system owns the source documents.	Strong editorial workflows for content, though the AI retrieval layer sits outside that governed publishing path.	No editorial layer; staging and review of source content are built in your application or a separate CMS.
Agent connection path	Production agents connect through the Sanity Context MCP endpoint, shaped to the product's retrieval path.	Connected through Pinecone client SDKs and APIs; MCP or agent wiring is assembled in your stack.	Connected through delivery APIs plus your chosen search layer; agent retrieval is glue you build and maintain.	Connected through standard Postgres drivers; agent retrieval logic and serving are entirely your code.
Ingesting PDFs and support data	Knowledge Bases turn datasets, websites, PDFs, and support databases into agent-readable docs on the same retrieval path.	Ingestion, chunking, and embedding of PDFs and support data are pipelines you build before anything reaches the index.	Non-CMS sources like PDFs and support databases need separate ingestion before they can be retrieved.	Any source can be loaded, but parsing, chunking, and embedding of PDFs and support data are all hand-built.