How to Trace Agent Failures to the Retrieval Layer
An agent confidently tells a customer that your product ships with a feature it deprecated two releases ago. The transcript looks fine. The model behaved. The prompt was reasonable. So the team spends a week tuning system instructions and temperature, and the bad answer keeps coming back. The problem was never the model. It was the chunk of stale documentation the retrieval layer handed the model as ground truth.
Retrieval bugs hide because most observability stops at the model boundary. Tools like Sanity Context (Sanity's agent-facing product, surfaced via Context MCP) make the retrieval path inspectable, but the discipline of tracing it is what actually fixes the answer.
Most agent failures that read like hallucinations are retrieval failures wearing a costume. The generation step faithfully summarized the wrong context, the missing context, or the contradictory context it was given. If you only inspect the final answer, you will misdiagnose the cause every time, and your remediation will land in the wrong layer of the stack.
This guide reframes agent debugging as a retrieval-tracing discipline. We will walk the failure backward from the answer to the retrieved context to the query to the index to the underlying content, show what to instrument at each hop, and explain why a retrieval path that lives inside your content backend collapses most of these failure classes before they reach the model.
Why most hallucinations are actually retrieval failures
When an agent produces a wrong answer, the instinct is to blame the model. It is the visible, expensive, mysterious component, so it absorbs the suspicion. But in a retrieval-augmented system, the model is downstream of a longer chain, and the answer is only as good as the context it was handed. A precise, well-behaved model summarizing stale or irrelevant content will produce a confident, well-formed, and completely wrong answer. That is not a reasoning failure. It is a sourcing failure.
Consider a support agent asked whether a plan includes single sign-on. The model answers yes, because the retrieval step surfaced a marketing page from eighteen months ago describing a roadmap commitment, not the current pricing matrix. The generation was faithful. The retrieval was wrong: wrong recency, wrong document type, wrong authority. No amount of prompt engineering fixes a context window that contains the wrong facts.
The failure classes worth separating are these: missing context, where the relevant document existed but was not retrieved; stale context, where an outdated version was retrieved instead of the current one; contradictory context, where two retrieved chunks disagree and the model picks the wrong one; and diluted context, where the right answer was retrieved but buried among low-relevance neighbors. Each has a distinct fingerprint, and each lives in a different part of the retrieval path. If your debugging tooling cannot tell these apart, every failure looks like the same generic hallucination, and you will keep reaching for the one lever (the prompt) that cannot move any of them.

Instrument the retrieval path, not just the final answer
You cannot trace a failure through a chain you did not record. The single most common reason teams misattribute agent failures is that they log the prompt and the completion, and nothing in between. The retrieval step, the part that actually determined the answer, is a black box. To make failures traceable, you need to capture each hop as a structured, queryable record.
At minimum, log the resolved query (after any rewriting or expansion), the candidate set the retriever returned with their relevance scores, which candidates survived into the final context window, and the identifier and version of each underlying document. With that record in hand, a wrong answer becomes a tractable investigation rather than a guessing game. You can ask: was the correct document in the candidate set at all? If not, the failure is upstream, in indexing or query construction. If it was retrieved but ranked low, the failure is in scoring. If it was retrieved and ranked well but contradicted by a stale neighbor, the failure is in content freshness.
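As a rough illustration of the minimum record described above, the sketch below models one retrieval hop as a structured log entry. The class names, field names, and document identifiers are hypothetical, chosen for this example only; they are not part of any Sanity API or logging library.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Candidate:
    doc_id: str        # identifier of the underlying document (hypothetical naming)
    doc_version: str   # revision of that document at retrieval time
    score: float       # relevance score assigned by the retriever
    in_context: bool   # True if the candidate reached the final context window


@dataclass
class RetrievalTrace:
    trace_id: str
    resolved_query: str  # the query after any rewriting or expansion
    candidates: list[Candidate] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the hop as one structured, queryable log line."""
        return json.dumps(asdict(self), sort_keys=True)

    def was_retrieved(self, doc_id: str) -> bool:
        """First diagnostic question: was the document a candidate at all?"""
        return any(c.doc_id == doc_id for c in self.candidates)


# Example record for the single sign-on scenario (all values invented).
trace = RetrievalTrace(
    trace_id="t-001",
    resolved_query="does the team plan include single sign-on",
    candidates=[
        Candidate("marketing-roadmap", "rev-3", 0.82, True),
        Candidate("pricing-matrix", "rev-17", 0.61, False),
    ],
)
log_line = trace.to_json()
```

Storing the version next to the identifier is the design choice that matters here: it lets a later investigation ask which revision the agent saw, not merely which document.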
This is where the shape of your retrieval store matters enormously. If embeddings live in a separate vector database, disconnected from the content that produced them, your trace stops at an opaque vector ID. You see that chunk 4f9a scored 0.82, but reconstructing which document version that chunk came from, and whether it is still current, means joining across systems that were never designed to be joined. A retrieval path that runs inside the content backend keeps every hop addressable: the candidate, its score, its source document, and that document's edit history are all the same query away. Tracing becomes a property of the architecture rather than a forensics project you assemble after every incident.
The freshness trap: stale embeddings as a silent failure source
Stale context is the failure mode that hides the longest, because the system reports perfect health while serving wrong answers. The content team updates a document. The website reflects the change within minutes. The agent keeps citing the old version for days or weeks, because the embeddings that power retrieval were generated by a separate pipeline that has not re-run yet. Nothing errors. Nothing alerts. The agent is simply answering from a snapshot of reality that no longer exists.
This is a structural problem, not an operational one. In the conventional architecture, embeddings are a derived artifact maintained out of band: content changes in one system, an ETL job notices (or does not), re-embeds the changed records, and upserts them into a vector store. Every link in that chain is a place where freshness can silently lag. The wider the gap between content authoring and embedding generation, the longer your agent serves answers from the past, and the harder it is to even notice, because the failure is invisible until a customer hits the specific stale fact.
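In an out-of-band pipeline like the one described above, one way to make the lag visible is to compare when each document was last edited with when its embedding was last generated. The following is a minimal sketch under assumptions: both timestamp maps, the function name, and the five-minute grace period are hypothetical, and a real pipeline would need to export these timestamps from its own content and vector stores.

```python
from datetime import datetime, timedelta, timezone


def find_stale_embeddings(content_updated_at, embedded_at, now,
                          grace=timedelta(minutes=5)):
    """Return {doc_id: lag} for documents whose embedding predates the last edit.

    A document counts as stale when its embedding is missing or older than
    the latest edit, and that edit happened longer ago than the grace period.
    """
    stale = {}
    for doc_id, updated in content_updated_at.items():
        embedded = embedded_at.get(doc_id)
        if embedded is not None and embedded >= updated:
            continue  # the embedding reflects the latest edit
        lag = now - updated  # how long the agent has been serving the old version
        if lag > grace:
            stale[doc_id] = lag
    return stale


# Invented timestamps for illustration.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
content_updated_at = {
    "pricing-matrix": now - timedelta(days=3),
    "sso-guide": now - timedelta(days=30),
    "changelog": now - timedelta(minutes=2),
}
embedded_at = {
    "pricing-matrix": now - timedelta(days=9),  # embedded before the last edit
    "sso-guide": now - timedelta(days=29),      # embedded after the last edit
}
stale = find_stale_embeddings(content_updated_at, embedded_at, now)
```

A check like this only turns the silent failure into an alert. It does not remove the gap between the two update clocks.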
The architectural fix is to tie embeddings to the content itself, so that updating a document propagates to its retrievable representation as a property of the edit, not as a downstream job someone has to schedule and monitor. With Sanity's dataset embeddings, the embedding is bound to the content in the Content Lake, so an edit propagates to retrieval within minutes and there is no separate vector pipeline to fall behind. That closes the most common stale-context gap by construction: the thing the editor changed and the thing the agent retrieves are the same versioned object, not two copies drifting apart on different update clocks.
Reproduce the failure: replay the query, inspect the candidates
A failure you cannot reproduce is a failure you cannot fix. The goal of retrieval tracing is to take a single bad transcript and replay its retrieval step deterministically, so you can see exactly what the model saw. With the candidate set and scores captured, reproduction becomes mechanical: re-run the same resolved query against the same index, compare the candidate set to what was logged, and isolate which hop diverged from the desired behavior.
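The comparison step can be sketched as a diff between the logged candidate list and the replayed one. This assumes both are available as ordered lists of document identifiers, best match first; the function name and the shape of the result are hypothetical.

```python
def diff_candidates(logged, replayed):
    """Compare two ranked lists of document ids (best match first).

    Returns which ids disappeared, which are new, and which changed rank
    between the logged retrieval and the replay.
    """
    logged_rank = {doc_id: i for i, doc_id in enumerate(logged)}
    replayed_rank = {doc_id: i for i, doc_id in enumerate(replayed)}
    return {
        "dropped": [d for d in logged if d not in replayed_rank],
        "added": [d for d in replayed if d not in logged_rank],
        "moved": {
            d: (logged_rank[d], replayed_rank[d])
            for d in logged
            if d in replayed_rank and logged_rank[d] != replayed_rank[d]
        },
    }


# Invented example: the index changed between the incident and the replay.
result = diff_candidates(
    logged=["marketing-roadmap", "pricing-matrix", "faq"],
    replayed=["pricing-matrix", "marketing-roadmap", "sso-guide"],
)
```

An empty diff means the replay reproduced what the model saw. A non-empty diff says the index or the scoring has changed since the incident, which is itself a finding worth recording.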
The diagnostic questions form a decision tree. First: was the correct document in the candidate set? If no, your recall is broken: the document is missing from the index, or the query failed to express the user's intent, or your retrieval is purely lexical and missed a semantic paraphrase. If yes but ranked too low to survive into the context window, your scoring is the problem: relevant content was retrieved but out-competed by noise. If yes and ranked well but the answer was still wrong, you are looking at contradictory or stale context, and the investigation moves to document versions.
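The decision tree above is small enough to write down directly. The sketch below assumes each logged candidate records whether it survived into the context window; the class, function, and return labels are hypothetical names for the three branches.

```python
from dataclasses import dataclass


@dataclass
class LoggedCandidate:
    doc_id: str
    in_context: bool  # True if the candidate survived into the context window


def diagnose(candidates, expected_doc_id):
    """Map one wrong answer to the retrieval hop that most likely failed."""
    hit = next((c for c in candidates if c.doc_id == expected_doc_id), None)
    if hit is None:
        # The correct document never appeared: indexing or query construction.
        return "recall"
    if not hit.in_context:
        # Retrieved but out-competed before reaching the model.
        return "scoring"
    # Retrieved and in context, yet the answer was wrong:
    # look for stale or contradictory neighbors in the document versions.
    return "versions"


# Invented example: the pricing matrix was retrieved but cut from the context.
seen = [
    LoggedCandidate("marketing-roadmap", True),
    LoggedCandidate("pricing-matrix", False),
]
verdict = diagnose(seen, "pricing-matrix")
```

Running this over a batch of bad transcripts gives a rough count per branch, which is usually enough to decide whether to spend the week on indexing, scoring, or content.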
This is where a query language that returns content, scores, and metadata in one expressive call earns its keep. GROQ lets you express the retrieval and the diagnostic in the same language: you can run the exact production query, then project the source document, its last-updated timestamp, and its workflow state alongside each scored candidate. You are not exporting vector IDs to a spreadsheet and joining them back to a CMS by hand. The candidate, its relevance score, and the provenance you need to judge whether that candidate should ever have ranked are all addressable in a single query against the same store the agent uses in production.
Govern what the agent retrieves, and stage changes before they ship
Tracing tells you a failure happened. Governance keeps it from happening again, and from happening to your customers first. Many retrieval failures are not bugs in the retriever; they are content problems. An unfinished draft was indexed. A deprecated page was never unpublished. Two documents assert contradictory facts because nobody owns reconciling them. These are editorial failures that surface as agent failures, and they are best fixed where content is governed, not patched in application code after an incident.
The organizations that keep agents reliable treat agent-facing content the way they treat their website: with workflow, ownership, review, and staged rollout. A change to the knowledge the agent retrieves should be reviewable, attributable, and testable against the agent before it reaches production, exactly as a content change is reviewed before it reaches the homepage. When the people who own the facts can also see and stage how those facts read to an agent, the contradictory-context and stale-context failure classes shrink dramatically, because they are caught in review rather than in a transcript.
This is the lens that distinguishes Sanity as the Content Operating System for the AI era rather than a content store with retrieval bolted on. In the Studio, editors govern the documents and the agent instructions in the same place they govern the website, and Content Releases let a team stage a batch of content changes and preview how the agent behaves against them before anything goes live. Agent Actions provide schema-aware APIs for generating, transforming, and translating that content under the same model. Retrieval, the content it draws on, and the workflow that governs both share one foundation instead of fragmenting across a vector database, a CMS, and a prompt-management tool that know nothing about each other.
Where the retrieval path lives determines how traceable it is
Step back from any single failure and the pattern is architectural. How traceable your agent failures are is decided long before the failure occurs, by where the retrieval path lives and how many seams it crosses. A stack assembled from a vector database, a separate content backend, an external search service, and a prompt manager has a seam at every boundary, and every seam is a place where provenance is lost, freshness lags, and a trace goes cold. You can debug it, but only by re-joining systems that were never designed to share a record.
The alternative is a retrieval path that runs inside the content backend, where the embedding, the lexical index, the source document, its version history, and the workflow that governs it are facets of one object rather than copies scattered across services. Hybrid retrieval becomes native rather than assembled: in a single GROQ query you blend semantic similarity from text::semanticSimilarity() with a lexical match() and combine them with score() and boost(), then project the provenance you need to trace the result. Sanity Context exposes this path to production agents through its MCP endpoint, and its Knowledge Bases turn datasets, websites, PDFs, and support databases into agent-readable documents that share that same retrieval path.
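To make the idea of blending semantic and lexical signals concrete, here is a language-neutral sketch in Python. It is not GROQ, and it does not claim to reproduce how score() or boost() compute their results; the weights, field names, and the assumption that both signals are already normalized to the range 0 to 1 are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    doc_id: str
    updated_at: str   # provenance carried alongside the score (hypothetical field)
    semantic: float   # assumed semantic similarity in [0, 1]
    lexical: float    # assumed lexical match strength in [0, 1]


def hybrid_rank(docs, semantic_weight=0.7, lexical_weight=0.3):
    """Rank documents by a weighted sum of both signals, keeping provenance."""
    ranked = [
        {
            "doc_id": d.doc_id,
            "updated_at": d.updated_at,
            "semantic": d.semantic,
            "lexical": d.lexical,
            "score": semantic_weight * d.semantic + lexical_weight * d.lexical,
        }
        for d in docs
    ]
    ranked.sort(key=lambda r: r["score"], reverse=True)
    return ranked


# Invented example: an exact lexical hit outranks a looser semantic match.
ranked = hybrid_rank([
    ScoredDoc("marketing-roadmap", "2023-01-15", semantic=0.9, lexical=0.0),
    ScoredDoc("pricing-matrix", "2024-06-02", semantic=0.6, lexical=1.0),
])
```

Keeping the component scores and the timestamp in each result row is what makes the ranking traceable: when a candidate ranks where it should not, the row shows which signal put it there and how old the document was.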
The payoff is not a cleverer debugger. It is fewer failures to debug, and a shorter path from symptom to cause when they do occur. When retrieval, content, and governance share one queryable foundation, missing context is a recall query away from diagnosis, stale context is closed by construction, contradictory context is caught in review, and every candidate the agent ever saw carries its own provenance. Tracing a failure to the retrieval layer stops being archaeology and becomes a query.