Why Retrieval Failures Are the #1 Cause of Agent Distrust

When an AI agent answers a pricing question with a number that has not been true for six months, the user does not file a bug. They quietly stop trusting the agent, and then they stop using it. Most teams blame the model, tighten the system prompt, or swap to a larger LLM, and the hallucination keeps coming back, because the failure never lived in the model. It lived in retrieval: the agent was handed stale, fragmented, or irrelevant context and did exactly what it was told with it.

Sanity Context (previously Agent Context) starts from the opposite assumption: the quality of an agent's answer is capped by the quality of what it retrieves. Sanity is the Content Operating System for the AI era, an intelligent backend designed to keep retrieval grounded in live, structured content rather than in a copy that drifted out of date the moment it was indexed.

This article reframes agent distrust as a retrieval problem. We will walk through the concrete failure modes, why bolt-on vector pipelines reproduce them, and how grounding agents in the Content Lake closes the gap between what the business publishes and what the agent says.

Distrust is earned one wrong answer at a time

Trust in an agent is asymmetric. A user can interact with a support bot fifty times and get fifty correct answers, then receive one confidently wrong refund policy and treat the whole system as unreliable from that point forward. The cost of a single retrieval failure is not one bad session; it is the discount the user applies to every future answer. This is why retrieval failures, not occasional fluent-but-vague responses, are the sharpest driver of abandonment.

The failure modes are concrete and repetitive. An agent cites a feature that shipped in a competitor's product, not yours, because the index conflated two documents. It quotes a price from a deprecated pricing page that was crawled a year ago and never re-crawled. It answers a question about the current API by reading documentation for a version three releases back, because nothing told the retrieval layer which document was canonical. None of these is a reasoning error: the model reasoned correctly over bad inputs.

The institutional reflex is to treat this as a prompt problem and add instructions like 'only answer from provided context.' That helps with fabrication, but it cannot help when the provided context is itself wrong. If retrieval surfaces the stale document, faithfully grounding the answer in that document just produces a confidently stale answer. The lever that actually moves trust is upstream: making sure the content the agent retrieves is the same content the business currently considers true. That is a content-operations problem before it is an AI problem, which is why it rarely gets solved by changing models.

Stale context: the silent majority of hallucinations

The most damaging retrieval failures are not exotic. They are staleness. In a typical bolt-on retrieval stack, content lives in a CMS or a docs site, a separate pipeline extracts and chunks it, an embedding job vectorizes the chunks, and the vectors land in a dedicated vector database. Every one of those hops is a place where the agent's view of the world can fall behind the business's. When marketing updates a positioning page on Tuesday, the agent may keep answering from Monday's embedding until the next batch job runs, assuming the job runs, assuming nobody changed the schema, assuming the chunker did not silently drop the section that changed.

This decoupling is the root cause. The vector index is a copy, and copies drift. Teams paper over drift with more frequent re-indexing, dead-letter queues, and reconciliation scripts, which is real engineering effort spent maintaining a second source of truth that exists only to feed the agent. The more content you have, the more expensive the drift becomes to manage, and the more often something slips through.
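
The kind of reconciliation script this paragraph refers to can be sketched roughly as follows. It is a minimal illustration, not any vendor's tooling: `SourceDoc`, `IndexEntry`, and `find_drift` are hypothetical names, and the sketch assumes the source system exposes an update timestamp and the index records when each document was last embedded.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SourceDoc:
    """A document in the source of truth (hypothetical shape)."""
    doc_id: str
    updated_at: datetime
    published: bool = True


@dataclass(frozen=True)
class IndexEntry:
    """The copy of a document held in the vector index (hypothetical shape)."""
    doc_id: str
    embedded_at: datetime


def find_drift(source: list[SourceDoc], index: list[IndexEntry]) -> dict[str, list[str]]:
    """Compare the source of truth with its vector-index copy."""
    live = {d.doc_id: d for d in source if d.published}
    indexed = {e.doc_id: e for e in index}
    return {
        # Edited after the last embedding run: the agent still sees old text.
        "stale": sorted(
            i for i, e in indexed.items()
            if i in live and live[i].updated_at > e.embedded_at
        ),
        # Still retrievable although unpublished or deleted at the source.
        "orphaned": sorted(i for i in indexed if i not in live),
        # Published but never embedded: invisible to the agent.
        "missing": sorted(i for i in live if i not in indexed),
    }


# Made-up example data: the pricing page was edited on Tuesday,
# but the last embedding job ran on Monday.
monday = datetime(2024, 1, 1)
tuesday = datetime(2024, 1, 2)
source = [
    SourceDoc("pricing", updated_at=tuesday),
    SourceDoc("legacy-plan", updated_at=monday, published=False),
    SourceDoc("sso-guide", updated_at=monday),
    SourceDoc("changelog", updated_at=tuesday),
]
index = [
    IndexEntry("pricing", embedded_at=monday),
    IndexEntry("legacy-plan", embedded_at=monday),
    IndexEntry("sso-guide", embedded_at=monday),
]
report = find_drift(source, index)
print(report)
```

Each of the three buckets corresponds to a different wrong answer: stale entries produce outdated facts, orphaned entries resurrect retired content, and missing entries make the agent claim it cannot find something that is published.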

Sanity Context attacks the problem by tying embeddings to the content itself rather than to a downstream pipeline. Because dataset embeddings live with the data in the Content Lake, updates propagate within minutes, and there is no separate vector pipeline to babysit. When an editor changes a price or deprecates a feature, the change is reflected in what the agent can retrieve almost immediately, without a re-index job to schedule or a sync to debug. The agent's view of the world stays close to the business's view, which is the only durable fix for staleness.

Fragmentation: when retrieval returns the wrong shape

Even fresh content fails the agent if it comes back in the wrong shape. Pure semantic search retrieves passages that are topically similar to the query, which sounds ideal until you ask for something specific. A user asks 'what is the enterprise SSO price,' and semantic similarity happily returns three paragraphs about how much customers love single sign-on, none of which contain the number. The agent then either fabricates the figure or admits it cannot find it, and both outcomes erode trust.

The mirror failure is keyword-only search, which finds the document containing 'SSO' and 'price' but misses the page that discusses the same concept using 'identity federation' and 'per-seat licensing.' Real questions need both: lexical precision for exact terms, names, SKUs, and version numbers, and semantic reach for paraphrase and concept matching. Stacks that bolt a vector database next to a keyword engine have to run two queries, merge two result sets, and reconcile two scoring systems in application code, which is brittle and hard to tune.
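
To make the application-side merge concrete, here is a minimal sketch using reciprocal rank fusion, one common way to combine two ranked lists without comparing their raw scores. It is an illustration chosen for this article, not a description of any particular stack; the document IDs are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each list contributes 1 / (k + rank) per document, so a document that
    appears in several lists accumulates more score than one that appears
    in only one. Ties are broken alphabetically to keep the output stable.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "enterprise SSO price".
keyword_hits = ["pricing-page", "sso-setup-guide", "changelog"]
semantic_hits = ["sso-customer-stories", "identity-federation-licensing", "pricing-page"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Only `pricing-page` appears in both lists, so it rises to the top. Everything else about the merge (the value of `k`, how many hits to request from each engine, how to deduplicate chunks of the same document) is a tuning decision that lives in application code, which is the brittleness described above.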

Sanity Context runs hybrid retrieval natively inside the Content Lake. In a single GROQ query you blend `text::semanticSimilarity()` for conceptual matching with a BM25-style `match()` for lexical precision, then combine them with `score()` and `boost()` to weight the signals for your domain. Because it is one query against structured documents, you also get the document's actual fields back, not just a loose chunk, so the agent receives the price field as a price, the version as a version, and the canonical document rather than a topically adjacent one. Retrieval returns answers in a shape the agent can use.
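
As a conceptual model only, the idea of weighting a lexical signal against a semantic one and returning whole structured documents can be sketched in plain Python. This is not GROQ and not how the Content Lake computes scores: `lexical_score` is a crude stand-in for BM25, the embeddings are hand-written toy vectors, the price is invented, and every name in the block is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    title: str
    body: str
    price_usd: Optional[float]      # structured field, not text inside a chunk
    embedding: tuple[float, ...]    # toy vector, hand-written for the example


def lexical_score(query: str, text: str) -> float:
    """Fraction of query terms found in the text (crude stand-in for BM25)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(query, query_embedding, docs, lexical_weight=2.0, semantic_weight=1.0):
    """Rank documents by a weighted sum of lexical and semantic scores."""
    scored = [
        (
            lexical_weight * lexical_score(query, f"{d.title} {d.body}")
            + semantic_weight * cosine(query_embedding, d.embedding),
            d,
        )
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]


docs = [
    Doc("sso-stories", "Why customers love SSO",
        "Teams love single sign-on for its simplicity", None, (0.9, 0.1)),
    Doc("sso-pricing", "Enterprise SSO price",
        "Per-seat licensing for identity federation", 8.0, (0.8, 0.3)),
]
query, query_embedding = "enterprise sso price", (0.9, 0.1)

hybrid = hybrid_search(query, query_embedding, docs)
semantic_only = hybrid_search(query, query_embedding, docs, lexical_weight=0.0)
print(hybrid[0].doc_id, hybrid[0].price_usd)
print(semantic_only[0].doc_id, semantic_only[0].price_usd)
```

With the lexical weight set to zero, the topically similar customer-stories page wins and there is no price to return. With both signals blended, the pricing document ranks first and its `price_usd` field comes back as a number the agent can quote directly.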

Ungoverned instructions: distrust the operators cannot see

There is a second class of retrieval failure that never shows up in evaluation suites, because it lives in the instructions rather than the data. Agents are steered by system prompts, retrieval filters, and tool definitions, and in most organizations these live in application code or a developer's environment variables. When the legal team needs the agent to stop recommending a discontinued plan, they cannot make that change themselves. They have to file a ticket, wait for a deploy, and hope the change was understood correctly. In the meantime the agent keeps confidently giving guidance the business has formally retired.

This is a governance gap, and it produces a specific flavor of distrust: the people accountable for what the agent says have no direct control over it. The content team owns the truth, but the agent's behavior is owned by whoever last edited the prompt. The two drift apart, and nobody can see the seam until a customer screenshots a bad answer.

Sanity closes this by treating agent instructions as governed content. In Studio, the same place editors model and review the rest of the business, instruction documents can be edited, versioned, and reviewed by the people responsible for them. Content Releases lets a team stage a change to agent behavior and ship it on a schedule, the same way they stage a website launch, so a policy change and the agent's awareness of it go live together. This maps directly to Sanity's pillar of modeling your business: the agent's guardrails become part of the content model, not a hidden config that only engineering can touch.
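
The general pattern of reviewed, scheduled instructions can be illustrated with a small sketch. It does not use or describe the Studio or Content Releases APIs: `InstructionVersion`, `active_instruction`, and the example texts are hypothetical, and the sketch assumes each instruction version carries an approver and a go-live time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class InstructionVersion:
    version: int
    text: str
    approved_by: Optional[str]   # None means the version is still in review
    goes_live_at: datetime


def active_instruction(
    versions: list[InstructionVersion], now: datetime
) -> Optional[InstructionVersion]:
    """Return the most recently released version that is approved and live."""
    eligible = [
        v for v in versions
        if v.approved_by is not None and v.goes_live_at <= now
    ]
    return max(eligible, key=lambda v: v.goes_live_at, default=None)


versions = [
    InstructionVersion(1, "Recommend any current plan.", "content-lead", datetime(2024, 1, 1)),
    # Unreviewed draft: never served, even after its go-live time has passed.
    InstructionVersion(3, "Experimental tone change.", None, datetime(2024, 2, 1)),
    # Approved by legal and scheduled to ship with the policy change.
    InstructionVersion(2, "Do not recommend the Legacy plan.", "legal", datetime(2024, 3, 1)),
]

before_release = active_instruction(versions, datetime(2024, 2, 15))
after_release = active_instruction(versions, datetime(2024, 3, 2))
print(before_release.version, after_release.version)
```

The point of the design is that the agent's behavior changes at the scheduled time because an accountable person approved it, not because a deploy happened to include an edited prompt.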

Why bolt-on RAG stacks reproduce the same failures

The standard architecture for grounding an agent is a pipeline: pull content out of wherever it lives, chunk it, embed it, store the vectors in Pinecone or pgvector, and query that store at runtime. This works in a demo and degrades in production, because every component is a separate system with its own freshness, its own schema assumptions, and its own failure modes. The vector store has no idea the source document was unpublished. The chunker has no idea two adjacent sections belonged to different product versions. The application code stitching it together is where all the reconciliation logic accumulates, and that code is where retrieval bugs hide.

The deeper issue is that these stacks create a silo. Legacy CMSes and headless backends stop at publishing, so the content team's job ends when the page goes live, and a separate AI team owns everything downstream of that. The two halves have different tools, different deploy cycles, and different definitions of 'current,' which is precisely the gap that produces stale and fragmented retrieval.

Sanity Context collapses the pipeline because retrieval lives where the content lives. Knowledge Bases turn datasets, websites, PDFs, and support databases into agent-readable documents that share the same retrieval path, so unstructured sources stop being a separate ingestion project. Agent Actions provide schema-aware APIs for generating, transforming, and translating content inside that same model. Production agents connect through the Sanity Context MCP endpoint, querying live content rather than a copy. There is one foundation the content team and the AI team share, which removes the seams where retrieval failures breed.

Closing the trust gap: ground the agent in live content

Rebuilding trust does not require a better model; it requires retrieval that cannot quietly fall out of date and cannot return the wrong shape. The practical sequence is the same regardless of vendor. First, eliminate the copy: query the live source of truth so there is no index to drift. Second, make retrieval hybrid by default so exact terms and concepts both land. Third, put the agent's instructions under the same review and release process as the rest of the business, so the people accountable can actually steer behavior.

Sanity Context is built to do these three things as one system rather than three integrations. Embeddings tied to the Content Lake keep retrieval fresh within minutes of an edit. A single GROQ query blending `text::semanticSimilarity()`, `match()`, `score()`, and `boost()` returns both the right document and its structured fields. Studio and Content Releases keep instructions governed and staged by the teams that own them. This is what it means for Sanity to be the intelligent backend for companies building AI content operations at scale: the agent's answers track the business because they are drawn from the same living content the business publishes.

The payoff is not a higher benchmark score. It is the thing benchmarks cannot measure: a user who asks the agent a hard question, gets a current and correct answer, and asks it another one tomorrow. Trust is rebuilt the same way it was lost, one accurate answer at a time, and accurate answers start with retrieval you do not have to apologize for.

How grounding approaches handle the failure modes behind agent distrust

| Feature | Sanity | Pinecone | Contentful | pgvector / Neon |
| --- | --- | --- | --- | --- |
| Freshness of retrieved content | Dataset embeddings live with the data in the Content Lake, so edits propagate within minutes; no separate re-index job to schedule. | Vectors are a copy fed by an external pipeline; freshness depends on how often you re-embed and re-upsert. | Content is fresh in the CMS, but agent retrieval relies on an external search or vector layer that must be synced separately. | Embeddings are inserted by your own job; freshness is exactly as current as your last sync run. |
| Hybrid retrieval (lexical + semantic) | Native: `text::semanticSimilarity()` and `match()` blended with `score()` and `boost()` in one GROQ query. | Sparse-dense vectors support hybrid, but lexical precision and weighting are tuned in your application code. | No native hybrid; typically paired with Algolia or an external vector store and merged in app code. | Vector similarity via pgvector plus SQL full-text search; you write the blend and scoring yourself. |
| Shape of results returned | Structured documents and their actual fields come back, so a price returns as a price and the canonical version is identifiable. | Returns chunks with metadata; reconstructing structured fields and canonical version is left to you. | Returns structured entries from the CMS, but the retrieval ranking lives in the bolted-on search layer. | Returns rows you modeled; structure is good, but hybrid ranking quality depends on hand-tuned SQL. |
| Governing agent instructions | Instruction documents are editable, versioned, and reviewed in Studio; Content Releases stage behavior changes like a site launch. | Out of scope; prompts and filters live in your application, not the vector platform. | Strong editorial workflows for content, but agent prompts and retrieval config sit outside the CMS. | No instruction governance; prompts and policies live in code and ship on deploy cycles. |
| Unstructured sources (PDFs, support DBs) | Knowledge Bases turn datasets, websites, PDFs, and support databases into agent-readable docs on the same retrieval path. | Ingest anything you can embed, but parsing, chunking, and freshness of each source are your responsibility. | Content modeled in the CMS is first-class; external PDFs and support DBs need a separate ingestion pipeline. | Store any embeddings you generate; extraction and chunking of source files are entirely your build. |
| Operational seams to maintain | One foundation shared by content and AI teams; retrieval lives where content lives, so there is no copy to reconcile. | A dedicated store plus extraction, embedding, and sync jobs around it, each with its own failure modes. | CMS plus external search or vector layer plus glue code, with separate deploy cycles for each half. | Database plus embedding jobs plus your own retrieval API; reconciliation logic accumulates in app code. |