Improve accuracy of Pydantic-AI agents with Sanity Context

Your Pydantic-AI agent validates beautifully in the demo. `output_type=Recommendation` parses, the fields are typed, mypy is happy. Then it hits real traffic and the agent confidently recommends a product that was discontinued six months ago, cites a price that changed last quarter, and quotes a policy paragraph that no longer exists. The output is type-valid. It is also wrong. Pydantic-AI validated the shape of the answer, not the truth of it.

That gap is the whole problem. Type safety on the LLM boundary tells you the model returned a well-formed `Recommendation`. It says nothing about whether the data feeding that recommendation was current, complete, or even real. The model fills gaps with plausible inventions, and a plausible invention passes a Pydantic validator just fine.

Sanity Context, Sanity's agent-facing product, addresses the other half. Its Context MCP endpoint gives a Pydantic-AI agent something a type annotation never can: a read-only, queryable source of current content that the agent calls as a tool, instead of relying on whatever the model half-remembers from training.

This article is about closing the gap between a well-shaped answer and a correct one. We will cover how to make Pydantic-AI's structured outputs and dependency injection do real work, how to fail loudly when the model strays outside its grounding, and how to feed the agent typed, current content through a tool layer so the answers are not just well-shaped but actually right.

Structured outputs validate the shape, not the facts

Pydantic-AI's headline feature is that `output_type` runs your model's output through a real Pydantic validator. You define a model, the agent checks that the LLM's response parses into it, and it sends the model back for another attempt when validation fails. That is genuinely useful, and many frameworks do not give it to you.

The trap is reading more guarantee into it than exists. Consider a support agent that returns a structured answer with a citation. The validator confirms `answer` is a string and `source_doc_id` is a string. It cannot confirm that `source_doc_id` points at a document that exists, or that the answer actually follows from that document. If the model invents `doc_4417` because the prompt context was thin, you get a clean parse and a fabricated citation.
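To make the trap concrete, here is a minimal sketch using plain Pydantic (v2 API). The `SupportAnswer` model mirrors the one used later in this article; the payload and the `known_doc_ids` set are hypothetical stand-ins for a model response and a real document index.

```python
from pydantic import BaseModel


class SupportAnswer(BaseModel):
    answer: str
    source_doc_id: str


# Hypothetical model output: well-formed, but the citation is invented.
raw_output = {
    'answer': 'Returns are accepted within 30 days.',
    'source_doc_id': 'doc_4417',
}

# Hypothetical set of ids that actually exist in the content store.
known_doc_ids = {'doc_1001', 'doc_1002'}

# Validation succeeds: both fields are strings, which is all the model asks.
parsed = SupportAnswer.model_validate(raw_output)

# Only an explicit lookup reveals that the cited document does not exist.
citation_is_real = parsed.source_doc_id in known_doc_ids

print(parsed.source_doc_id, citation_is_real)
```

The type check and the existence check are separate questions, and only the first one comes for free.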

The fix starts with treating validation as a place to enforce business rules, not just types. Pydantic-AI lets you raise `ModelRetry` from a validator to send the model back with feedback. Use it. A validator that checks the cited ID against a known set turns a silent hallucination into a retry with a corrective message.

An output validator that rejects unknown citations

`ModelRetry` turns a fabricated citation into a corrective second attempt instead of a clean-but-wrong response.

from pydantic import BaseModel
from pydantic_ai import Agent, ModelRetry, RunContext

class SupportAnswer(BaseModel):
    answer: str
    source_doc_id: str

# Older releases spelled these `result_type` and `@agent.result_validator`.
agent = Agent(
    'anthropic:claude-sonnet-4-5',
    output_type=SupportAnswer,
    deps_type=set[str],  # the set of valid doc ids for this run
)

@agent.output_validator
def citation_must_exist(
    ctx: RunContext[set[str]], output: SupportAnswer
) -> SupportAnswer:
    # Reject any citation that was not in the retrieved context for this run.
    if output.source_doc_id not in ctx.deps:
        # ModelRetry sends this message back to the model for another attempt.
        raise ModelRetry(
            f'{output.source_doc_id} is not a real document. '
            'Cite only an id present in the retrieved context.'
        )
    return output

Dependency injection is where grounding belongs

Pydantic-AI's `deps_type` and `RunContext` are not decoration. They are the right place to inject the data and services an agent needs at call time, and they keep that wiring out of your prompt strings. A lot of teams discover this only after they have stuffed a database handle into a global and made the agent untestable.

The pattern that scales: define a dependencies dataclass that carries the live connections and the current request scope, pass it per run, and let your tools read it off `ctx.deps`. Now the same agent definition serves production and tests, because in a test you pass a fake. More importantly, your retrieval logic lives in tools that have typed access to real services, instead of relying on whatever happened to land in the system prompt.

This matters for grounding specifically. The reason agents hallucinate facts is that the facts were never in front of them. A tool with `ctx.deps` access to your content backend can fetch the real, current record at the moment the model asks for it, rather than depending on a static blob baked into the prompt at startup. The dependency injection layer is the seam where 'what the model knows' meets 'what is actually true right now', and Pydantic-AI gives you a typed handle on exactly that seam.

Typed dependencies wired into a tool

The tool reads its live HTTP client off `ctx.deps`, so the same agent runs against real and fake backends without code changes.

from dataclasses import dataclass
import httpx
from pydantic_ai import Agent, RunContext

@dataclass
class Deps:
    http: httpx.AsyncClient
    dataset: str

agent = Agent('openai:gpt-4o', deps_type=Deps)

@agent.tool
async def get_product(ctx: RunContext[Deps], slug: str) -> dict:
    """Fetch the current product record by slug."""
    resp = await ctx.deps.http.get(
        'https://api.example.com/products',
        params={'slug': slug, 'dataset': ctx.deps.dataset},
    )
    resp.raise_for_status()
    return resp.json()
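The claim that the same wiring serves production and tests can be sketched without the framework at all. This is a stdlib-only illustration, not Pydantic-AI's API: the `ProductSource` protocol, the `FakeProducts` class, and the sample record are hypothetical, and the `Deps` here carries an interface rather than the raw `httpx` client so a fake can stand in for it.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


class ProductSource(Protocol):
    """Anything that can fetch a product record by slug."""

    async def fetch(self, slug: str) -> dict: ...


@dataclass
class Deps:
    products: ProductSource
    dataset: str


class FakeProducts:
    """In-memory stand-in for the live backend, used only in tests."""

    def __init__(self, records: dict[str, dict]) -> None:
        self.records = records

    async def fetch(self, slug: str) -> dict:
        return self.records[slug]


async def get_product(deps: Deps, slug: str) -> dict:
    """The lookup a tool would delegate to; it only knows about deps."""
    record = await deps.products.fetch(slug)
    return {**record, 'dataset': deps.dataset}


# In a test, pass the fake; in production, pass an object backed by HTTP.
fake_deps = Deps(
    products=FakeProducts({'blue-mug': {'slug': 'blue-mug', 'price': 12}}),
    dataset='test',
)
product = asyncio.run(get_product(fake_deps, 'blue-mug'))
print(product)
```

Because the lookup never reaches for a global, swapping the backend is a matter of constructing a different `Deps`.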

Why retrieval, not the model, is usually the bug

When a Pydantic-AI agent gives a wrong answer in production, the instinct is to blame the model, swap to a bigger one, or rewrite the system prompt. Most of the time the trace tells a different story. Log what every tool returned on the failing run and you will usually find the model was working from bad input. The product lookup returned a stale cache. The search tool matched on the wrong field. The context window held three near-duplicate paragraphs and none of the one that mattered.

This is why instrumenting your tools is non-negotiable before you tune anything else. Pydantic-AI integrates cleanly with OpenTelemetry through Logfire, and the thing worth capturing is the full tool input and output, not just the final message. You want to be able to read, for a given bad answer, exactly what record the agent saw. More often than not, the model reasoned correctly over wrong data.
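As a framework-independent illustration of what "capture the full tool input and output" means, here is a minimal sketch using only the standard library. The `traced` decorator and the `get_product` lookup are hypothetical; in a real deployment you would more likely rely on Logfire or another OpenTelemetry exporter than on hand-rolled logging.

```python
import asyncio
import functools
import json
import logging

logger = logging.getLogger('agent.tools')


def traced(tool):
    """Wrap an async tool so every successful call logs its input and output."""

    @functools.wraps(tool)
    async def wrapper(*args, **kwargs):
        result = await tool(*args, **kwargs)
        logger.info(
            'tool=%s input=%s output=%s',
            tool.__name__,
            # default=str keeps non-JSON arguments from breaking the log line.
            json.dumps({'args': args, 'kwargs': kwargs}, default=str),
            json.dumps(result, default=str),
        )
        return result

    return wrapper


@traced
async def get_product(slug: str) -> dict:
    # Hypothetical lookup; a real tool would call the content backend here.
    return {'slug': slug, 'status': 'discontinued'}


logging.basicConfig(level=logging.INFO)
record = asyncio.run(get_product('blue-mug'))
```

With a log line like this per tool call, a bad answer can be traced back to the exact record the agent was given.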

Once you accept that retrieval is the failure surface, the engineering problem sharpens. It is not 'make the model smarter'. It is 'put the correct, current, complete record in front of the model on this turn'. That reframing is what the rest of this article is about: building a tool layer that retrieves the right thing.

⚠️

Bigger models do not fix retrieval bugs

Upgrading from gpt-4o-mini to a frontier model rarely fixes a hallucination caused by stale or missing context. The model was already reasoning fine. It was reasoning over the wrong document. Trace your tool outputs before you touch the model choice, and you will save the upgrade for the problems it actually solves.

Vector search alone returns the confidently wrong document

The default reach when an agent needs to look something up is a vector database. Embed the query, embed the corpus, return the nearest chunks. It works until the query has a structural component that embeddings cannot represent. 'The current return policy for orders shipped to Germany' has three predicates hiding in it: current (publication state), return policy (category), and Germany (region). Pure cosine similarity flattens all of that into vibes. It will happily return last year's policy because it reads semantically identical to this year's.

The failure is quiet and that is what makes it dangerous. The retrieved chunk looks right, parses into your result type, and ships. Nobody notices the policy is six months out of date until a customer does. You cannot validate your way out of this in Pydantic-AI because the validator only sees the model's output, not the silent mis-retrieval upstream.

What you actually want is retrieval that respects structure: filter on the predicates that have a definite answer (state, region, category), then rank what survives by semantic relevance. Vector search is one ingredient, not the whole recipe. For high-volume machine-generated corpora that nobody edits, a dedicated vector store is still the right tool. But for editorial content (a catalog, a policy library, articles with a schema) you want the structural filter and the semantic rank in the same query, and that is a different shape of backend than a bare vector DB.
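The filter-then-rank idea can be sketched in a few lines of plain Python. Everything here is illustrative: the documents are invented, and the bag-of-words cosine is a toy stand-in for a real embedding model, chosen only so the example runs without dependencies.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Toy similarity over word counts; a stand-in for embedding similarity."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(docs, query, *, region, category, top_k=3):
    # Step 1: structural filter on predicates that have a definite answer.
    candidates = [
        d for d in docs
        if d['status'] == 'published'
        and d['region'] == region
        and d['category'] == category
    ]
    # Step 2: rank only the survivors by similarity to the query.
    q = bag_of_words(query)
    ranked = sorted(
        candidates,
        key=lambda d: cosine(q, bag_of_words(d['body'])),
        reverse=True,
    )
    return ranked[:top_k]


docs = [
    {'id': 'policy-2023', 'status': 'archived', 'region': 'DE',
     'category': 'returns',
     'body': 'return policy for orders shipped to germany 14 days'},
    {'id': 'policy-2024', 'status': 'published', 'region': 'DE',
     'category': 'returns',
     'body': 'return policy for orders shipped to germany 30 days'},
    {'id': 'policy-us', 'status': 'published', 'region': 'US',
     'category': 'returns',
     'body': 'return policy for orders shipped within the united states'},
]

query = 'current return policy for orders shipped to Germany'
hits = retrieve(docs, query, region='DE', category='returns')
print([d['id'] for d in hits])
```

In this sample the archived and the published German policies score identically on similarity, so only the structural filter separates them.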

ℹ️

Structured predicates first, semantics second

In production, the heavy majority of retrieval calls an agent needs are structural: an exact lookup, a filter on state or category, a join on a reference. Semantic similarity is a small but real slice you reach for when keyword and filter retrieval leave the agent guessing. Build the structured path first; add embeddings where the failures justify them.

Give the agent a typed content backend through a tool

The clean fix is to back the agent with a content layer that is typed, current, governed, and queryable on both structure and meaning. This is where Sanity Context fits into a Pydantic-AI stack. It is not a competing agent framework and it does not replace Pydantic-AI. It sits underneath your agent in the retrieval layer and hands your tools real, schema-aware content instead of a static JSON blob you sync by hand.

The fastest way in is the Context MCP endpoint, a hosted, read-only MCP server. Pydantic-AI speaks MCP natively, so you attach it as a toolset and your agent gets schema-aware retrieval tools without you writing wrappers. Read-only matters here: the agent can query, but it cannot mutate your content through MCP, which is exactly the boundary you want for an autonomous loop. Writes, when you need them, go through Agent Actions, not the MCP surface.

Under the hood this is Sanity's Content Operating System for the AI era: the same Content Lake your editors publish into, exposed to the agent as queryable, typed content. Your editors model the business in a schema and govern what is live; your agent reads exactly that, current as of this turn. The product fixes the demo-to-production gap from the other side. The recommendation is well-shaped because Pydantic validated it, and it is true because the content under it was modeled, governed, and retrieved correctly.

Attaching the Context MCP endpoint to a Pydantic-AI agent

Pydantic-AI attaches the read-only Context MCP endpoint as a toolset; the agent gets schema-aware retrieval with no custom wrappers.

import asyncio

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStreamableHTTP

# Hosted, read-only MCP endpoint for your dataset.
# Assumes the endpoint speaks Streamable HTTP; use MCPServerSSE for an SSE endpoint.
sanity_context = MCPServerStreamableHTTP(
    url='https://<projectId>.api.sanity.io/v1/mcp',
)

agent = Agent(
    'anthropic:claude-sonnet-4-5',
    toolsets=[sanity_context],
    system_prompt=(
        'Answer only from documents returned by the Sanity tools. '
        'If the tools return nothing relevant, say you do not know.'
    ),
)

async def main():
    # Entering the agent context opens the MCP connection for the run.
    async with agent:
        result = await agent.run(
            'What is the current return policy for orders to Germany?'
        )
        print(result.output)

if __name__ == '__main__':
    asyncio.run(main())

Hybrid retrieval in one query, and where it does not belong

When you do need semantic ranking, the win is doing it in the same query as your structural filters rather than across two systems you have to reconcile. In GROQ, Sanity's query language, you filter on predicates inside the `*[ ... ]` selector, then use `score()` with `text::semanticSimilarity()` to rank the survivors. Structure narrows the candidate set to documents that are actually correct; semantics orders what is left by relevance. The German return policy query becomes a filter on region and publication state, then a semantic rank, in one round trip.

Worth knowing: embeddings are opt-in and off by default, and most projects shipping on Context MCP never turn them on. The structured path (exact lookups, schema-aware filters, the compressed initial context) carries the heavy majority of calls. You reach for semantic similarity when the agent's failures show that keyword and filter retrieval are leaving it guessing, not as a reflex.

For unstructured sources (PDFs, scraped pages, a support database), the right surface is Knowledge Bases, which turns a messy corpus into well-ordered documents with a table of contents the agent can navigate. And for a high-volume corpus of machine-generated vectors that no human will ever edit or govern, a dedicated vector DB still earns its place; not everything belongs in your content backend. The judgment is matching the retrieval surface to the shape of the content, and giving your Pydantic-AI tools the one that fits.

Structural filter plus semantic rank in one GROQ query

*[
  _type == "policy" &&
  region == $region &&
  status == "published"
]
| score(
    boost(category == "returns", 2),
    text::semanticSimilarity($queryText)
  )
| order(_score desc)[0...3]{
  _id, title, body, _score
}
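The query above leaves `$region` and `$queryText` unbound. As a rough sketch of how they might be supplied, the snippet below builds a request URL for Sanity's HTTP query endpoint, where each GROQ parameter is passed as a JSON-encoded `$name` query-string value. The project id, dataset name, and API version are placeholders, and whether `text::semanticSimilarity()` is available on a given API version or plan is an assumption to verify against your own project. No request is sent.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

GROQ_QUERY = """
*[_type == "policy" && region == $region && status == "published"]
| score(
    boost(category == "returns", 2),
    text::semanticSimilarity($queryText)
  )
| order(_score desc)[0...3]{_id, title, body, _score}
""".strip()


def build_query_url(project_id, dataset, groq, params, api_version):
    """Build a GET URL for the query endpoint with JSON-encoded GROQ params."""
    query_string = {'query': groq}
    for name, value in params.items():
        # Each GROQ parameter travels as `$name=<JSON value>`.
        query_string[f'${name}'] = json.dumps(value)
    return (
        f'https://{project_id}.api.sanity.io/{api_version}'
        f'/data/query/{dataset}?{urlencode(query_string)}'
    )


url = build_query_url(
    project_id='your-project-id',  # placeholder, not a real project
    dataset='production',          # placeholder dataset name
    groq=GROQ_QUERY,
    params={'region': 'DE', 'queryText': 'return policy for orders to Germany'},
    api_version='v2021-10-21',     # placeholder; use the version your project targets
)

# Decode the query string again to confirm the parameters survived encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded['$region'], decoded['$queryText'])
```

Keeping the filter values in parameters rather than interpolating them into the query string means the agent's tool never has to assemble GROQ from untrusted text.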