Improve accuracy of Mastra agents with Sanity Context

Your Mastra agent works great in the demo. A user asks "show me the spring jackets under $200 that are in stock," the agent calls your `searchProducts` tool, and it returns six fleece pullovers from last winter that are out of stock. The model didn't hallucinate. The tool did its job. The problem is that the tool flattened a structured query into a vector search, and "spring," "$200," and "in stock" are exactly the predicates embeddings are worst at honoring.

Sanity Context handles this case differently. Its Context MCP endpoint speaks GROQ, so predicates like price ranges and stock status stay predicates instead of becoming proximity scores handed to a nearest-neighbor index.

This is the gap between a tool that compiles and a tool the agent can trust. Mastra makes it trivial to define a tool with `createTool`, give it a Zod schema, and hand it to an `Agent`. What it can't do for you is guarantee that the tool returns the right rows when the query has a structural component. That part is on you, and it's where most agents quietly degrade in production.

This article covers how to build schema-aware tools in Mastra that respect structure: how to type your tool I/O so the model calls it correctly, how to debug retrieval failures with Mastra's tracing, and how to back those tools with GROQ retrieval through Sanity Context so date ranges, stock status, and publication state resolve as predicates instead of getting blurred into a similarity score.

Why your Mastra tool returns the wrong rows

Start with the failure, because it's specific. You wrote a tool whose entire input schema is `{ query: string }` and handed it to an agent.

The agent calls `searchProducts({ query: "spring jackets under $200 in stock" })`. Your tool embeds that whole string, runs a nearest-neighbor lookup against a vector index, and returns the top six by cosine similarity. The results read plausibly to the model, so it summarizes them confidently. Nobody notices the price filter was ignored until a customer does.

The root cause is that you encoded a structured request as an unstructured one. "Under $200" is a numeric predicate. "In stock" is a boolean on current inventory. "Spring" is a category or season tag. None of those are similarity questions, but a single embedded string forces them through a similarity ranker. Vector search is genuinely good at "find me jackets that feel like this description." It is bad at "and only the ones priced below 200 that we can ship today."

The second-order problem is that the model has no way to know the tool got it wrong. From the agent's perspective the tool succeeded: it returned rows, the schema validated, the loop continued. There's no error to catch. This is why these bugs survive code review and show up as a slow drip of bad answers in production rather than a stack trace. The fix is not a better embedding model. It's giving the tool a contract that separates the structural part of the query from the semantic part, and a retrieval backend that honors both.
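To make the failure concrete, here is a toy sketch rather than a real Mastra tool. The catalog, the two-dimensional vectors, and the `searchBySimilarity` helper are all hypothetical stand-ins for a real embedding model and vector index. The point is only that a similarity ranker returns rows without ever looking at price or stock, and nothing throws.

```typescript
// Toy catalog: the vectors are hand-made stand-ins for real embeddings.
interface Product {
  id: string;
  title: string;
  price: number;
  inStock: boolean;
  vector: number[];
}

const catalog: Product[] = [
  { id: "p1", title: "Winter fleece pullover", price: 240, inStock: false, vector: [0.9, 0.4] },
  { id: "p2", title: "Spring rain jacket", price: 180, inStock: true, vector: [0.6, 0.8] },
  { id: "p3", title: "Insulated parka", price: 320, inStock: true, vector: [0.95, 0.3] },
];

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Similarity-only retrieval: the whole request is one vector,
// so price and stock are never checked.
function searchBySimilarity(queryVector: number[], k: number): Product[] {
  return catalog
    .slice()
    .sort((x, y) => cosine(queryVector, y.vector) - cosine(queryVector, x.vector))
    .slice(0, k);
}

// Pretend this is the embedding of "spring jackets under $200 in stock".
const queryVector = [0.9, 0.45];
const results = searchBySimilarity(queryVector, 2);

// Count the rows that break the request's structural constraints.
const violations = results.filter((p) => p.price >= 200 || !p.inStock);
console.log(results.map((p) => p.id), "violations:", violations.length);
```

In this toy data, the only row that satisfies the request (`p2`) ranks last, and the call still completes without an error. That is the shape of the bug: plausible rows and a clean return.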

Type the tool so the model fills in the right slots

Mastra tools are only as good as their input schema. If your schema is `{ query: string }`, you've told the model "jam everything into one field," and it will. The fix is to expose the structure in the Zod schema so the model has a slot for each predicate. Mastra passes that schema to the LLM as the tool's function signature, so the model will populate `maxPrice`, `inStock`, and `season` as separate arguments when you give it separate fields.

This is the single highest-leverage change you can make, and it costs you a few lines of Zod. The model is far better at extracting `maxPrice: 200` into a typed number field than it is at preserving "under $200" through an embedding. Once the arguments arrive structured, your tool can route them: structural fields become filters, the free-text remainder becomes the semantic part.

The `outputSchema` matters just as much, and most people skip it. When you type the output, Mastra validates the tool's return before it reaches the model, and the model gets a predictable shape to reason over. An untyped output means the agent sees whatever your function happened to return that call, which makes downstream reasoning brittle. Type both ends. The schema is the contract between the model's intent and your data layer.
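As a rough illustration of what "typing the output" buys you, here is a dependency-free sketch of the kind of check an output schema performs. In a real tool the Zod `outputSchema` does this work; the `parseSearchOutput` and `isProductRow` names below are hypothetical and exist only to show the idea of rejecting a malformed payload before the model reasons over it.

```typescript
interface ProductRow {
  id: string;
  title: string;
  price: number;
  inStock: boolean;
}

interface SearchOutput {
  products: ProductRow[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isProductRow(value: unknown): value is ProductRow {
  return (
    isRecord(value) &&
    typeof value["id"] === "string" &&
    typeof value["title"] === "string" &&
    typeof value["price"] === "number" &&
    typeof value["inStock"] === "boolean"
  );
}

// Throws instead of letting a malformed payload reach the model.
function parseSearchOutput(raw: unknown): SearchOutput {
  const products = isRecord(raw) ? raw["products"] : undefined;
  if (!Array.isArray(products)) {
    throw new Error("tool output must be { products: [...] }");
  }
  for (let i = 0; i < products.length; i++) {
    if (!isProductRow(products[i])) {
      throw new Error("products[" + i + "] does not match the output contract");
    }
  }
  return { products: products as ProductRow[] };
}

// A well-formed payload passes through unchanged.
const parsed = parseSearchOutput({
  products: [{ id: "p2", title: "Spring rain jacket", price: 180, inStock: true }],
});
console.log(parsed.products.length);
```

A payload where `price` arrives as the string `"180"` would throw here, which is the behavior you want: a loud contract failure in the tool, not a quietly confused model.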

A schema-aware product search tool in Mastra

Separate slots for structural predicates keep the model from blurring price and stock into a single embedded string.

import { createTool } from "@mastra/core/tools";
import { z } from "zod";

export const searchProducts = createTool({
  id: "search-products",
  description:
    "Search the product catalog. Use semantic for descriptive terms; use the typed filters for price, stock, and season.",
  inputSchema: z.object({
    semantic: z.string().describe("Descriptive part only, e.g. 'lightweight rain jacket'"),
    maxPrice: z.number().optional(),
    inStock: z.boolean().optional(),
    season: z.enum(["spring", "summer", "fall", "winter"]).optional(),
  }),
  outputSchema: z.object({
    products: z.array(
      z.object({
        id: z.string(),
        title: z.string(),
        price: z.number(),
        inStock: z.boolean(),
      }),
    ),
  }),
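  execute: async ({ context }) => {
    const { semantic, maxPrice, inStock, season } = context;
    // Structural fields become filters; `semantic` feeds the ranker.
    // `runRetrieval` is a placeholder for your retrieval backend
    // (see "Resolve structure as predicates, semantics as a score").
    return { products: await runRetrieval({ semantic, maxPrice, inStock, season }) };
  },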
});

Trace what the tool actually saw, not what the model said

When an agent gives a wrong answer, the instinct is to blame the model and tweak the prompt. In production, the failure is almost always upstream: the retrieval returned the wrong rows, and the model faithfully summarized garbage. You can't debug that from the final message. You have to log what the tool saw.

Mastra emits OpenTelemetry traces, and you should turn them on before you ship anything. Configure telemetry in the Mastra instance and your tool calls, arguments, and results become spans you can inspect in any OTel-compatible backend. The span you care about is the tool execution: the exact arguments the model passed, and the exact rows the tool returned. When a customer complains, you pull that trace and immediately see whether `maxPrice` was even populated or whether the model dropped it.

This reframes debugging. Instead of "the model hallucinated," you get "the model called `searchProducts` with `maxPrice: undefined`, so my schema description wasn't clear enough" or "the model passed the filters correctly but my retrieval ignored them." Those are two completely different fixes, and the trace tells you which one you have. Log the resolved query alongside the result, not just the count of rows, so you can see the structural part actually applied.
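One way to turn that distinction into a repeatable check is a small helper that reads a logged tool call and reports which filters were never populated and which were populated but not honored by the returned rows. This is a sketch under assumptions: the `ToolTrace` shape and `inspectTrace` are hypothetical, not a Mastra API, and you would fill the trace from whatever your tracing backend records.

```typescript
interface SearchArgs {
  semantic: string;
  maxPrice?: number;
  inStock?: boolean;
}

interface Row {
  id: string;
  price: number;
  inStock: boolean;
}

// Hypothetical record of one tool call: arguments, resolved query, rows.
interface ToolTrace {
  args: SearchArgs;
  resolvedQuery: string;
  rows: Row[];
}

interface TraceReport {
  unpopulated: string[]; // filters the model never passed
  violated: string[]; // filters passed but not honored by the rows
}

function inspectTrace(trace: ToolTrace): TraceReport {
  const maxPrice = trace.args.maxPrice;
  const inStock = trace.args.inStock;
  let priceViolated = false;
  let stockViolated = false;

  for (const row of trace.rows) {
    if (maxPrice !== undefined && row.price > maxPrice) priceViolated = true;
    if (inStock !== undefined && row.inStock !== inStock) stockViolated = true;
  }

  const report: TraceReport = { unpopulated: [], violated: [] };
  if (maxPrice === undefined) report.unpopulated.push("maxPrice");
  if (inStock === undefined) report.unpopulated.push("inStock");
  if (priceViolated) report.violated.push("maxPrice");
  if (stockViolated) report.violated.push("inStock");
  return report;
}

// Case 1: the model jammed everything into `semantic`.
const dropped = inspectTrace({
  args: { semantic: "spring jackets under $200 in stock" },
  resolvedQuery: "similarity only",
  rows: [{ id: "p1", price: 240, inStock: false }],
});

// Case 2: the model passed the filters, but retrieval ignored the price.
const ignored = inspectTrace({
  args: { semantic: "jackets", maxPrice: 200, inStock: true },
  resolvedQuery: '*[_type == "product"]',
  rows: [{ id: "p3", price: 320, inStock: true }],
});

console.log(dropped, ignored);
```

An unpopulated filter is only a bug when the user's message actually contained that constraint, so read `unpopulated` next to the original request. A non-empty `violated` list always points at the retrieval layer.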

Enable OpenTelemetry tracing on the Mastra instance

With telemetry on, each tool call becomes a span showing the exact arguments and returned rows.

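import { Mastra } from "@mastra/core";
// Assumed path: point this at wherever you export `catalogAgent`.
import { catalogAgent } from "./agents/catalog-agent";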

export const mastra = new Mastra({
  agents: { catalogAgent },
  telemetry: {
    serviceName: "catalog-agent",
    enabled: true,
    export: {
      type: "otlp",
      endpoint: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
    },
  },
});

Resolve structure as predicates, semantics as a score

Now the part that actually fixes the wrong-rows bug. Once your Mastra tool receives structured arguments, the retrieval backend has to honor the difference between a predicate and a ranker. `maxPrice`, `inStock`, and `season` are predicates: a row either qualifies or it doesn't. `semantic` is a ranker: among the qualifying rows, order by relevance. Conflating the two is the original sin that started this article.

This is exactly what GROQ retrieval through Sanity Context gives you in one query. The structural fields go in the filter, `*[ ... ]`, where they behave as hard predicates: a jacket priced at 210 is excluded, full stop, no similarity score can rescue it. The descriptive remainder feeds `text::semanticSimilarity($queryText)` inside a `score()` pipeline, which sets `_score` so you can `order(_score desc)`. Structure narrows the set; semantics ranks what's left.

Worth saying plainly: most projects don't need the semantic part at all. Production data on Context MCP shows the heavy majority of agent retrieval calls are structured (GROQ filters and schema lookups), and embeddings are opt-in and off by default. If your users ask "jackets under $200 in stock," a pure predicate query answers it perfectly with no embeddings in sight. Reach for `text::semanticSimilarity()` only when the descriptive matching genuinely fails without it. Hybrid means structural predicates plus optional similarity, not similarity everywhere.

Hybrid retrieval in one GROQ query: predicates filter, similarity ranks

Price, stock, and season are hard predicates in the filter; the descriptive query ranks the survivors.

*[_type == "product"
  && price <= $maxPrice
  && inStock == true
  && season == $season
] | score(
    boost(title match text::query($queryText), 2),
    text::semanticSimilarity($queryText)
  )
  | order(_score desc) [0...6] {
    _id, title, price, inStock
  }
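The query above assumes every predicate is present, but the tool's filters are optional. One possible way to bridge that, sketched here with a hypothetical `buildProductQuery` helper, is to assemble the filter from only the arguments the model supplied, so the query never references a parameter that was left out. The score pipeline is copied from the query above; falling back to `order(price asc)` when there is no descriptive text is an assumption for illustration, not something the article prescribes.

```typescript
interface SearchArgs {
  semantic?: string;
  maxPrice?: number;
  inStock?: boolean;
  season?: "spring" | "summer" | "fall" | "winter";
}

interface BuiltQuery {
  query: string;
  params: Record<string, string | number | boolean>;
}

function buildProductQuery(args: SearchArgs, limit: number): BuiltQuery {
  const predicates: string[] = ['_type == "product"'];
  const params: Record<string, string | number | boolean> = {};

  // Each structural field becomes a hard predicate only if it was supplied.
  if (args.maxPrice !== undefined) {
    predicates.push("price <= $maxPrice");
    params["maxPrice"] = args.maxPrice;
  }
  if (args.inStock !== undefined) {
    predicates.push("inStock == $inStock");
    params["inStock"] = args.inStock;
  }
  if (args.season !== undefined) {
    predicates.push("season == $season");
    params["season"] = args.season;
  }

  const filter = "*[" + predicates.join(" && ") + "]";
  const slice = "[0..." + limit + "]";
  const projection = "{ _id, title, price, inStock }";
  const text = (args.semantic || "").trim();

  // No descriptive text: a pure predicate query, no scoring at all.
  if (text === "") {
    return { query: filter + " | order(price asc) " + slice + " " + projection, params };
  }

  // Descriptive text present: rank the surviving rows by relevance.
  params["queryText"] = text;
  const score =
    "score(boost(title match text::query($queryText), 2), text::semanticSimilarity($queryText))";
  return {
    query: filter + " | " + score + " | order(_score desc) " + slice + " " + projection,
    params,
  };
}

const structuredOnly = buildProductQuery({ maxPrice: 200, inStock: true }, 6);
const hybrid = buildProductQuery({ semantic: "lightweight rain jacket", season: "spring" }, 6);
console.log(structuredOnly.query);
console.log(hybrid.query);
```

Values travel as parameters and are never concatenated into the query string, so whatever text the model passes in `semantic` cannot alter the filter.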

Wire it into the agent: MCP first, custom tool second

There are two ways to get this retrieval into your Mastra agent, and the fastest is not the one most people reach for. Sanity ships Context MCP, a hosted, read-only MCP endpoint that exposes schema-aware tools to any agent loop. Mastra speaks MCP natively through `@mastra/mcp`, so you attach the endpoint as an MCP server and the agent gets typed query tools without you writing a single GROQ string. This is the right default: less code to maintain, and the tools stay in sync with your content schema.

The read-only constraint is worth internalizing. The MCP endpoint lets an agent read and query content; it does not write. That's a feature, not a gap. A customer-facing catalog agent has no business mutating your CMS, and the protocol enforces it. When you do need writes (drafting content, updating a field), those go through Agent Actions, a separate governed path, not through MCP.

The second path is a custom tool: the thin `createTool` wrapper from earlier, with `execute` running a typed GROQ query through `createClient` from `@sanity/client` (or from `next-sanity` if the agent lives inside a Next.js app). Use this when you want full control over the exact query, want to combine the result with other data sources before returning, or want to shape the output schema precisely. Both paths land in the same place, an agent whose tools honor structure. Start with MCP; drop to a custom tool when you outgrow it.

Attach the hosted Context MCP endpoint to a Mastra agent

Mastra attaches the hosted MCP endpoint as a server and gets schema-aware query tools out of the box.

import { Agent } from "@mastra/core/agent";
import { MCPClient } from "@mastra/mcp";
import { openai } from "@ai-sdk/openai";

const mcp = new MCPClient({
  servers: {
    sanity: {
      url: new URL("https://mcp.sanity.io/mcp"),
      // read-only: query and read content, no writes
    },
  },
});

export const catalogAgent = new Agent({
  name: "catalog-agent",
  instructions:
    "Answer product questions. Use the Sanity tools; pass price, stock, and season as filters, not as part of the search text.",
  model: openai("gpt-4o"),
  tools: await mcp.getTools(),
});

Where to keep agent instructions, brand voice, and approved answers

One more pattern that bites Mastra teams as they scale past a single agent. Your agent's instructions, its allowed product descriptions, the approved phrasing for refunds, the brand voice it should match, all of that is content. Today it probably lives as string literals in your TypeScript, which means a copy change is a code deploy and a PR review by an engineer who isn't the person who owns the copy.

Mastra gives you a clean place to inject this at runtime: instructions and tool results are just data you pass into the `Agent`. The question is where that data should live. Per-user chat history and ephemeral session state belong in fast key-value storage (Upstash, Redis); that's not what this is. The instructions, knowledge, and approved responses are editorial assets that should be versioned, reviewed by a human, and previewed before they reach users.

This is the editorial side of agent state, and it's where Sanity sits as the Content Operating System for the AI era: the intelligent backend where the content your agent reads is modeled, governed, and edited by the people who own it, not buried in code. Store the agent's instructions and approved responses as documents, let the content team edit them in the Studio, and stage changes with Content Releases so a new refund policy goes live the same way a marketing page does, reviewed, previewed, and scheduled. Your agent reads the current version through the same MCP endpoint or GROQ query it already uses.
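A minimal sketch of the read side, assuming a hypothetical `agentInstructions` document type with `agentName`, `instructions`, and `brandVoice` fields (none of these names come from Sanity or Mastra). The GROQ string fetches one document by agent name, and `composeInstructions` turns it into the instruction string you hand to the `Agent`, falling back to a default baked into the code when the document is missing or empty.

```typescript
// Hypothetical document shape; model it however your content team prefers.
interface AgentInstructionsDoc {
  _id: string;
  _type: "agentInstructions";
  agentName: string;
  instructions?: string;
  brandVoice?: string;
}

// Query to run through your Sanity client with { agentName: "catalog-agent" }.
const INSTRUCTIONS_QUERY =
  '*[_type == "agentInstructions" && agentName == $agentName][0]';

// Build the final instruction string, with a code-level fallback
// so the agent still works if the document is missing or empty.
function composeInstructions(doc: AgentInstructionsDoc | null, fallback: string): string {
  const base = doc && doc.instructions ? doc.instructions.trim() : "";
  if (base === "") return fallback;
  const voice = doc && doc.brandVoice ? doc.brandVoice.trim() : "";
  return voice === "" ? base : base + "\n\nBrand voice: " + voice;
}

const fallback = "Answer product questions using the catalog tools.";

const fromContent = composeInstructions(
  {
    _id: "agentInstructions.catalog",
    _type: "agentInstructions",
    agentName: "catalog-agent",
    instructions: "Answer product questions. Pass price, stock, and season as filters.",
    brandVoice: "Plain, friendly, no superlatives.",
  },
  fallback,
);
const fromFallback = composeInstructions(null, fallback);
console.log(INSTRUCTIONS_QUERY, fromContent, fromFallback);
```

Keeping the fallback in code means an unpublished or deleted instructions document degrades the agent to a known default rather than to an empty prompt.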

ℹ️

Keep ephemeral state and editorial state in different places

Per-user conversation history is high-write, throwaway, and belongs in Upstash or Redis. Agent instructions, brand voice, and approved responses are low-write, governed, and human-edited; those belong in versioned content where Content Releases can stage and preview changes before they reach a single user. Mixing the two is how a hotfix to refund copy turns into a code deploy.

When a dedicated vector DB is still the right call

To be honest about the boundaries: GROQ retrieval through Sanity Context is the right tool when your content is structured and editorially governed, such as a product catalog, an article library, or a documentation set with a real schema. It's not the answer to every retrieval problem your Mastra agent will face.

If your corpus is messy and unstructured (PDFs, scraped websites, an export of support tickets), the routing changes. That's what Sanity Context Knowledge Bases (launched September 2026) is for: it turns those sources into well-ordered documents with a clear table of contents that an agent can navigate, rather than forcing you to chunk and embed raw text by hand. Structured content goes through GROQ; unstructured content goes through Knowledge Bases.

And if you're retrieving over a high-volume, machine-generated corpus that nobody is going to edit, logs, telemetry, millions of auto-generated embeddings, a dedicated vector database (Pinecone, Weaviate, Qdrant, pgvector) is still a perfectly good choice, and probably a better one. Not everything belongs behind an editorial workflow. The point of this article isn't that every Mastra tool should query Sanity. It's that when your agent returns the wrong rows because it blurred structure into similarity, the fix is to separate the predicate from the ranker, type the tool so the model fills the right slots, and trace what the tool actually saw. Pick the retrieval backend that matches the shape of your content, and your agent stops quietly lying.

⚠️

Don't put a machine-generated corpus behind an editorial workflow

Content Releases, review, and preview are valuable for content humans own and edit. For millions of auto-generated log embeddings that nobody will ever curate, that governance is pure overhead. Route high-volume, no-editorial corpora to a dedicated vector DB; route catalog and article content to GROQ; route messy PDFs and support data to Knowledge Bases.