Improve accuracy of CrewAI agents with Sanity Context

Your CrewAI research crew runs clean in the demo. Then a customer asks your support agent which plan includes SSO, and it confidently invents a "Business Plus" tier that has never existed. The model isn't broken. It just had nothing real to answer from, so it pattern-matched a plausible-sounding answer. Wrap a few agents in a Crew and the failure compounds: one agent hallucinates a fact, the next agent treats it as ground truth, and by the time the result reaches your task output the fabrication is three hops deep and looks authoritative.

Sanity Context is Sanity's agent-facing product, and its Context MCP endpoint gives each agent in your Crew a tool that returns real schema-backed data via GROQ, so the chain answers from facts instead of pattern-matching a confident fiction.

The fix is not a better prompt or a sterner system message telling the agent "do not make things up." Models don't obey that instruction under pressure. The fix is retrieval: give the agent a tool that returns your real company facts, scoped to the question, before it answers. This article shows how to build that in CrewAI, where tool calls and grounding actually live in the framework, how to tell the LLM to abstain when retrieval comes back empty, and where Sanity Context fits as the structured source those tools read from.

Why a Crew hallucinates, and where the fabrication enters

Hallucination in a single LLM call is a known quantity. In a Crew it gets worse because agents pass outputs to each other as context. CrewAI's `Process.sequential` feeds the result of one task into the next as part of the prompt. If your `researcher` agent makes up a pricing tier, your `writer` agent receives that invented tier as established fact and elaborates on it. There is no point in the chain where the fabrication gets challenged, because no agent has access to a source of truth to check against.

The naive instinct is to harden the backstory and goal strings. You write `backstory='You are a meticulous analyst who never invents facts'` and hope. It doesn't hold. The model has no facts to be meticulous about. When the question is 'does the Pro plan include audit logs' and nothing in the context window mentions audit logs, the model fills the gap with the statistically likely answer, which for a SaaS pricing page is 'yes, on higher tiers.' That guess is wrong roughly as often as it's right.

The real lever in CrewAI is the `tools` parameter on an `Agent`. A tool is a function the agent can call mid-reasoning to fetch external data. When an agent has a retrieval tool, the LLM's job shifts from 'recall the answer' to 'call the tool, then summarize what it returned.' That's a fundamentally more reliable operation. The agent is no longer the source of facts. It's a router over a source you control. Everything downstream depends on getting that tool right.

A two-agent crew where the fabrication propagates

from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Product Researcher",
    goal="Find accurate details about our plans",
    backstory="A meticulous analyst who never invents facts.",
    verbose=True,
)

writer = Agent(
    role="Support Writer",
    goal="Answer the customer clearly",
    backstory="You write concise support replies.",
    verbose=True,
)

research = Task(
    description="Which plan includes SSO and audit logs?",
    agent=researcher,
    expected_output="The plan name and the features it includes.",
)
reply = Task(
    description="Write the customer reply.",
    agent=writer,
    expected_output="A short support answer.",
)

# researcher has no facts, so it guesses; writer trusts the guess
crew = Crew(tasks=[research, reply], process=Process.sequential)
print(crew.kickoff())

Give the agent a retrieval tool, then force it to abstain on a miss

The grounding fix has two halves, and most teams ship only the first. Half one: attach a tool that returns real data. Half two: make the agent abstain when that tool returns nothing. Without the second half you've just moved the hallucination, the agent now invents an answer when retrieval misses instead of when the context is empty.

CrewAI exposes tools through the `@tool` decorator or a `BaseTool` subclass. The function signature and docstring become the tool's schema, so the agent knows when and how to call it. Write the docstring like an instruction to the model, because that's exactly what it is. Inside the tool, query your source of truth and return the raw matched records. Critically, return a sentinel like `"NO_MATCHING_FACTS"` when nothing matches, rather than an empty string, because an empty string reads to the model as 'nothing to add' rather than 'you do not know this.'

Then close the loop in the agent's backstory: 'If the lookup returns NO_MATCHING_FACTS, reply that you don't have that information and offer to escalate. Never answer from prior knowledge.' This is the one prompt instruction that actually works, because it's conditioned on an observable tool result the model can see, not on an abstract virtue. Pair it with `Crew(memory=False)` while you debug, so stale entity memory doesn't paper over a retrieval gap and hide the failure from you.

A grounded CrewAI tool with an explicit abstain path

from crewai.tools import tool

@tool("lookup_company_facts")
def lookup_company_facts(question: str) -> str:
    """Look up verified company facts (plans, features, policies)
    relevant to the question. ALWAYS call this before answering a
    factual question. If it returns NO_MATCHING_FACTS, you do not
    know the answer and must say so."""
    records = fetch_facts(question)  # your retrieval call
    if not records:
        return "NO_MATCHING_FACTS"
    return "\n".join(r["text"] for r in records)

researcher = Agent(
    role="Product Researcher",
    goal="Answer only from lookup_company_facts results",
    backstory=(
        "You answer strictly from the lookup_company_facts tool. "
        "If it returns NO_MATCHING_FACTS, you reply that you do not "
        "have that information and offer to escalate. You never "
        "answer factual questions from prior knowledge."
    ),
    tools=[lookup_company_facts],
)

The retrieval call is where grounding actually succeeds or fails

Now `fetch_facts` carries all the weight. The default reach is a vector database: embed every fact, embed the question, return the top-k nearest. That works for 'find me docs about roughly this topic' and fails on exactly the questions support agents get asked. 'Which plan includes SSO' has a structural component. SSO is a discrete feature on a specific tier, and the right answer is a join, not a fuzzy neighbor. Pure semantic search will happily return the marketing blurb for a plan that mentions 'security' a lot while omitting the one that actually ships SSO.

The questions that hurt you most are the structured ones. 'What changed in the Pro plan since January.' 'Which features are on Enterprise but not Business.' 'Is the audit-log feature published yet or still in draft.' Date ranges, set differences, publication state. None of those resolve from embedding distance, because the discriminating part of the query isn't semantic, it's a predicate. If your retrieval layer can only do nearest-neighbor, you'll keep shipping confident wrong answers on precisely the queries where being wrong is most expensive.

What you want underneath `fetch_facts` is a query language that can filter on structure first, then optionally rank by relevance, in one round trip. Production retrieval is overwhelmingly structured lookups, not embedding search. Treat semantic similarity as a layer you reach for when keyword and structural filters genuinely can't disambiguate, not as the default tool you swing at every question.

⚠️

Top-k vector search is not the same as good retrieval

Vector search is one ingredient, not the recipe. For 'which plan includes SSO,' embedding distance returns the document that talks most about security, which is often the wrong plan. Structured predicates (tier, feature flag, publication state) resolve these queries correctly where pure similarity cannot. Reach for embeddings when structural filters genuinely can't disambiguate, not by default.

Point the tool at Sanity Context instead of a hand-rolled corpus

Here's where the source of truth matters. Your company facts (plans, features, policies, the brand voice your replies should use) are editorial content. They change, they need approval before they go live, and a human should be able to fix a wrong answer without a redeploy. That is exactly what Sanity is built for. Sanity is the Content Operating System for the AI era: the same structured content your marketing site renders becomes the structured content your CrewAI agents retrieve from, governed and versioned, with no separate sync job copying facts into a vector store that drifts out of date.

The fastest way to wire it into a Crew is the Context MCP endpoint, a hosted, read-only MCP server your agent connects to. CrewAI speaks MCP through `MCPServerAdapter`, so you attach the endpoint and your agents get schema-aware retrieval tools without you writing a single GROQ query. The read-only constraint is a feature here: an agent answering customer questions should never be able to mutate your content, and via MCP it structurally cannot. Writes, when you need them, go through Agent Actions, not this endpoint.

Because Sanity adapts to your content model rather than forcing you into a fixed shape, the tools the MCP endpoint exposes already understand your plans, your tiers, and your publication states. The agent can ask for facts the way they're actually modeled in your business, instead of you flattening everything into undifferentiated text chunks and hoping similarity sorts it out.

Attach the Sanity Context MCP endpoint to a CrewAI crew

from crewai import Agent, Crew
from crewai_tools import MCPServerAdapter

server_params = {
    "url": "https://<projectId>.api.sanity.io/mcp",
    "headers": {"Authorization": "Bearer <SANITY_TOKEN>"},
}

# read-only, schema-aware tools, no GROQ written by hand
with MCPServerAdapter(server_params) as sanity_tools:
    researcher = Agent(
        role="Product Researcher",
        goal="Answer only from Sanity Context tool results",
        backstory=(
            "You answer strictly from the Sanity tools. If they "
            "return no matching facts, you say you do not know and "
            "offer to escalate. Never answer from prior knowledge."
        ),
        tools=sanity_tools,
    )
    crew = Crew(agents=[researcher], tasks=[...])
    crew.kickoff()

When you want full query control: a typed GROQ tool

MCP is the default and the fastest way in, but some teams want to own the query. Maybe you need to combine a hard structural filter with relevance ranking in a specific way, or you want to log the exact query for observability. For that, drop to a thin `@tool` wrapper that runs a GROQ query against the Content Lake yourself. This is the second integration path, not a replacement for MCP, and the two coexist fine in one crew.

The reason this matters for grounding is that GROQ does structured and semantic retrieval in a single query. You filter on the predicates that actually discriminate (publication state, tier, a date range) inside the `*[ ... ]` selector, then rank the survivors by relevance. The structural filter runs first, so you never rank documents that fail the hard requirement. That's the opposite of the vector-DB pattern, where you rank first and filter the survivors, and routinely find the top-k contains nothing that satisfies your predicate.

The snippet below uses `text::semanticSimilarity()` to score documents against the question text, combined through `score()`, then `order(_score desc)`. Note the function takes the query text, not an embedding you computed elsewhere. Embeddings are opt-in in Sanity and off by default, so for most factual lookups you can lead with a plain keyword and structural query and only layer semantic scoring on top when keyword matching genuinely leaves the answer ambiguous.

A CrewAI tool running hybrid GROQ retrieval

from crewai.tools import tool
from sanity_client import sanity  # your configured client

QUERY = '''
*[_type == "feature" && publishState == "published"]
  | score(
      boost(tier == "enterprise", 2),
      text::semanticSimilarity($queryText)
    )
  | order(_score desc) [0...5] {
    name, tier, summary, _score
  }
'''

@tool("lookup_features")
def lookup_features(question: str) -> str:
    """Retrieve published feature facts ranked for the question.
    Returns NO_MATCHING_FACTS when nothing relevant is found."""
    rows = sanity.query(QUERY, {"queryText": question})
    if not rows:
        return "NO_MATCHING_FACTS"
    return "\n".join(f"{r['name']} ({r['tier']}): {r['summary']}"
                     for r in rows)

Debug a hallucination by logging what the agent retrieved

When a wrong answer slips through in production, your first instinct is to blame the model and swap GPT-4o for something else. Resist it. In a grounded Crew, almost every fabrication traces back to retrieval, not generation. The agent answered confidently because the tool handed it the wrong records, or handed it nothing and the abstain instruction didn't fire. You cannot diagnose that from the final task output alone. You need to see the tool call.

Log three things on every factual answer: the question the agent passed to the tool, the raw records the tool returned, and the final text the agent produced. Put them side by side. The failure mode is usually obvious once you do. Either the retrieval returned the marketing copy for the wrong tier (a query problem, fix the predicate or the ranking), or it returned `NO_MATCHING_FACTS` and the agent answered anyway (a prompt problem, strengthen the abstain instruction and consider a hard guardrail that short-circuits the task). CrewAI's `step_callback` gives you a hook on every agent action, including tool invocations, so you can capture this without rewriting your agents.

This is also where keeping facts in a governed content system pays off operationally. When the trace shows the tool returned a stale plan description, an editor fixes it in the Studio and publishes, and the next retrieval is correct, with no embedding re-index, no redeploy, and no engineer in the loop. The Live Content API means that corrected fact is queryable immediately. The retrieval layer becomes a thing your content team can fix, which is exactly where the fix belongs.

💡

Trace the retrieval, not just the answer

Wire a step_callback that logs the tool's input question and its raw output next to the agent's final text. Most 'the model hallucinated' tickets turn out to be 'the tool returned the wrong record' or 'the tool returned NO_MATCHING_FACTS and the agent ignored it.' You can only tell which from the retrieval trace, and the fix is completely different in each case.