Trace Arize agent calls against Sanity Context to close the retrieval gap

Your Arize dashboard says the model is fine. Span latency is healthy, token counts are normal, the LLM eval scores hover around 0.8. But users keep reporting answers that cite the wrong product variant, miss a published update, or confidently quote a document that was unpublished three weeks ago. You stare at the trace waterfall and the model call looks clean. The problem is upstream, in a span most teams barely instrument: retrieval.

Sanity Context is Sanity's agent-facing product. Its primary surface today is Context MCP, a hosted, read-only MCP endpoint that exposes schema reads, GROQ queries, reference traversal, and optional semantic search across a Sanity dataset. Knowledge Bases is the second surface, for unstructured sources like PDFs, websites, and support data. The point for an Arize user is narrow: when you log what the agent actually retrieved as structured span attributes, the failure stops being a mystery and becomes a query you can fix.

This article stays on the Arize side first. We will instrument retrieval spans properly, write an eval that scores retrieval independently of generation, and only then wire in a content source whose tool calls and query results are legible enough to show up cleanly in your traces.

The span you forgot to instrument

Most Arize setups instrument the LLM call beautifully and the retrieval call barely at all. You get an OpenInference trace with a clean CHAIN span, a fat LLM span carrying the prompt and completion, and a RETRIEVER span that, if it exists at all, contains a document count and nothing else. So when an answer is wrong, the dashboard points you at the model, because the model span is the only one with enough attributes to interrogate.

This is the retrieval-quality gap. Generation eval scores can look healthy while retrieval is quietly handing the model the wrong documents. The model then faithfully summarizes garbage, and a faithfulness eval gives it a passing grade because the answer IS grounded in the retrieved context. It is grounded in the wrong context. Arize cannot tell you that unless the retrieved documents are in the span.

The OpenInference semantic conventions already have a place for this. A retriever span carries a `retrieval.documents` attribute, an array where each entry has `document.id`, `document.content`, `document.score`, and `document.metadata`. If you populate those, the Arize UI renders each retrieved chunk with its score, and you can sort, filter, and eval against them. If you leave them empty, every retrieval looks identical and the failure is invisible. The first fix has nothing to do with any vendor. Log the documents.

Instrumenting the retriever span by hand

The auto-instrumentors for common frameworks will create retriever spans, but they only capture what the framework exposes. The moment your retrieval is a custom function (a direct query, an MCP tool call, a hand-rolled hybrid search) you own the span, and you should set the document attributes explicitly. Here is the shape using the OpenInference span attributes directly on top of OpenTelemetry, which is what Arize ingests.

The key detail is the attribute naming. It is positional: `retrieval.documents.0.document.content`, `retrieval.documents.1.document.content`, and so on. Get the indices wrong and the UI silently drops the document. Set `openinference.span.kind` to `RETRIEVER` so Arize renders the right card. Put the actual query text on `input.value` so you can later group failures by query pattern.

Once these are flowing, the retrieval span stops being a black box. You can open any trace where the answer was wrong and read exactly which documents the agent saw, with their scores, before the model ever ran. More often than not, the bad answer correlates with a low-scoring top result or a correct document that ranked fourth and fell outside your top-k cutoff. Neither of those is a model problem.

Setting OpenInference retriever attributes on a span

Populate `retrieval.documents` so Arize renders each chunk with its score.

from opentelemetry import trace
from openinference.semconv.trace import (
    OpenInferenceSpanKindValues,
    SpanAttributes,
    DocumentAttributes,
)

tracer = trace.get_tracer(__name__)

def traced_retrieve(query: str, docs: list[dict]):
    with tracer.start_as_current_span("retrieve") as span:
        span.set_attribute(
            SpanAttributes.OPENINFERENCE_SPAN_KIND,
            OpenInferenceSpanKindValues.RETRIEVER.value,
        )
        span.set_attribute(SpanAttributes.INPUT_VALUE, query)
        for i, doc in enumerate(docs):
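            # Keys are positional: retrieval.documents.<i>.document.<field>
            prefix = f"{SpanAttributes.RETRIEVAL_DOCUMENTS}.{i}"
            span.set_attribute(f"{prefix}.{DocumentAttributes.DOCUMENT_ID}", doc["id"])
            span.set_attribute(f"{prefix}.{DocumentAttributes.DOCUMENT_CONTENT}", doc["content"])
            span.set_attribute(f"{prefix}.{DocumentAttributes.DOCUMENT_SCORE}", doc["score"])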
        return docs

An eval that scores retrieval, not the answer

Once the documents are in the span, you can run an Arize eval over the retrieval step in isolation, which is the eval most teams skip. The built-in relevance template in `arize-phoenix-evals` takes the query and a single retrieved document and asks an LLM judge whether that document is relevant to the query. Run it per document, per span, and you get a precision signal: of the k documents you retrieved, how many were actually relevant?

This is different from a faithfulness or hallucination eval, which only looks at whether the final answer is supported by the retrieved context. Faithfulness can pass while relevance fails. The classic shape: query asks for the Q3 pricing of the enterprise tier, retrieval returns three documents about Q2 pricing of the pro tier, the model faithfully answers from them, faithfulness scores 1.0, and the user gets a wrong number. A retrieval relevance eval catches this because all three documents score irrelevant against the query.

Run the relevance eval as a batch over a day of production spans pulled from Arize, group the irrelevant hits by query pattern, and you have a ranked list of exactly where retrieval is failing. That list is the work. It tells you whether the fix is a better embedding model, a structural filter the embeddings can't express, or a reranking step.

⚠️

A passing faithfulness score is not a passing retrieval score

Faithfulness and hallucination evals check whether the answer is grounded in the retrieved context. They say nothing about whether the retrieved context was the right context. An agent can score 1.0 on faithfulness while answering from documents that are completely irrelevant to the user's question. Always eval retrieval relevance separately, or you will ship confident wrong answers with green dashboards.

Running a retrieval relevance eval over production spans

Score each retrieved document for relevance, independent of the final answer.

import phoenix as px
from phoenix.evals import (
    RelevanceEvaluator,
    OpenAIModel,
    run_evals,
)
from phoenix.session.evaluation import get_retrieved_documents

# Pull retriever spans logged to Arize/Phoenix and explode
# retrieval.documents into one row per (query, document).
# The result carries the `input` and `reference` columns the evaluator expects.
retrievals_df = get_retrieved_documents(px.Client())

relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4o"))

# run_evals returns one dataframe per evaluator, so take the first
results_df = run_evals(
    dataframe=retrievals_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]

# Overall split of relevant vs. unrelated documents across all spans;
# group by span to turn this into precision@k per query
print(results_df["label"].value_counts())
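The label counts above are an aggregate. To get the ranked list of failing queries, you need precision per retriever span. Here is a minimal sketch in plain Python, assuming you have already flattened the eval results into `(span_id, query, label)` tuples and that the judge emits the label `"relevant"` for a hit. The function name and the tuple shape are illustrative, not part of any Arize or Phoenix API.

```python
from collections import defaultdict


def precision_per_query(rows, relevant_label="relevant"):
    """Return (precision, query, span_id) tuples, worst precision first.

    rows: iterable of (span_id, query, label), one row per retrieved document.
    relevant_label: assumed label the judge emits for a relevant document.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    queries = {}
    for span_id, query, label in rows:
        totals[span_id] += 1
        hits[span_id] += int(label == relevant_label)
        queries[span_id] = query
    scored = [(hits[s] / totals[s], queries[s], s) for s in totals]
    # Ascending sort puts the spans with the fewest relevant documents on top
    return sorted(scored)
```

Sorting worst-first means the top of the list is where to start reading traces. Grouping those queries by pattern is still a manual or clustering step that this sketch does not attempt.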

Why the irrelevant documents were retrieved

Group your failing relevance evals and a pattern usually emerges: the query had a structural component that pure vector search could not honor. The user asked for the current price, but embeddings have no concept of publication state, so the agent retrieved a draft. The user asked for the spec of a specific product variant, but the embedding of the question is close to every variant, so cosine similarity returned the wrong one. The user asked about changes after a date, but a vector index has no notion of after.

This is the core limitation worth naming clearly. Vector search and RAG are not the same thing as good retrieval. They are one ingredient. Semantic similarity is excellent at fuzzy topical match and useless at the structural predicates that real queries carry: date ranges, authorship, reference relationships, product variant, locale, and publication status. When your relevance failures cluster around any of those, more embeddings will not help. You need retrieval that can express a structural filter AND a semantic match in the same query.
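To make the distinction concrete, here is a toy sketch in plain Python. It is not how any particular vendor implements hybrid retrieval. The document shape, the two-dimensional embeddings, and the function names are all made up for illustration. It shows that a structural predicate applied before ranking removes a candidate that pure similarity would have ranked first.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(docs, query_vec, predicate, k=3):
    """Filter by a structural predicate first, then rank survivors by similarity."""
    candidates = [d for d in docs if predicate(d)]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(d["embedding"], query_vec),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical corpus: the draft is the closest vector to the query
docs = [
    {"id": "draft-price", "status": "draft", "embedding": [1.0, 0.0]},
    {"id": "pub-price", "status": "published", "embedding": [0.9, 0.1]},
    {"id": "pub-other", "status": "published", "embedding": [0.0, 1.0]},
]
query_vec = [1.0, 0.0]

# Pure vector search: no predicate, so the draft wins on similarity
vector_only = hybrid_retrieve(docs, query_vec, lambda d: True, k=1)

# Hybrid: the draft is excluded before scoring ever runs
published_only = hybrid_retrieve(
    docs, query_vec, lambda d: d["status"] == "published", k=1
)
```

In this example `vector_only` returns the draft and `published_only` returns the published price document, which is the same failure and fix the traces tend to show.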

That is the gap that an Arize trace exposes so well, because once the documents are in the span you can see that the top result was topically close but structurally wrong. The diagnosis points directly at the fix: a retrieval call that takes predicates seriously.

Closing the gap with Sanity Context

This is where Sanity Context enters as part of the fix, not as a replacement for Arize. For structured content (a product catalog, an article archive, anything with a schema) Sanity Context exposes GROQ retrieval through the Context MCP endpoint, a hosted read-only MCP server your agent attaches like any other tool source. GROQ lets you put the structural predicates where they belong, inside the query filter, and run semantic similarity in the same pass.

The mechanism that matters here is hybrid retrieval in one query. Structural filters live as predicates inside `*[ ... ]`. Semantic ranking is a separate `score()` pipeline using `text::semanticSimilarity()`, which takes the query text directly. So you can say in one call: only published documents, only this product variant, ranked by semantic closeness to the user's question. The draft never gets retrieved because the filter excludes it before scoring runs. The wrong variant never ranks because it is filtered out, not merely outscored.

One nuance matters here, because it is the fact most often misframed: semantic search is opt-in and off by default. The large majority of production calls on Context MCP are plain structured GROQ and schema lookups, with no embeddings at all. You reach for `text::semanticSimilarity()` only when your Arize relevance evals show that a structural filter alone isn't enough. For unstructured corpora (PDFs, support exports, scraped sites) the right surface is Knowledge Bases instead, which organizes messy sources into retrievable documents. And for high-volume machine-generated data that needs no editorial governance, a dedicated vector DB still has its place. The routing is the point, not a single hammer.

Hybrid GROQ retrieval: structural filter plus semantic score

Published-only and variant-only predicates filter first; semantic similarity ranks what remains.

*[_type == "product" && status == "published" && variant == $variant]
| score(
    boost(category == $category, 2),
    text::semanticSimilarity(description, $queryText)
  )
| order(_score desc)
[0...5]{
  _id,
  title,
  variant,
  description,
  _score
}

Logging the GROQ call back into your Arize traces

The reason this pairs well with an observability practice: the Context MCP tool call and its result are legible. When the agent calls the MCP endpoint, you get a tool span with the exact GROQ query as input and the returned documents as output. Drop those into the same `retrieval.documents` attributes from the first section and they render in Arize exactly like any other retriever span, with per-document scores from `_score`.
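A minimal sketch of that mapping, under stated assumptions: the tool result has already been parsed into a list of dicts shaped like a GROQ projection with `_id`, `description`, and `_score`. The helper name is hypothetical, and how you obtain the parsed result depends on your MCP client. The attribute keys are the OpenInference names used earlier in this article, written out as plain strings so the helper has no dependencies.

```python
def groq_results_to_span_attributes(groq_query, results):
    """Flatten a GROQ query and its results into OpenInference attribute keys.

    results: list of dicts, assumed to carry `_id`, `description`, and
    optionally `_score` (present only when the query used score()).
    """
    attrs = {
        "openinference.span.kind": "RETRIEVER",
        "input.value": groq_query,
    }
    for i, doc in enumerate(results):
        prefix = f"retrieval.documents.{i}.document"
        attrs[f"{prefix}.id"] = doc["_id"]
        attrs[f"{prefix}.content"] = doc.get("description", "")
        # Plain structured queries have no _score; skip the attribute then
        if doc.get("_score") is not None:
            attrs[f"{prefix}.score"] = float(doc["_score"])
    return attrs
```

Inside an active OpenTelemetry span, the returned dict can be passed to `span.set_attributes(attrs)` in one call. Using `description` as the document content is a choice made for this sketch; pick whichever projected field your agent actually reads.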

Now your trace tells the whole story. You can open a failed conversation, read the GROQ query the agent issued, see which documents came back and at what score, and run the relevance eval over them. If a document was wrong, you can tell instantly whether it was a filter problem (the predicate let something through it shouldn't have) or a ranking problem (the right document was retrieved but ranked below your cutoff). Those are two different fixes, and you can only distinguish them when the query and its scored results are both in the span.
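That triage can be sketched as a small function. It assumes you know which document should have been returned and that you can check the intended predicate against each result independently of the query. The function name and the returned labels are illustrative. The `"missing"` case, where the right document never came back at all, is an addition of this sketch rather than something named in the text.

```python
def classify_retrieval_failure(ranked, expected_id, satisfies_predicate, cutoff):
    """Label a retrieval result as 'filter', 'ranking', 'missing', or 'ok'.

    ranked: documents in the order the query returned them, before any cutoff.
    expected_id: `_id` of the document that should have been used.
    satisfies_predicate: callable that returns True if a document meets the
        structural constraint the query was supposed to enforce.
    cutoff: number of top documents actually handed to the model.
    """
    top = ranked[:cutoff]
    # Filter problem: something the predicate should have excluded got through
    if any(not satisfies_predicate(d) for d in top):
        return "filter"
    ids = [d["_id"] for d in ranked]
    if expected_id not in ids:
        return "missing"
    # Ranking problem: the right document was retrieved, but below the cutoff
    if ids.index(expected_id) >= cutoff:
        return "ranking"
    return "ok"
```

Filter violations are checked first because a document that should never have been eligible makes the ranking question moot until the predicate is fixed.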

The loop closes here. Arize tells you retrieval is failing and clusters the failures by pattern. The cluster tells you whether the missing piece is a structural predicate or a semantic match. A hybrid GROQ query through Context MCP, or a Knowledge Base for unstructured sources, supplies the retrieval that honors both. And because that call logs cleanly, your next Arize eval run measures whether the fix actually worked. The model was never the problem. The span you forgot to instrument was.

💡

Trace the tool call, not just the LLM call

When you attach Context MCP as a tool source, instrument the tool span the same way you instrument any retriever: put the GROQ query on `input.value` and the returned documents (with their `_score`) on `retrieval.documents`. That single habit turns retrieval from the least observable step in your agent into the most debuggable one in Arize.