Trace agent retrieval into Sanity Context with OpenTelemetry

Your agent gave a confidently wrong answer in production. You open the trace in your collector, and the LLM span is right there: model, token counts, latency, the final completion. But the span that matters is empty. You can see that retrieval happened, you can see it took 240ms, and you have no idea what it returned. The document that should have grounded the answer? Maybe it was never fetched. Maybe it was fetched and ranked fourth. The trace can't tell you, because nobody instrumented the retrieval step with anything more than a duration.

In this setup, the retrieval step is Sanity Context, specifically a GROQ query issued against Context MCP. That call is the span you're missing, and it's exactly where a structured trace should reach.

This is the gap in most agent telemetry. OpenTelemetry gives you a clean spec for spans, attributes, and events, but the GenAI semantic conventions are young, and the retrieval half of RAG is the part teams skip. So debugging a bad answer turns into re-running the query by hand and squinting at the output.

This article is about closing that gap: how to design retrieval spans that capture the query, the candidates, and the ranking; which GenAI semantic-convention attributes to use so your spans aren't bespoke noise; and how to wire the trace down into the actual retrieval call, including a Sanity Context GROQ query, so the span shows the real document set the model saw.

The empty retrieval span

Most agent stacks instrument the LLM call and stop there. The auto-instrumentation for OpenAI or Anthropic gives you `gen_ai.request.model`, prompt and completion token counts, maybe the full messages array if you opted into content capture. That's the generation half. The retrieval half, the part that decides whether the model has the right facts in front of it, usually shows up as a single nested span called something like `retrieve` with a `duration_ms` and nothing else.

When the answer is wrong, that span is useless. You know retrieval ran. You don't know what query string went out, how many candidates came back, what their scores were, or which ones made it into the prompt. So you reconstruct it: copy the user's message, re-run the retriever locally, and hope the corpus hasn't changed since the trace was recorded. For a flaky, hard-to-reproduce failure that's an afternoon gone.

The fix is to treat retrieval as a first-class operation with its own span and its own attributes. A retrieval span should answer three questions without you leaving the trace viewer: what did we search for, what came back, and what did we actually use. That means the query text, the candidate count, the per-document IDs and scores, and the cut that fed the prompt. Once those are on the span, a wrong answer stops being a mystery: you can see at a glance whether the failure was retrieval (wrong documents) or generation (right documents, bad synthesis). That distinction is the single most valuable thing a trace can give you, and it's exactly what the empty span throws away.

ℹ️

Retrieval failures dominate

In production agent debugging, the failure usually traces back to a bad retrieval, not a bad model. If your trace only instruments the LLM call, you're observing the half that's least likely to be the problem. Log what the agent saw, the query and the result set, alongside what it said.

Span structure: a parent operation with retrieval as a child

Before you decide on attributes, decide on the span tree. A single agent turn is the parent span. Underneath it sit the operations: one or more retrieval spans, then the LLM generation span, sometimes a re-rank or a tool call in between. Keeping retrieval as a sibling of generation, not folded into it, is what lets you measure each independently and attribute latency correctly.

Use the GenAI semantic conventions where they exist so your spans are portable across collectors and not a private dialect. The conventions name the operation with `gen_ai.operation.name`. For the generation span that's `chat`. For retrieval, the conventions are still settling, so a stable choice is to name the span after the system doing the work and set a clear `gen_ai.operation.name` of your own, plus standard `db.*` or custom `retrieval.*` attributes for the query specifics.

The structure below creates the parent turn span, then opens a child span for retrieval that you'll populate with query and result attributes in the next section. The key calls are `tracer.startActiveSpan`, which makes the new span the active context so anything you instrument inside it nests correctly, and `span.end()` in a `finally` so a thrown error still closes the span and records the partial trace.

Parent turn span with a child retrieval span

Retrieval is a sibling of generation under the turn span — not folded into it — so latency and failures attribute cleanly.

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('agent-retrieval', '1.0.0');

async function handleTurn(userMessage: string) {
  return tracer.startActiveSpan('agent.turn', async (turnSpan) => {
    // 'invoke_agent' is the well-known GenAI semconv value for an agent invocation.
    turnSpan.setAttribute('gen_ai.operation.name', 'invoke_agent');
    try {
      const docs = await tracer.startActiveSpan(
        'agent.retrieve',
        async (retrieveSpan) => {
          try {
            const results = await retrieve(userMessage);
            // attributes populated in the next section
            return results;
          } catch (err) {
            retrieveSpan.recordException(err as Error);
            retrieveSpan.setStatus({ code: SpanStatusCode.ERROR });
            throw err;
          } finally {
            retrieveSpan.end();
          }
        },
      );

      const answer = await generate(userMessage, docs);
      return answer;
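    } catch (err) {
      // Record the exception on the turn span too, so the failure is visible at the root.
      turnSpan.recordException(err as Error);
      turnSpan.setStatus({ code: SpanStatusCode.ERROR });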
      throw err;
    } finally {
      turnSpan.end();
    }
  });
}

Attributes and events: what the retrieval span should carry

A retrieval span needs to capture three layers without bloating every trace. First, the query: the literal text that went to the retriever, plus any structural filters applied, such as a date range, an author, or a publication state. Second, the result set: how many candidates came back and, per candidate, an ID and a relevance score. Third, the selection: which of those candidates actually made it into the prompt after truncation or a top-k cut.

Put scalar facts on span attributes: `retrieval.query`, `retrieval.candidate_count`, `retrieval.selected_count`. Put the per-document detail on span events, one event per result, each carrying the document ID and score. Events are the right primitive here because they're timestamped and repeatable, whereas attributes are a flat key-value bag. A span with twelve `retrieval.result` events reads in any trace viewer as an ordered, scored candidate list, which is exactly the artifact you wanted when the answer came back wrong.

Watch the cardinality and the payload size. Don't put full document bodies on the span; put IDs and scores, and keep the raw query text. If your collector charges by span volume or attribute bytes, capping result events at the top 20 candidates keeps cost bounded while still showing the ranking decision. The point isn't to mirror the whole corpus into your traces. The point is to make the retrieval decision auditable, to see that document `article-9f2` scored 0.81 and ranked first, while the one that should have answered the question scored 0.34 and never made the cut.

Recording query, candidates, and selection on the span

Scalars on attributes, per-document detail on timestamped events. IDs and scores only — never full document bodies.

import type { Span } from '@opentelemetry/api';

type Doc = { _id: string; _score: number; title: string };

function recordRetrieval(
  span: Span,
  query: string,
  candidates: Doc[],
  selected: Doc[],
) {
  span.setAttribute('retrieval.query', query);
  span.setAttribute('retrieval.candidate_count', candidates.length);
  span.setAttribute('retrieval.selected_count', selected.length);

  const selectedIds = new Set(selected.map((d) => d._id));
  for (const [rank, doc] of candidates.slice(0, 20).entries()) {
    span.addEvent('retrieval.result', {
      'retrieval.doc.id': doc._id,
      'retrieval.doc.score': doc._score,
      'retrieval.doc.rank': rank,
      'retrieval.doc.selected': selectedIds.has(doc._id),
    });
  }
}
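
One way to check what a span ends up holding is to run the same recording logic against a stub instead of a live tracer. The sketch below is illustrative only. `SpanLike` and `RecordingSpan` are hypothetical stand-ins that mirror just the `setAttribute` and `addEvent` methods, `recordRetrievalCapped` is a hypothetical variant of the function above with the cap as a parameter, and `retrieval.dropped_event_count` is an assumed attribute name, not part of any convention.

```typescript
// Hypothetical minimal shape: only the two span methods this sketch needs.
// A real OpenTelemetry Span should be structurally compatible with it.
type AttrValue = string | number | boolean;

interface SpanLike {
  setAttribute(key: string, value: AttrValue): unknown;
  addEvent(name: string, attributes?: Record<string, AttrValue>): unknown;
}

type ScoredDoc = { _id: string; _score: number };

// Hypothetical test double that keeps everything it is given in memory.
class RecordingSpan implements SpanLike {
  attributes: Record<string, AttrValue> = {};
  events: { name: string; attributes: Record<string, AttrValue> }[] = [];

  setAttribute(key: string, value: AttrValue): this {
    this.attributes[key] = value;
    return this;
  }

  addEvent(name: string, attributes: Record<string, AttrValue> = {}): this {
    this.events.push({ name, attributes });
    return this;
  }
}

// Same idea as recordRetrieval, with the event cap made explicit and the
// number of candidates left out of the events recorded as a scalar.
function recordRetrievalCapped(
  span: SpanLike,
  query: string,
  candidates: ScoredDoc[],
  selected: ScoredDoc[],
  maxEvents = 20,
): void {
  span.setAttribute('retrieval.query', query);
  span.setAttribute('retrieval.candidate_count', candidates.length);
  span.setAttribute('retrieval.selected_count', selected.length);
  span.setAttribute(
    'retrieval.dropped_event_count', // assumed name, see lead-in
    Math.max(0, candidates.length - maxEvents),
  );

  const selectedIds = new Set(selected.map((d) => d._id));
  candidates.slice(0, maxEvents).forEach((doc, rank) => {
    span.addEvent('retrieval.result', {
      'retrieval.doc.id': doc._id,
      'retrieval.doc.score': doc._score,
      'retrieval.doc.rank': rank, // zero-based position in the candidate list
      'retrieval.doc.selected': selectedIds.has(doc._id),
    });
  });
}
```

Passing a `RecordingSpan` in a unit test lets you assert on the cap and on the `selected` flags without an exporter. Recording the dropped count keeps a capped span honest: a reviewer can tell "not in the top 20 events" apart from "never returned".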

Wiring it into the OTLP pipeline

None of this lands in your collector without an exporter and a processor. The minimum viable setup is the OTLP HTTP exporter pointed at your collector or backend, wrapped in a `BatchSpanProcessor` so spans flush in batches instead of one network call per span. Register a `Resource` with `service.name` so your agent's traces are attributable in a backend that's collecting from a dozen services.

The one knob worth getting right early is sampling. Agent turns are expensive to trace fully: a parent span, several retrieval spans with twenty events each, a generation span with the full messages array. Head sampling at a fixed ratio is the blunt default, but for agents you usually want every errored or low-confidence turn kept regardless of the ratio. That's tail sampling, and it lives in the collector config, not the SDK: sample 100% of traces where the turn span has an error status, plus a small percentage of the rest. The retrieval events you spent effort recording are precisely the data you want retained on the failures, so don't let head sampling drop them.

The SDK setup below is the Node bootstrap you import once at process start. The collector-side tail-sampling policy is the second half of the story and belongs in your `otel-collector-config.yaml`.

⚠️

Head sampling drops your evidence

A 10% head-sampling ratio means 90% of your wrong-answer traces, and the retrieval events that explain them, never reach the backend. For agents, configure tail sampling in the collector to keep 100% of error-status turns. The traces you most need to debug are exactly the ones a fixed ratio throws away.

Node SDK bootstrap with OTLP export

Import once at process start. Pair with a tail-sampling policy in the collector so errored turns are always kept.

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { resourceFromAttributes } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: 'agent-api',
  }),
  spanProcessors: [
    new BatchSpanProcessor(
      new OTLPTraceExporter({
        url: 'https://collector.internal:4318/v1/traces',
      }),
    ),
  ],
});

sdk.start();

process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
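
For the collector side, a tail-sampling policy along the lines described above might look like the fragment below. This is a sketch under assumptions: it relies on the `tail_sampling` processor from the Collector contrib distribution, and the `decision_wait`, the 5% baseline, and the exporter endpoint are placeholder values to tune for your own deployment.

```yaml
# otel-collector-config.yaml (fragment). Assumes the Collector "contrib"
# distribution, which includes the tail_sampling processor.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    # Assumed value: how long to buffer a trace before deciding.
    # It should comfortably exceed your slowest agent turn.
    decision_wait: 10s
    policies:
      # Keep every trace that contains a span with ERROR status.
      - name: keep-errored-turns
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep a small share of everything else (assumed baseline).
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  otlphttp:
    endpoint: https://backend.example.com # placeholder backend URL

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlphttp]
```

A trace is kept if any policy votes to sample it, so errored turns survive regardless of the probabilistic ratio. For this to work, the SDK has to export every trace to the collector, so leave the SDK's head sampler at its always-on default.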

Instrumenting the actual retrieval call

The span structure is only as honest as the call it wraps. If your retriever is a vector search over a Pinecone index, the span should carry the index name, the top-k, and the returned IDs and scores. If it's a query against your structured content, the span should carry the query itself, and here is where the trace stops being a black box, because a structured query is legible in a way an opaque embedding lookup is not.

When the retrieval source is Sanity Context, the agent reaches structured content through a GROQ query, either over the hosted Context MCP endpoint (the default) or through a thin `tool()` wrapper that runs a typed query when you want full control. The thing worth knowing for telemetry: most production retrieval here is structured, meaning GROQ predicates and schema lookups, with semantic similarity as an optional layer that's off by default and that most projects never turn on. That matters for span design, because a structured query is self-documenting. You record the GROQ string on the span and a reviewer can read the filters, the ordering, and the projection without re-running anything.

The canonical retrieval call combines a structural filter (a predicate inside the `*[ ... ]` brackets) with relevance ordering via the `score()` pipeline. The query below filters published articles, scores them by keyword match plus the optional semantic-similarity term, and orders by the resulting `_score`. Record this exact string as `retrieval.query` on the span, capture the returned `_id` and `_score` per document as events, and your trace now shows the full retrieval decision: the filter that constrained the candidate set and the ranking that ordered it.

Wrapping a Sanity Context GROQ retrieval in a span

The GROQ string goes on the span verbatim. A structural predicate constrains candidates; score() ranks them — both are legible in the trace.

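import { trace, SpanStatusCode } from '@opentelemetry/api';
import { createClient } from 'next-sanity';

// Shape of each projected result: matches the { _id, _score, title } projection.
type RetrievedDoc = { _id: string; _score: number; title: string };

const sanity = createClient({
  projectId: process.env.SANITY_PROJECT_ID!,
  dataset: 'production',
  apiVersion: '2026-02-19',
  useCdn: false,
});

const RETRIEVAL_QUERY = `
*[_type == "article" && publishStatus == "published"]
  | score(
      title match text::query($queryText),
      boost(body match text::query($queryText), 0.5),
      text::semanticSimilarity($queryText)
    )
  | order(_score desc)[0...10]{ _id, _score, title }`;

const tracer = trace.getTracer('agent-retrieval');

export async function retrieve(queryText: string) {
  return tracer.startActiveSpan('sanity.groq.query', async (span) => {
    span.setAttribute('db.system', 'sanity');
    // The GROQ template and the user's text are recorded as separate attributes.
    span.setAttribute('retrieval.query', RETRIEVAL_QUERY);
    span.setAttribute('retrieval.query_text', queryText);
    try {
      const docs: RetrievedDoc[] = await sanity.fetch(RETRIEVAL_QUERY, { queryText });
      span.setAttribute('retrieval.candidate_count', docs.length);
      // One event per returned document, so the ranking is visible in the trace.
      docs.forEach((doc, rank) => {
        span.addEvent('retrieval.result', {
          'retrieval.doc.id': doc._id,
          'retrieval.doc.score': doc._score,
          'retrieval.doc.rank': rank,
        });
      });
      return docs;
    } catch (err) {
      // A failed query should mark the span as errored, not just end it.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}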

Reading the trace: separating retrieval failures from generation failures

With retrieval instrumented this way, debugging a wrong answer follows a fixed path. Open the trace for the bad turn. Look at the retrieval span first. The `retrieval.result` events show you the ranked candidate list with scores. Two outcomes, two different bugs.

If the document that should have answered the question isn't in the candidate list at all, or it's there but scored low and got cut, that's a retrieval failure. The model never had the right context. Now the question is why the right document didn't surface, and the span tells you: maybe a structural filter was too tight and excluded it, maybe the query text didn't match the document's vocabulary and a keyword-only ranking missed it. This is the case where the optional semantic layer earns its place: the `text::semanticSimilarity()` term inside the same `score()` pipeline, as in the query above, lets a query phrased differently from the document still rank. You turn that on because the trace showed you a specific failure, not because hybrid search is a default worth paying for everywhere.

If the right document is right there at rank one with a healthy score, and the answer is still wrong, retrieval did its job. The bug is downstream: prompt construction, truncation that cut the relevant passage, or the model synthesizing badly from good context. You stop poking at the retriever and start looking at what got assembled into the prompt, which, if you captured the selection on the span, is also visible. The whole value of the span design is this fork in the road. Without it, every wrong answer is one undifferentiated mystery. With it, the trace tells you which half of the system to fix before you write a single line of debugging code.
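
The fork described above can also be written down as a small triage helper, which is handy if you export spans to a script or notebook. This is a minimal sketch, not a prescribed tool: `ResultEvent`, `Diagnosis`, and `diagnoseTurn` are hypothetical names, and the event shape assumes the `retrieval.doc.*` attributes recorded earlier, including the `selected` flag.

```typescript
// Hypothetical shape of one exported `retrieval.result` event's attributes.
type ResultEvent = {
  'retrieval.doc.id': string;
  'retrieval.doc.score': number;
  'retrieval.doc.rank': number;
  'retrieval.doc.selected': boolean;
};

type Diagnosis =
  | { kind: 'retrieval-missing' } // expected doc never appeared in the events
  | { kind: 'retrieval-cut'; rank: number; score: number } // returned, not used
  | { kind: 'generation'; rank: number; score: number }; // used, answer still wrong

// Given the events from a bad turn and the ID of the document that should
// have grounded the answer, decide which half of the system to look at first.
function diagnoseTurn(events: ResultEvent[], expectedDocId: string): Diagnosis {
  const hit = events.find((e) => e['retrieval.doc.id'] === expectedDocId);

  // Not in the recorded list: filtered out, never matched, or beyond the
  // event cap. All three point at retrieval, not the model.
  if (!hit) return { kind: 'retrieval-missing' };

  const rank = hit['retrieval.doc.rank'];
  const score = hit['retrieval.doc.score'];

  // Returned but dropped before the prompt was built.
  if (!hit['retrieval.doc.selected']) return { kind: 'retrieval-cut', rank, score };

  // The model saw the document, so look at prompt assembly and synthesis.
  return { kind: 'generation', rank, score };
}
```

The helper only encodes the first cut. A document that was selected but ranked low with a weak score may still deserve a look at the retriever, which is why the rank and score are returned alongside the verdict.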

💡

Log the query, not just the embedding

A structured GROQ query reads as plain text on the span: a reviewer sees the filters and ordering and understands the retrieval without re-running it. An opaque embedding lookup forces you to reconstruct what the agent saw. Prefer the legible artifact in your traces; it's the difference between reading a failure and reproducing it.