Observability & Evaluation

Trace agent calls into Sanity Context with Weights & Biases

Track agent runs in W&B Weave: trace every LLM call, tool invocation, and retrieval result so you can find the step that breaks before your eval scores tank.

Your eval suite passes locally, the demo nails every question, then a week into production the agent starts confidently citing a product that was discontinued in March. The W&B Weave dashboard shows the run completed, the model returned a fluent answer, latency was fine. Nothing is red. The failure is invisible because you logged the output, not the inputs the model actually saw.

The inputs the model saw came from somewhere. For most production agents, that somewhere is a structured retrieval call, and in this stack it's Context MCP, the read-only endpoint in Sanity Context that answers GROQ queries against your live schema. That's the span Weave needs to see.

That's the trap with agent observability: the part that breaks first is almost never the model. It's retrieval. The agent asked for "current pricing," your tool returned a draft document or a stale cache, and the LLM did exactly what it should with bad context. Weave can show you this, but only if you instrument the retrieval call as a first-class span, not as an opaque blob inside the generation.

This article covers how to trace agent runs in Weave so the failing step is obvious: structuring your @weave.op spans, attaching retrieval inputs to the trace, and wiring eval scorers that catch context drift. Then we'll connect the retrieval span to where the content actually lives, so when Weave tells you the agent saw the wrong document, you can trace it back to a query you can read and fix.

Why the run looks green when the answer is wrong

Weave traces a run as a tree of ops. You decorate a function with @weave.op, and every call becomes a span with inputs, outputs, latency, and token counts rolled up the tree. The default instinct is to decorate the top-level agent function and the LLM call, then ship. The dashboard fills with green runs and you move on.

The problem: a green run only means no exception was raised. An agent that retrieves the wrong document and answers fluently from it produces a clean trace with good latency and a confident output. There's nothing for Weave to flag because no contract was violated: the model did its job on the context it was handed.

The failure lives one level down, in the retrieval step, and it's invisible if that step isn't its own op. If your agent does `context = retrieve(query)` inside a larger `answer()` op without decorating `retrieve`, Weave sees one span: question in, answer out. You can't inspect what got pulled. When the answer is wrong you're left re-running the query by hand and guessing.

The fix is to make retrieval a span you can read. Decorate it, and log the actual query the agent issued plus the documents it got back as structured outputs, not a stringified summary. Then a wrong answer becomes a two-click investigation: open the run, expand the retrieval span, see that `current pricing` returned a draft from six weeks ago. The model was never the problem.

Make retrieval its own Weave op

Decorating retrieve() separately gives you a span you can expand to see exactly what the model received.

import weave
from openai import OpenAI

weave.init("support-agent")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# content_source and format_context are placeholders for your own
# retrieval client and prompt formatter; they are not part of Weave.

@weave.op()
def retrieve(query: str) -> list[dict]:
    # the documents the model will actually see this turn
    docs = content_source.search(query)
    return docs  # logged as structured output, not a summary string

@weave.op()
def answer(question: str) -> str:
    docs = retrieve(question)  # nested call, so it shows up as a child span
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": format_context(docs)},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

answer("What's the current price of the Pro plan?")

Attach the inputs to the trace, not just the output

Once retrieval is its own op, the next mistake is logging too little of it. Returning a list of document IDs feels clean, but six weeks later when you're debugging you can't tell whether ID `doc_8821` was the right answer or a stale draft. You need enough on the span to reconstruct the model's view of the world without re-running anything.

Weave gives you two levers here: structured op outputs and `weave.attributes`. Use the op output for the documents themselves: title, body excerpt, and, crucially, any state field such as `publishedAt` or `status`. Use attributes for metadata about the retrieval itself: the raw query, how many candidates were scored, which retrieval mode ran. Attributes attach to the span without polluting the function's return type, so your production code stays clean.

The reason `status` and `publishedAt` matter more than relevance score: most retrieval failures in production aren't semantic mismatches. They're structural. The agent retrieved a document that's semantically perfect but was a draft, or superseded, or scoped to a different region. A pure similarity score of 0.94 tells you nothing about that. The check that would have caught it, `status == "published"`, depends on a field developers forget to log, because a vector index only returns it if you stored it as metadata.

Log the structural fields and the wrong-document bugs surface on sight. You open the span, see `status: "draft"` next to a 0.94 score, and you know the retrieval predicate was missing a filter, not that the embeddings were bad.
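As a rough illustration of that split, the sketch below shapes a raw hit into a span-friendly document and collects retrieval metadata separately. The field names, the `to_span_doc` and `retrieval_metadata` helpers, and the sample hit are all hypothetical; the Weave decorator is left out so the snippet runs without a Weave project.

```python
from typing import Any

# Hypothetical state fields; use whichever ones your answers depend on.
STATE_FIELDS = ("status", "publishedAt", "locale")


def to_span_doc(raw: dict[str, Any], excerpt_chars: int = 280) -> dict[str, Any]:
    """Keep enough to reconstruct the model's view: identity, excerpt, state."""
    doc = {
        "_id": raw.get("_id"),
        "title": raw.get("title"),
        "excerpt": (raw.get("body") or "")[:excerpt_chars],
        "score": raw.get("score"),
    }
    # A missing state field is logged as None so the gap is visible on the span.
    for field in STATE_FIELDS:
        doc[field] = raw.get(field)
    return doc


def retrieval_metadata(query: str, candidates: list[dict], mode: str) -> dict[str, Any]:
    """Metadata about the retrieval itself, kept out of the return value."""
    return {
        "raw_query": query,
        "candidates_scored": len(candidates),
        "retrieval_mode": mode,
    }


# Hypothetical raw hit: a high score sitting next to a draft status.
raw_hits = [
    {"_id": "doc_8821", "title": "Pro plan pricing", "body": "Pro is $49/month.",
     "score": 0.94, "status": "draft", "publishedAt": None},
]
span_docs = [to_span_doc(hit) for hit in raw_hits]
metadata = retrieval_metadata("current pricing", raw_hits, mode="semantic")
print(span_docs[0]["status"], span_docs[0]["score"])
```

In a Weave setup, `span_docs` would be the return value of the decorated `retrieve` op, and `metadata` would be passed to `weave.attributes(...)` used as a context manager around the call.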

⚠️

A high similarity score is not a correct retrieval

The most common production failure is a document that scores 0.9+ on semantic similarity but is structurally wrong: a draft, a superseded version, or the wrong locale. If your retrieval span only logs the score, this bug is invisible in Weave. Always log the state fields (status, publishedAt, locale) the model's answer depends on.

Turn the failure mode into an eval scorer

Tracing finds the bug after it happens. To stop it shipping, you encode the failure as a Weave Evaluation. The instinct is to write a scorer that grades the final answer against a reference. That's useful, but it tells you that the answer was wrong, not why. The more valuable scorer grades the retrieval step directly.

Build a dataset of questions paired with the document IDs that should have been retrieved. Then write a scorer that runs against the retrieval span's output and checks two things: did the right document appear in the candidates, and were any structurally-invalid documents (drafts, superseded versions) in the set the model saw. The second check is the one that catches the silent failures, because a model handed one good doc and one stale draft will often blend them.

Weave's `Evaluation` runs your scorer across the dataset and gives you a regression surface. When you change the retrieval query, swap embedding models, or adjust a filter, you rerun the eval and watch the retrieval-recall and the no-stale-docs metrics move independently. A change that improves answer quality but starts leaking drafts shows up as a split: answers up, stale-doc-rate up. That split is exactly what a single answer-quality score hides.

A Weave scorer that grades retrieval, not just the answer

Grading the retrieval span surfaces stale-doc leaks that a final-answer scorer averages away.

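import asyncio

import weave
from weave import Evaluation

weave.init("support-agent")

@weave.op()
def retrieval_scorer(expected_doc_id: str, output: dict) -> dict:
    # output is whatever the evaluated op returned for this dataset row
    retrieved = output["docs"]
    ids = [d["_id"] for d in retrieved]
    stale = [d for d in retrieved if d.get("status") != "published"]
    return {
        "recall_hit": expected_doc_id in ids,
        "stale_docs_leaked": len(stale),
    }

evaluation = Evaluation(
    dataset=[
        {"question": "current Pro price?", "expected_doc_id": "pricing_2026"},
        {"question": "refund window?", "expected_doc_id": "policy_refunds"},
    ],
    scorers=[retrieval_scorer],
)

# answer_with_retrieval is assumed to be a @weave.op that takes `question`
# and returns {"docs": [...], "answer": "..."} so the scorer can see the docs.
asyncio.run(evaluation.evaluate(answer_with_retrieval))
# In a notebook with a running event loop, use instead:
#   await evaluation.evaluate(answer_with_retrieval)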
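To make the "split" concrete, here is a minimal plain-Python sketch of how the two metrics can move independently. It applies the same scoring logic to two hypothetical sets of retrieval outputs, before and after a query change. The `summarize` helper and the sample data are assumptions for illustration only, not part of Weave, which computes its own summaries.

```python
def score_retrieval(expected_doc_id: str, docs: list[dict]) -> dict:
    """Same checks as the scorer: was the right doc retrieved, did stale docs leak?"""
    ids = [d["_id"] for d in docs]
    stale = [d for d in docs if d.get("status") != "published"]
    return {"recall_hit": expected_doc_id in ids, "stale_docs_leaked": len(stale)}


def summarize(rows: list[dict]) -> dict:
    """Aggregate per-row scores into the two metrics tracked across changes."""
    n = len(rows)
    return {
        "retrieval_recall": sum(r["recall_hit"] for r in rows) / n,
        "stale_doc_rate": sum(r["stale_docs_leaked"] > 0 for r in rows) / n,
    }


# Hypothetical outputs: (expected_doc_id, docs the model saw)
before = [
    ("pricing_2026", [{"_id": "pricing_2025", "status": "published"}]),
    ("policy_refunds", [{"_id": "policy_refunds", "status": "published"}]),
]
after = [
    ("pricing_2026", [
        {"_id": "pricing_2026", "status": "published"},
        {"_id": "pricing_2026_draft", "status": "draft"},
    ]),
    ("policy_refunds", [{"_id": "policy_refunds", "status": "published"}]),
]

before_summary = summarize([score_retrieval(e, d) for e, d in before])
after_summary = summarize([score_retrieval(e, d) for e, d in after])
print(before_summary)
print(after_summary)
```

In this made-up data the change fixes recall on the pricing question but lets a draft into the candidate set, so recall and stale-doc rate both rise. A single answer-quality number would report only the improvement.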

The trace points at retrieval; now make retrieval readable

Weave has done its job. You've narrowed every confident-but-wrong run to one span: the agent asked for current pricing and got back a draft. The remaining problem is on the other side of that span. Your retrieval call is a black box that returns document IDs from a vector index, and "fix the retrieval" means re-embedding a corpus, tuning a `top_k`, and hoping the structural filter you bolt on actually maps to a field in the source.

The root cause is usually that the query the agent issued couldn't express the constraint that mattered. "Current pricing" has a structural component (published, not draft; this fiscal year, not last) that pure vector similarity can't resolve. The embedding doesn't know what `status` means. So the filter has to live somewhere the source content actually models those fields.

This is where the retrieval span connects to Sanity Context. Sanity Context exposes your structured content through GROQ, a query language where the structural predicate and the semantic match live in one expression. The state filter isn't a post-hoc bolt-on; it's a predicate inside `*[ ... ]`, evaluated against fields the content is genuinely typed with. The span you log in Weave becomes a query you can read, paste into a query tool, and run, instead of an opaque vector lookup. When Weave shows a stale doc leaked, you look at the GROQ, see the missing `&& status == "published"`, add it, and rerun the eval to confirm the leak is gone.

ℹ️

Most calls are structured, not semantic

Production data on agents running over Sanity Context shows the heavy majority of retrieval calls are structured (GROQ queries and schema lookups), with semantic search a small slice. Embeddings are opt-in and off by default. Reach for vector similarity when the agent's failures justify it, not as the default for every lookup.

One query that does the structural filter and the semantic rank

The retrieval bug you traced in Weave was a missing constraint. The fix is a single GROQ query where the structural predicate gates the candidate set and a score expression ranks what survives. Structural filters go inside the `*[ ... ]` brackets as predicates. The semantic part runs in a `score()` pipeline that combines a keyword match with similarity (when you've opted into embeddings). You then order by `_score`.

The key property for your Weave traces: this whole thing is one readable expression. Log it as the input on your retrieval span and the trace tells the full story. You can see that the query asked for published documents, scored them by relevance to the user's text, and returned the top match. No separate "first we filtered, then we re-ranked, then we post-filtered" reconstruction across three services. When something leaks, the predicate that should have stopped it is right there in the span.

The default path is the Context MCP endpoint, a hosted, read-only MCP server your agent loop attaches like any other tool. The agent gets schema-aware retrieval tools without you writing query plumbing, and the queries those tools run are the thing you log to Weave. For full control you wrap a typed GROQ query in your own tool and log the query string yourself. Either way the retrieval span stops being a black box.

Structural predicate plus semantic rank in one GROQ query

status == "published" is a predicate inside the filter; semantic ranking happens in score(). Log this string on your Weave retrieval span.

*[_type == "article" && status == "published"]
  | score(
      boost(title match text::query($queryText), 3),
      text::semanticSimilarity($queryText)
    )
  | order(_score desc)
  [0...5]
  { _id, title, status, publishedAt, "excerpt": body[0...280] }
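If you wrap the query in your own tool, one way to keep the span readable is to log the exact query string and its parameters, and to check the filter for required predicates before the query ships. The sketch below is an assumption-laden illustration: `build_retrieval_input`, `filter_clause`, and `missing_predicates` are hypothetical helpers, and the predicate check is a naive substring match that does not handle nested brackets.

```python
# Query string copied from the example above; $queryText is bound at call time.
GROQ_QUERY = """
*[_type == "article" && status == "published"]
  | score(
      boost(title match text::query($queryText), 3),
      text::semanticSimilarity($queryText)
    )
  | order(_score desc)
  [0...5]
  { _id, title, status, publishedAt, "excerpt": body[0...280] }
""".strip()


def build_retrieval_input(query_text: str, groq: str = GROQ_QUERY) -> dict:
    """The record to attach to the retrieval span: exact query plus its params."""
    return {"groq": groq, "params": {"queryText": query_text}}


def filter_clause(groq: str) -> str:
    """Return the text inside the leading *[ ... ] filter (naive: no nesting)."""
    start = groq.index("*[") + 2
    end = groq.index("]", start)
    return groq[start:end]


def missing_predicates(groq: str, required=('status == "published"',)) -> list[str]:
    """List required predicates that do not appear in the filter clause."""
    clause = filter_clause(groq)
    return [p for p in required if p not in clause]


span_input = build_retrieval_input("current Pro price")
print(missing_predicates(span_input["groq"]))
print(missing_predicates('*[_type == "article"] | order(_score desc)'))
```

A check like this could run in CI or inside the retrieval op itself, so a query that drops the state filter is caught before it shows up as a stale document in a trace.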

When the corpus is messy, and when a vector DB still wins

Not every retrieval problem is a structured-content problem, and pretending it is leads to the opposite failure. If your agent answers from PDFs, scraped marketing pages, or a support ticket archive (content with no schema, no `status` field, no clean type), GROQ predicates have nothing to bite on. For that, Sanity Context Knowledge Bases (launched September 2026) is the path: it turns messy sources into well-ordered documents with a table of contents the agent can navigate, so the retrieval span gets back a structured chunk instead of a raw blob.

There's also a case where you should keep your existing vector DB and not route through Sanity at all. High-volume, machine-generated corpora that no human will ever edit or approve (logs, event streams, embeddings of millions of auto-generated records) don't need editorial governance. A dedicated Pinecone or pgvector index is the right tool, and your Weave traces should just log that retrieval honestly. The point isn't to centralize everything; it's to make each retrieval span readable enough that the failing one is obvious.

The rule of thumb that keeps your traces clean: structured, governed content the agent quotes as fact (pricing, policies, product specs) belongs behind GROQ, where the structural filter is enforceable and the query is loggable. Unstructured source material belongs in Knowledge Bases. Throwaway machine data stays in your vector DB. Get the routing right and every retrieval span in Weave tells you not just what the agent saw, but why it was allowed to see it.
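The rule of thumb can be written down as a small routing function. This is only a sketch of the decision described above: the `Source` attributes, the `Route` names, and the example sources are hypothetical, and a real router would likely need more signals than two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    GROQ = "groq"                      # structured, governed content
    KNOWLEDGE_BASE = "knowledge_base"  # unstructured source material
    VECTOR_DB = "vector_db"            # throwaway machine-generated data


@dataclass
class Source:
    name: str
    has_schema: bool         # typed fields such as status or publishedAt exist
    machine_generated: bool  # no human will ever edit or approve it


def route(source: Source) -> Route:
    """Pick the retrieval path for a content source."""
    if source.machine_generated:
        return Route.VECTOR_DB
    if source.has_schema:
        return Route.GROQ
    return Route.KNOWLEDGE_BASE


sources = [
    Source("pricing", has_schema=True, machine_generated=False),
    Source("support_pdfs", has_schema=False, machine_generated=False),
    Source("event_logs", has_schema=False, machine_generated=True),
]
routes = {s.name: route(s).value for s in sources}
print(routes)
```

Logging the chosen route as an attribute on the retrieval span is one way to record why the agent was allowed to see a given document.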

💡

Log the query, not a summary of it

Whichever path your retrieval takes (GROQ, Knowledge Bases, or a vector DB), attach the exact query string and the returned documents' state fields to the Weave span. A debuggable trace is one you can replay without rerunning the agent. A summary string forces a reproduction; the raw query and result let you fix it on sight.