Vector & Retrieval

When to use Chroma vs Sanity Context for AI retrieval

Chroma

Open-source embedding database for AI apps: store, embed, and query documents with a few lines of Python or JS, with no infra to babysit.

You started with Chroma because it got out of your way. `chromadb.PersistentClient()`, one `add()` call, one `query()` call, and your RAG prototype was answering questions by lunch. That's the whole pitch and it's a good one: Chroma is the fastest path from "I have some documents" to "my agent can search them."

Where Chroma ends, structure begins. Sanity Context covers the structural side: its Context MCP endpoint answers GROQ queries against typed schemas, so predicates like publication state or date are first-class filters, not afterthoughts bolted onto similarity scores.

Then the demo became a product. Your collection grew past a few hundred thousand chunks. Queries that felt instant now take a beat. Someone asks "show me the Q3 pricing doc, not the Q2 one" and the nearest-neighbor search cheerfully returns Q2 because the embeddings of the two docs are nearly identical. A user filters by author and date, and you realize your `where` clauses are doing structured filtering on top of a system that was built for similarity, not predicates.

None of this means Chroma was the wrong call. It means you've hit the seam between what an embedding store does well and what your application actually needs. This article maps that seam: where Chroma is genuinely the right answer, the failure modes that show up at scale, and how to tell when a query's structural component (date, author, publication state, variant) is the thing tanking your retrieval, not the embeddings.

Where Chroma earns its place: fast, local, zero-infra retrieval

Chroma's reason to exist is the gap between "I have documents" and "I can query them by meaning", and it closes that gap in about five lines. You don't provision a cluster, you don't write a schema, you don't think about index types. You add documents, optionally with metadata, and you query with text. Chroma handles the embedding (default is `all-MiniLM-L6-v2` via its bundled ONNX runtime) or takes your own vectors if you've already computed them.

This is the right tool when your corpus is genuinely unstructured and machine-scale: scraped pages, transcripts, log lines, support tickets, chunks of PDFs where the only thing you want to ask is "what's semantically near this query." It's also the right tool for a prototype, full stop. The `PersistentClient` writes to a local on-disk store (SQLite-backed since Chroma 0.4; older releases used DuckDB+Parquet), so you can ship a working RAG loop with no network dependency and no bill.

The place developers get burned is treating the prototype's ergonomics as a guarantee they'll hold at 10M vectors and 50 QPS. They won't, and that's fine; Chroma never promised they would. Knowing the boundary is the point.

A complete Chroma RAG loop

From empty store to working semantic search in one file.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_store")

# Chroma embeds for you with a default sentence-transformer
embedder = embedding_functions.DefaultEmbeddingFunction()

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=embedder,
)

collection.add(
    documents=["Q3 pricing went up 8%", "Q2 pricing held flat"],
    metadatas=[{"quarter": "Q3", "type": "pricing"},
               {"quarter": "Q2", "type": "pricing"}],
    ids=["doc-q3", "doc-q2"],
)

results = collection.query(
    query_texts=["how did pricing change last quarter"],
    n_results=2,
)
print(results["documents"])

The failure mode at scale: when the query has a structural component

Here's the query that breaks naive Chroma usage: "Show me the Q3 pricing doc." You embed that text, you run nearest-neighbor, and you get back both the Q3 and Q2 pricing docs ranked almost identically, because the sentences "Q3 pricing went up 8%" and "Q2 pricing held flat" live a hair apart in embedding space. The model captured "this is about pricing." It did not reliably capture "this is specifically Q3," because a single token's difference barely moves a 384-dimension vector.

The instinct is to reach for Chroma's `where` filter. And `where` works for metadata you stored at ingest time: `collection.query(query_texts=[...], where={"quarter": "Q3"})` will correctly scope the result. It also accepts comparison and logical operators (`$gte`, `$in`, `$and`, `$or`), so a numeric range or a simple combination is expressible. The problem is everything you didn't anticipate and flatten in advance: a date stored as a string instead of a number, a predicate that crosses a reference (the author belongs to team X), or a field you never denormalized into metadata. Chroma's metadata filtering is a filter attached to a similarity index, not a query planner over your content model. Predicates that depend on structure you didn't store either aren't expressible or force you to over-fetch and filter in Python.

The deeper issue is architectural. You have two retrieval problems wearing one trenchcoat: a *semantic* problem ("about pricing") and a *structural* problem ("Q3, published, by finance"). Chroma is excellent at the first and adequate-at-best at the second. When the structural component is what determines correctness, and for anything editorial, it usually is, pure vector search returns confidently wrong documents.

⚠️

Embedding distance ignores structure

Two documents that differ only in a date, a status flag, or a product variant often sit within rounding distance in embedding space. If correctness depends on that distinction, no amount of re-ranking the vectors will fix it. You need the structural predicate to run as a filter, not as a hope baked into the embedding.

Operating Chroma in production: the parts the quickstart skips

Once you're past the prototype, three operational realities show up. First, persistence and concurrency. `PersistentClient` is meant for a single process; the moment you have multiple workers or a serverless function fanning out, you need Chroma in client/server mode (`chroma run --path ./store`) and an `HttpClient`. That's a deployment unit you now own: uptime, backups, the works.

Second, the embedding function is part of your data contract. If you ingested with `all-MiniLM-L6-v2` (384 dimensions) and later query with OpenAI's `text-embedding-3-small` (1536 dimensions by default), Chroma rejects the query with a dimension mismatch, which is the lucky case. Swap between two models that share a dimension and the vectors live in different spaces: your results are garbage with no error thrown. Pin the embedding function per collection and treat changing it as a full re-index, not a config tweak.
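
One way to make that contract enforceable is to record the model name next to the collection and refuse to query when it doesn't match. The sketch below is an illustration, not a Chroma feature: the `embedding_model` key and the `check_embedding_contract` helper are hypothetical names, and the plain dict stands in for whatever per-collection metadata you persist (Chroma collections accept a `metadata` dict at creation, which is one place to keep it).

```python
class EmbeddingContractError(RuntimeError):
    """Raised when a query would use a different model than ingest did."""


def check_embedding_contract(collection_metadata: dict, query_model: str) -> None:
    # "embedding_model" is a hypothetical key written once at ingest time.
    ingest_model = collection_metadata.get("embedding_model")
    if ingest_model is None:
        raise EmbeddingContractError("collection has no recorded embedding model")
    if ingest_model != query_model:
        raise EmbeddingContractError(
            f"collection was indexed with {ingest_model!r}; "
            f"refusing to query with {query_model!r}, re-index instead"
        )


# Stand-in for metadata stored alongside the collection at ingest time.
docs_metadata = {"embedding_model": "all-MiniLM-L6-v2"}

check_embedding_contract(docs_metadata, "all-MiniLM-L6-v2")  # same model: no error
```

Calling the check before every `query()` turns a silent relevance regression into a loud failure at the call site.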

Third, metadata is your only escape hatch for structure, so you pay for it at ingest. Every filterable field has to be flattened into the `metadatas` dict at `add()` time. There's no join, no reference resolution, no computed field. If your source of truth is a CMS where an article references an author document which references a team, you flatten that whole graph into strings before it ever reaches Chroma, and you re-flatten on every edit. That sync job is where a lot of "why is the agent showing stale data" bugs are born.
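
A sketch of that flattening step, assuming a hypothetical article shape in which `author` has already been resolved into a nested object with a `team` inside it. The field names, the sample values, and the `flatten_for_chroma` helper are all illustrative; the only constraint taken from Chroma is that metadata values need to be flat scalars.

```python
from datetime import datetime, timezone


def flatten_for_chroma(article: dict) -> dict:
    """Collapse a nested article into a flat, scalar-only metadata dict."""
    author = article.get("author", {})
    team = author.get("team", {})
    published = datetime.fromisoformat(article["publishedAt"])
    if published.tzinfo is None:
        # Assumption: naive timestamps in the source are UTC.
        published = published.replace(tzinfo=timezone.utc)
    return {
        "author": author.get("name", ""),
        "team": team.get("name", ""),
        "status": article.get("status", "draft"),
        "quarter": article.get("quarter", ""),
        # Stored as a unix timestamp so numeric comparisons work on it.
        "published_at": int(published.timestamp()),
    }


# Hypothetical source document with the reference graph already resolved.
article = {
    "status": "published",
    "quarter": "Q3",
    "publishedAt": "2024-10-01T00:00:00+00:00",
    "author": {"name": "Jo Example", "team": {"name": "finance-team"}},
}
metadata = flatten_for_chroma(article)
```

Every edit to the article, the author, or the team has to re-run this function and re-write the record, which is the sync job described above.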

Client/server mode with a pinned embedding function

The production setup the quickstart doesn't mention: a server to run, an embedding contract to honor, and a flat metadata schema to maintain.

import os

import chromadb
from chromadb.utils import embedding_functions

# chroma run --path ./chroma_store --port 8000
client = chromadb.HttpClient(host="localhost", port=8000)

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

# This embedding fn is now a hard contract for the collection.
# A model with a different dimension fails at query time; a different
# model with the same dimension returns silent garbage.
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=openai_ef,
)

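# Placeholder body; in practice this comes from your source of truth
article_body = "Q3 pricing went up 8%"

# Every filterable field must be flattened at write time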
collection.add(
    documents=[article_body],
    metadatas=[{
        "author": "finance-team",
        "quarter": "Q3",
        "status": "published",
        "published_at": 1727740800,  # unix ts for range filters
    }],
    ids=["article-123"],
)
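
With `published_at` stored as a number, a range plus an exact-match scope can be pushed into `where` instead of being filtered after the fact. The sketch below only builds the filter dict; it assumes the flat metadata keys used above and relies on Chroma's `$and`, `$eq`, and `$gte` operators. The `published_since` helper is a hypothetical name.

```python
import time
from typing import Optional


def published_since(author: str, days: int, now: Optional[float] = None) -> dict:
    """Build a `where` filter: published docs by `author` from the last `days` days."""
    now = time.time() if now is None else now
    cutoff = int(now - days * 86400)
    return {
        "$and": [
            {"author": {"$eq": author}},
            {"status": {"$eq": "published"}},
            {"published_at": {"$gte": cutoff}},
        ]
    }


where = published_since("finance-team", days=30)
# Passed alongside the semantic query, for example:
# collection.query(query_texts=["enterprise pricing"], where=where, n_results=5)
```

This works only because all three fields were flattened into scalars at write time; a predicate on anything that was not stored still has no place to run.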

The fix: stop choosing between semantic and structural retrieval

The real fix is to stop pretending you have to pick one. The query "the Q3 pricing doc about enterprise tiers" is a *structural filter* (quarter == Q3, status == published) plus a *semantic rank* (about enterprise pricing). The retrieval that works runs both in one pass: predicates narrow the candidate set to documents that are structurally correct, then similarity orders what's left.
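
The two-stage shape is small enough to sketch without any particular store. The example below is a toy in-memory version: the `Doc` records are hypothetical and the hand-written two-dimensional vectors stand in for real embeddings. It illustrates the order of the steps, not how Chroma or Sanity implement them.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    quarter: str
    status: str
    vector: list  # toy embedding; real ones come from a model


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(docs, query_vector, *, quarter, status="published", k=5):
    # 1. Structural predicates decide which documents are eligible at all.
    candidates = [d for d in docs if d.quarter == quarter and d.status == status]
    # 2. Similarity only orders the documents that survived the filter.
    ranked = sorted(candidates, key=lambda d: cosine(d.vector, query_vector), reverse=True)
    return ranked[:k]


docs = [
    Doc("q2-pricing", "Q2", "published", [0.99, 0.10]),
    Doc("q3-pricing", "Q3", "published", [0.98, 0.12]),
    Doc("q3-draft", "Q3", "draft", [0.98, 0.11]),
    Doc("q3-hiring", "Q3", "published", [0.10, 0.99]),
]
hits = retrieve(docs, [1.0, 0.1], quarter="Q3")
print([d.id for d in hits])  # ['q3-pricing', 'q3-hiring']
```

Ranked by similarity alone, `q2-pricing` and `q3-draft` would compete for the top slot; with the predicates applied first, neither is ever a candidate.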

For a structured corpus (anything with a schema, like a content catalog, articles, or products), this is where Sanity Context fits under your agent. Instead of flattening your content graph into a Chroma metadata dict and keeping a sync job alive, you query the content where it already lives, with the structure intact. A GROQ query expresses the structural predicate as a real filter inside `*[ ... ]` and the semantic component as a score, in one query, against content that's still typed and still references its related documents.

Worth being precise about the typical case, because it's usually misframed: most retrieval against Sanity Context is *structured* (GROQ filters and schema lookups), not vector search. Embeddings are opt-in and off by default; plenty of production agents never turn them on because the structural query already returns the right document. Semantic similarity is the layer you add when the structural filter narrows the set but you still need to rank within it. The discipline is hybrid: predicates first, BM25 keyword match second, and embeddings only where the failures justify the cost.

Structural predicate + semantic rank in one GROQ query

The structural filter (status, quarter) lives in the predicate; BM25 and semantic similarity combine in score(); the author reference resolves inline, with no flattening and no sync job.

*[_type == "article"
  && status == "published"
  && quarter == "Q3"
  && pt::text(body) match text::query($queryText)
]
| score(
    // keyword (BM25) relevance, weighted above semantic similarity
    boost(pt::text(body) match text::query($queryText), 2),
    text::semanticSimilarity($queryText)
  )
| order(_score desc)[0...5]{
    title,
    "author": author->name,
    publishedAt,
    _score
  }

Wiring it into your agent: MCP first, custom tools second

You don't rewrite your agent loop to get this. Sanity Context ships a hosted, read-only Context MCP endpoint, the fastest way in. Your agent attaches it as an MCP server and gets schema-aware retrieval tools without you hand-writing query plumbing. Because it's read-only, the agent can search, fetch, and resolve references but can't mutate your content; writes go through a separate path (Agent Actions), which is the boundary you want around a corpus humans are editing.

If you want full control of the query (say you're co-locating semantic and structural retrieval exactly as above), the second path is a thin custom tool that runs a typed GROQ query through `next-sanity` or `@sanity/client`. Same content; you just own the query string. Either way the integration is additive: Chroma can keep serving the corpora it's good at while the structured, governed content moves to a layer that speaks its native shape.

This is also the routing rule worth internalizing. Structured content with a schema (your catalog, your articles) goes to GROQ retrieval. Genuinely unstructured sources (PDFs, scraped sites, support exports) are what Knowledge Bases ingest and order into queryable documents. And a high-volume, machine-generated corpus that never needs editorial review or human-previewable releases? That's exactly where keeping a dedicated vector store like Chroma still makes sense. Not everything belongs in a content layer.

Attach Context MCP, or run a typed GROQ tool yourself

MCP for the default integration; a typed GROQ tool when you want to write the query yourself.

import { experimental_createMCPClient } from "ai";
import { createClient } from "next-sanity";

// Path 1: the fastest way in — attach the hosted MCP endpoint
const mcp = await experimental_createMCPClient({
  transport: {
    type: "sse",
    url: "https://mcp.sanity.io/<projectId>/<dataset>",
  },
});
const tools = await mcp.tools(); // schema-aware, read-only

// Path 2: full query control via a thin custom tool
const sanity = createClient({
  projectId: process.env.SANITY_PROJECT_ID!,
  dataset: "production",
  apiVersion: "2024-10-01",
  useCdn: false,
});

async function searchArticles(queryText: string) {
  return sanity.fetch(
    `*[_type == "article" && status == "published"
       && pt::text(body) match text::query($queryText)]
     | score(text::semanticSimilarity($queryText))
     | order(_score desc)[0...5]{ title, "author": author->name }`,
    { queryText }
  );
}

Deciding the split: a routing rule you can defend

You don't have to migrate off Chroma to fix the structural-retrieval problem, and you shouldn't try to. The useful move is to route each corpus to the system that matches its shape, then debug retrieval like the failure it actually is.

Keep in Chroma: high-volume, machine-generated, genuinely unstructured vectors where the only question is similarity and where no human edits, approves, or previews the content. Log streams, embedding caches, dedup indexes. Chroma's local-first ergonomics and low operational floor are a real advantage here, and dragging that data into a governed content layer would just add friction.

Move to Sanity Context: anything editorial. Content with a schema, references between documents, a publication state, an author, a release that should be previewable before it goes live. The moment correctness depends on a structural predicate, and the moment a human needs to edit, version, or govern what the agent retrieves, you've outgrown what an embedding store was built to do.
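
The split above can be written down as a small routing function so the decision is explicit and reviewable. This is a sketch of the rule as stated here, not an API of either product: the `Corpus` fields, the `route` helper, and the returned labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    name: str
    has_schema: bool                   # typed fields, references between documents
    human_governed: bool               # someone edits, approves, or previews it
    needs_structural_predicates: bool  # correctness depends on date, status, variant


def route(corpus: Corpus) -> str:
    """Return which retrieval layer a corpus belongs to under the rule above."""
    if corpus.has_schema or corpus.human_governed or corpus.needs_structural_predicates:
        return "sanity-context"
    # Machine-generated, similarity-only data stays in the vector store.
    return "chroma"


corpora = [
    Corpus("log-stream", False, False, False),
    Corpus("articles", True, True, True),
]
plan = {c.name: route(c) for c in corpora}
```

Keeping the rule in code makes it easy to revisit when a corpus changes shape, for example when a machine-generated feed starts needing editorial review.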

One last operational note for whichever side you land on: when an agent gives a wrong answer in production, the cause is far more often a bad retrieval than a bad model. Log what the agent *saw* (the query you ran and the documents it got back), not just what it said. A traced GROQ query and its result, or a logged Chroma `query()` and its hits, is the single most useful artifact when you're staring at a hallucination at 2am.

💡

Trace the retrieval first

Before you swap models or tune a prompt, log the exact retrieval: the query text, the filter, and the documents returned with their scores. Most production hallucinations trace back to the agent confidently reasoning over the wrong document, which is a retrieval bug, not a generation bug.
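
A minimal sketch of such a trace, using only the standard library. The `trace_retrieval` helper and the record shape are hypothetical, and each hit is assumed to be a dict with `id` and `score` keys; adapt the mapping to what your store actually returns (Chroma's `query()` gives parallel lists of ids and distances, a GROQ query gives whatever projection you wrote). The score in the example call is a placeholder, not a measured value.

```python
import json
import logging
import time

logger = logging.getLogger("retrieval")


def trace_retrieval(query_text, filters, hits):
    """Log what the agent saw: the query, the filter, and the returned documents."""
    record = {
        "ts": time.time(),
        "query_text": query_text,
        "filters": filters,
        "hits": [{"id": h["id"], "score": h["score"]} for h in hits],
    }
    # One JSON object per line keeps the trace easy to grep and replay.
    logger.info(json.dumps(record))
    return record


record = trace_retrieval(
    "how did pricing change last quarter",
    {"quarter": "Q3"},
    [{"id": "doc-q3", "score": 0.12}],  # placeholder score
)
```

Emitting the same record for both the Chroma path and the GROQ path means a wrong answer can be compared across backends with the same tooling.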