When to use Weaviate vs Sanity Context for AI retrieval

Your Weaviate hybrid search returns the right document in the demo and the wrong one in production. The user asks "Q4 pricing for the enterprise plan" and you get a marketing page about pricing philosophy, because the embedding matched "pricing" with high cosine similarity and the `alpha` knob you tuned weeks ago now favors vectors over keywords. Nothing crashed. The score just lied.

That failure has a structural cause. Sanity Context exposes a CMS-native query path via Context MCP where GROQ predicates, BM25, and optional vectors compose in a single call, so `status == "published"` is a hard filter, not a similarity hint.

Hybrid search in Weaviate is real and good. `Get` with `hybrid` fuses BM25 and vector results, and `alpha` lets you slide between them. But hybrid in this sense means "two flavors of fuzzy ranking blended together." It does not mean the query understands that "Q4" is a date range, "enterprise" is a plan variant, and "pricing" should only match documents whose `status == "published"`. Those are structural facts. Embeddings approximate them; they don't enforce them.

This article is about that gap. We'll look at how Weaviate hybrid scoring actually works, where the `alpha` parameter stops helping, and when the failing query is really a structured-retrieval problem wearing a semantic-search costume. That last case is where a CMS-native hybrid (structural predicates plus BM25 plus optional vectors, in one query) gets you the answer the vector DB couldn't.

How Weaviate hybrid search actually scores a result

Weaviate's `hybrid` operator runs two retrievals and fuses them. The vector side does an ANN search over your embeddings; the keyword side runs BM25 over the inverted index. Then a fusion algorithm (`relativeScoreFusion`, introduced in v1.20 and the default since v1.24; `rankedFusion` before that) combines the two ranked lists into a single `_additional.score`. The `alpha` parameter weights the blend: `alpha=1` is pure vector, `alpha=0` is pure keyword, `alpha=0.5` is even.
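
To make the blend concrete, here is a minimal sketch of relative score fusion, assuming each list is min-max normalized and the two are then combined with an `alpha`-weighted sum. It is a simplified model for illustration, not Weaviate's implementation; the `Scored` type, the `normalize` and `fuseRelative` helpers, and every sample score are made up.

```typescript
// Illustrative only: a simplified model of relative score fusion,
// not Weaviate's actual implementation. All names and scores are hypothetical.
type Scored = { id: string; score: number };

// Min-max normalize one result list so its scores fall in [0, 1].
function normalize(results: Scored[]): Record<string, number> {
  const out: Record<string, number> = {};
  if (results.length === 0) return out;
  let min = Infinity;
  let max = -Infinity;
  for (const r of results) {
    min = Math.min(min, r.score);
    max = Math.max(max, r.score);
  }
  for (const r of results) {
    out[r.id] = max === min ? 1 : (r.score - min) / (max - min);
  }
  return out;
}

// alpha = 1 keeps only the vector side, alpha = 0 only the keyword side.
function fuseRelative(vector: Scored[], keyword: Scored[], alpha: number): Scored[] {
  const v = normalize(vector);
  const k = normalize(keyword);
  const ids = Object.keys({ ...v, ...k });
  return ids
    .map((id) => ({ id, score: alpha * (v[id] ?? 0) + (1 - alpha) * (k[id] ?? 0) }))
    .sort((a, b) => b.score - a.score);
}

const vectorHits: Scored[] = [
  { id: 'pricing-philosophy', score: 0.92 },
  { id: 'enterprise-q4-pricing', score: 0.7 },
  { id: 'company-blog', score: 0.6 },
];
const keywordHits: Scored[] = [
  { id: 'enterprise-q4-pricing', score: 7.1 },
  { id: 'pricing-philosophy', score: 3.0 },
  { id: 'billing-faq', score: 2.0 },
];

const vectorLeaning = fuseRelative(vectorHits, keywordHits, 0.75);
const keywordLeaning = fuseRelative(vectorHits, keywordHits, 0.25);
console.log(vectorLeaning[0].id, keywordLeaning[0].id);
// prints "pricing-philosophy enterprise-q4-pricing"
```

The same two result lists produce a different winner depending on `alpha`, and at no point does either input know that "enterprise" names a plan or that "Q4" is a date range.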

The important detail is what each side is blind to. BM25 sees tokens; it has no idea that two differently-worded sentences mean the same thing. The vector side sees meaning; it has no idea that 'enterprise' is a discrete plan name and not just a word that's semantically near 'business' and 'corporate'. Fusion averages these two blindnesses. It does not add a third capability that resolves structure.

So when your query has a structural component (a date, an author, a status flag, a product variant), neither side handles it. You either pre-filter with a `where` clause (which Weaviate does support, and you should use) or you hope the embedding leaked enough of that structure to rank correctly. The second path is where the silent wrong answers come from.

A typical Weaviate hybrid query in the TS client

Hybrid blends BM25 and vector scores via alpha. Note there's no structural understanding of 'Q4' or 'enterprise' anywhere in this call.

import weaviate from 'weaviate-client'

const client = await weaviate.connectToWeaviateCloud(process.env.WCD_URL!, {
  authCredentials: new weaviate.ApiKey(process.env.WCD_API_KEY!),
})

const articles = client.collections.get('Article')

const result = await articles.query.hybrid('Q4 pricing for the enterprise plan', {
  alpha: 0.5, // 0 = keyword only, 1 = vector only
  limit: 5,
  returnMetadata: ['score', 'explainScore'],
})

for (const obj of result.objects) {
  console.log(obj.metadata?.score, obj.properties.title)
}

Why tuning `alpha` is the wrong lever for structural queries

The first instinct when hybrid returns the wrong doc is to reach for `alpha`. Bump it toward keyword because 'enterprise' should match literally; bump it toward vector because the user phrased the question loosely. Both moves help some queries and break others. You are tuning one global scalar against a query distribution that has multiple distinct shapes, and there is no single value that's right for all of them.

Weaviate gives you a real escape hatch: the `where` filter. Move the structural part of the query out of the fuzzy ranking entirely and into a hard predicate. 'Published only' is `status == 'published'`. 'Q4' is a `publishedAt` range. 'Enterprise plan' is `planVariant == 'enterprise'`. Filter first, then let hybrid rank what survives. This is the single highest-leverage fix and most teams under-use it because the embedding 'mostly works' until it doesn't.

The limitation you'll hit next is that the filter and the document live in two systems. Your structured facts (status, variant, publish date, author) usually originate in a CMS or an app database. To filter in Weaviate you have to copy those fields into the Weaviate object at ingest time, then keep them in sync as editors change them. The filter is only as fresh as your last sync job. When an editor unpublishes a doc and your sync runs hourly, your agent serves a retracted document for up to an hour, and the score looks perfect.

Moving structure out of the ranker and into a where filter

Hard predicates resolve structure that alpha can't. The catch: status/planVariant/publishedAt all had to be synced into Weaviate from wherever they actually live.

import { Filters } from 'weaviate-client'

const result = await articles.query.hybrid('pricing for the enterprise plan', {
  alpha: 0.5,
  limit: 5,
  filters: Filters.and(
    articles.filter.byProperty('status').equal('published'),
    articles.filter.byProperty('planVariant').equal('enterprise'),
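    // Q4 is a bounded range: inclusive start, exclusive end.
    articles.filter.byProperty('publishedAt').greaterOrEqual(new Date('2026-10-01')),
    articles.filter.byProperty('publishedAt').lessThan(new Date('2027-01-01')),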
  ),
  returnMetadata: ['score'],
})
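
Turning "Q4" into a predicate means computing both ends of the range, not just a lower bound. A minimal sketch of that step, assuming calendar quarters in UTC (a fiscal calendar would need different boundaries); `DateRange` and `quarterRange` are hypothetical helper names, not part of any client library:

```typescript
// Hypothetical helper: maps a calendar quarter to a half-open UTC date range.
// Assumes calendar quarters; fiscal quarters would need different boundaries.
type DateRange = { start: Date; end: Date }; // start <= t < end

function quarterRange(year: number, quarter: 1 | 2 | 3 | 4): DateRange {
  const startMonth = (quarter - 1) * 3; // 0-based month index
  return {
    start: new Date(Date.UTC(year, startMonth, 1)),
    // Month 12 rolls over to January of the following year.
    end: new Date(Date.UTC(year, startMonth + 3, 1)),
  };
}

const q4 = quarterRange(2026, 4);
console.log(q4.start.toISOString(), q4.end.toISOString());
// prints "2026-10-01T00:00:00.000Z 2027-01-01T00:00:00.000Z"
```

The two dates then feed a pair of range filters on `publishedAt`: greater than or equal to `q4.start`, and less than `q4.end`.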

The hidden cost: keeping the inverted index in sync with the source of truth

Every structural field you filter on in Weaviate is a field you've duplicated from somewhere else. The vector and the BM25 index are derived artifacts. The source of truth for 'is this published', 'what plan is this', 'who wrote it, when' is almost always your content system, not the vector store.

That duplication creates a class of bug that's hard to see in traces because nothing errors. An editor fixes a price typo. The embedding barely changes (same topic, same words), so your nightly re-embed skips the document or leaves its rank unchanged. The corrected number never reaches the index. Or a doc is unpublished but the delete event gets dropped, and it keeps ranking. These are correctness failures that hybrid scoring will never surface, because the score is computed over stale-but-confident data.

The production discipline here is to treat the vector DB as a cache, not a database. It's excellent for high-volume, machine-generated, or genuinely unstructured corpora where there's no human editorial loop. It's a liability when the content has governance (approvals, publish state, versioning), because the governance lives upstream and the index doesn't know about it.
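
Treating the index as a cache implies checking it against the source now and then. Here is a minimal sketch of such a drift audit, assuming you can list documents from both sides and that each indexed copy records the source timestamp it was built from. `SourceDoc`, `IndexedDoc`, `Drift`, `auditIndex`, and the `sourceUpdatedAt` field are hypothetical names for illustration:

```typescript
// Hypothetical drift audit: compares indexed copies against the source of truth.
type SourceDoc = { id: string; status: 'draft' | 'published'; updatedAt: string };
type IndexedDoc = { id: string; sourceUpdatedAt: string };
type Drift = { id: string; reason: 'orphaned' | 'unpublished' | 'stale' };

function auditIndex(source: SourceDoc[], indexed: IndexedDoc[]): Drift[] {
  const byId: Record<string, SourceDoc> = {};
  for (const doc of source) byId[doc.id] = doc;

  const drift: Drift[] = [];
  for (const copy of indexed) {
    const live = byId[copy.id];
    if (!live) {
      // Deleted upstream, but the delete never reached the index.
      drift.push({ id: copy.id, reason: 'orphaned' });
    } else if (live.status !== 'published') {
      // Unpublished upstream, still retrievable from the index.
      drift.push({ id: copy.id, reason: 'unpublished' });
    } else if (Date.parse(live.updatedAt) > Date.parse(copy.sourceUpdatedAt)) {
      // Edited upstream after the copy was built.
      drift.push({ id: copy.id, reason: 'stale' });
    }
  }
  return drift;
}

const sourceDocs: SourceDoc[] = [
  { id: 'a', status: 'published', updatedAt: '2026-10-02T09:00:00Z' },
  { id: 'b', status: 'draft', updatedAt: '2026-10-01T09:00:00Z' },
  { id: 'd', status: 'published', updatedAt: '2026-09-01T00:00:00Z' },
];
const indexedDocs: IndexedDoc[] = [
  { id: 'a', sourceUpdatedAt: '2026-10-01T09:00:00Z' },
  { id: 'b', sourceUpdatedAt: '2026-10-01T09:00:00Z' },
  { id: 'c', sourceUpdatedAt: '2026-08-15T00:00:00Z' },
  { id: 'd', sourceUpdatedAt: '2026-09-01T00:00:00Z' },
];

const drift = auditIndex(sourceDocs, indexedDocs);
console.log(drift.map((d) => `${d.id}:${d.reason}`).join(', '));
// prints "a:stale, b:unpublished, c:orphaned"
```

An audit like this only narrows the window in which stale copies are served; it does not close it, and every field it checks still had to be duplicated.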

⚠️

The sync gap is invisible in your traces

A stale Weaviate object returns a high score and a clean response. No exception, no timeout, no log line. The only signal is a user reading a retracted or wrong-priced document. If your content has an editorial publish/unpublish workflow, the freshness of your filter is bounded by your sync cadence, and that bound is the thing your observability dashboard won't show you.

Reframing: this is a structured-retrieval problem wearing a semantic costume

Step back and look at the failing query again: 'Q4 pricing for the enterprise plan.' How much of that is actually semantic? 'Pricing' is the only genuinely fuzzy token: the user might say 'cost', 'rates', or 'what it runs'. Everything else is structure: a quarter, a plan variant, an implicit publish state. The query is maybe 20% semantic and 80% predicate, but you routed 100% of it through a system whose core competency is the 20%.

Production data on agent retrieval bears this out. Across real deployments the heavy majority of retrieval calls are structured (filters and field lookups against well-modeled content), with semantic search a small slice reserved for the queries that genuinely need it. Vector search and 'RAG' are not synonyms for good retrieval; they're one ingredient. Hybrid is the right instinct, but the right hybrid is 'structural predicates + keyword + optional vectors,' not 'two kinds of fuzzy.'

The practical consequence: if your content is already modeled (articles with fields, products with variants, docs with a publish state), the cheapest correctness win is to run the structural part of retrieval against the modeled source directly, where the predicates are exact and always current, and reach for embeddings only on the genuinely fuzzy residue.
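
As a rough illustration of that split, the sketch below pulls the structural tokens out of the example query and leaves only the fuzzy residue for ranking. It is deliberately naive: the plan list, the stop words, the default `status` of `published`, and the `splitQuery` function are all assumptions made up for this example, and a real agent would more likely have the model emit the predicates as tool arguments.

```typescript
// Hypothetical, deliberately naive query splitter for illustration only.
type StructuredQuery = {
  quarter?: 1 | 2 | 3 | 4;
  planVariant?: string;
  status: 'published'; // assumed default: end users only see published content
  residue: string; // what is left for keyword or semantic ranking
};

const PLAN_VARIANTS = ['free', 'team', 'enterprise']; // assumed plan names
const STOP_WORDS = ['for', 'the', 'plan', 'a', 'an', 'of', 'in'];

function splitQuery(query: string): StructuredQuery {
  const out: StructuredQuery = { status: 'published', residue: '' };
  const residue: string[] = [];
  const tokens = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  for (const token of tokens) {
    const quarter = /^q([1-4])$/.exec(token);
    if (quarter) {
      out.quarter = Number(quarter[1]) as 1 | 2 | 3 | 4;
    } else if (PLAN_VARIANTS.indexOf(token) !== -1) {
      out.planVariant = token;
    } else if (STOP_WORDS.indexOf(token) === -1) {
      residue.push(token);
    }
  }
  out.residue = residue.join(' ');
  return out;
}

const parsed = splitQuery('Q4 pricing for the enterprise plan');
console.log(JSON.stringify(parsed));
// prints {"status":"published","residue":"pricing","quarter":4,"planVariant":"enterprise"}
```

Three of the four extracted fields are exact predicates; only `residue` needs a ranker at all.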

ℹ️

Hybrid means more than vector + keyword

The discipline that holds up in production is structural predicates first (date ranges, variants, publish state: exact, not approximate), BM25 for keyword precision, and embeddings as an opt-in layer for the queries that actually need semantic recall. Most modeled-content retrieval never needs the embeddings at all.

Running the structural half of hybrid where the content actually lives

If the structured side of your query is the part that keeps failing, run it against the system that owns the structure. Sanity Context exposes your modeled content as schema-aware retrieval, and the fastest way to wire it into an agent is the hosted Context MCP endpoint, a read-only MCP server your loop attaches the same way it attaches any other tool. The agent gets typed query tools against live content, so 'published only', 'Q4', and 'enterprise variant' are exact predicates evaluated against the source of truth, not stale copies in an index.

The retrieval language is GROQ, and it does hybrid the way the previous section described: structural predicates as hard filters inside `*[ ... ]`, BM25 keyword matching, and, only when you opt in, semantic similarity, all fused by a single `score()` pipeline. Embeddings are off by default; most projects running on Context MCP never turn them on because the structural layer already answers the query.

The operational difference from the sync model: there's no index to keep fresh. When an editor unpublishes a doc, the predicate stops matching on the next query. There's no nightly re-embed, no dropped delete event, no hour-long window of serving retracted content.

The same hybrid, structure-first, as a GROQ query

Structural filters live inside the *[ ... ] predicate (exact, always current). BM25 via text::query and optional semantic similarity are fused by score(). semanticSimilarity takes the query TEXT, not an embedding field.

*[_type == "article"
  && status == "published"
  && planVariant == "enterprise"
  && publishedAt >= "2026-10-01" && publishedAt < "2027-01-01"
] | score(
  boost(title match text::query($queryText), 3),
  body match text::query($queryText),
  boost(text::semanticSimilarity($queryText) > 0.7, 2)
) | order(_score desc) [0...5] {
  _id, title, publishedAt, "snippet": body[0...200]
}

Where to draw the line: Weaviate, Knowledge Bases, or both

None of this means rip out Weaviate. It means route by content shape. Three cases:

Structured, governed content (your articles, products, and docs with a schema and an editorial workflow) should be retrieved structure-first against the source. That's where the predicate-heavy queries live and where stale indexes hurt most. Sanity Context's GROQ retrieval handles this, and via Context MCP your agent calls it as a tool with no sync job in between.

Unstructured 'messy' content (PDFs, scraped websites, support-ticket exports) has no schema to filter on, so the structure-first argument doesn't apply. Sanity Context's Knowledge Bases turns those sources into well-ordered documents with a table of contents you can retrieve against, which is the right path when there's nothing to model.

High-volume, machine-generated, or ephemeral corpora that need no editorial governance (logs, telemetry summaries, large embedding-native datasets) are exactly where a dedicated vector DB like Weaviate stays the right tool. There's no publish state to sync because there are no editors. Keep Weaviate for that, push governed structured content to structure-first retrieval, and you stop fighting `alpha` to compensate for a query that was never semantic in the first place.

The debugging payoff: when an answer is wrong, log the actual retrieval (the GROQ query or the MCP tool call, plus its result) next to the response. Most production agent failures trace to a bad retrieval, not a bad model, and a structural query you can read is far easier to debug than a fused score you have to reverse-engineer.
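
One way to do that is to build a single record per agent turn that holds the retrieval call, the IDs it returned, and the final answer. A minimal sketch, where `RetrievalCall`, `TraceRecord`, `traceRetrieval`, and the `groq_query` tool name are hypothetical and the logging sink is left to whatever your stack already uses:

```typescript
// Hypothetical trace record: keeps the readable query next to the answer it produced.
type RetrievalCall = { tool: string; query: string; params: Record<string, unknown> };
type TraceRecord = RetrievalCall & {
  resultCount: number;
  resultIds: string[];
  answer: string;
};

function traceRetrieval(
  call: RetrievalCall,
  results: { _id: string }[],
  answer: string,
): TraceRecord {
  return {
    tool: call.tool,
    query: call.query,
    params: call.params,
    resultCount: results.length,
    resultIds: results.map((r) => r._id),
    answer,
  };
}

const record = traceRetrieval(
  {
    tool: 'groq_query', // assumed tool name, for illustration
    query: '*[_type == "article" && status == "published"][0...5]{_id, title}',
    params: {},
  },
  [{ _id: 'article-1' }, { _id: 'article-2' }],
  'Enterprise pricing for Q4 is listed in article-1.',
);
console.log(JSON.stringify(record, null, 2));
```

When an answer turns out wrong, the record shows at a glance whether the right document was ever retrieved, which separates retrieval bugs from generation bugs.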

✨

No index to keep fresh

Structure-first retrieval against the live source removes the sync job entirely. Unpublish a document and the predicate stops matching on the next query: no re-embed, no dropped delete event, no window of serving retracted content. The vector DB stays in your stack for the corpora that genuinely benefit from it.