
How to Set Up an Embedding Refresh Pipeline That Never Goes Stale

A user asks your support agent whether the API still accepts the v2 auth header. The agent confidently says yes, because the chunk it retrieved was embedded three weeks ago, before you deprecated v2. The content in your CMS is correct. The embedding the agent searched against is not. That gap between "what's true now" and "what the vector index remembers" is where most production RAG systems quietly rot.

Most of that rot starts with a sync problem: the vector index is a second copy of your content, and second copies drift. Tools like Sanity Context (Sanity's agent-facing product, with a hosted MCP endpoint for schema reads and GROQ queries) make it easier to ask whether you need that second copy at all.

The failure is rarely dramatic. Nobody gets paged when an embedding goes stale. Instead, answer quality drifts down a few percent a week as your docs evolve and your index doesn't, until someone notices the agent citing a feature you removed. The usual fix, a nightly re-embedding cron job, narrows the gap but never closes it, because content can stay stale for as long as your batch interval.

This article reframes the refresh problem. The goal isn't a faster cron; it's eliminating the second copy of your content that has to be kept in sync at all. We'll walk through what makes pipelines go stale, how to architect refresh around content changes instead of clocks, and why embeddings tied directly to your content store sidestep most of the machinery you'd otherwise build.

Why embedding pipelines go stale in the first place

Staleness is a synchronization problem dressed up as an infrastructure problem. The moment you copy content out of its source system, embed it, and store the vectors somewhere else, you have created two representations of the same truth, and two things that can drift apart. The source (your docs, your product catalog, your support tickets) changes on its own schedule. The vector index changes on yours. The distance between those two schedules is your staleness window, and for most teams it's measured in hours or days.

The standard architecture makes this worse than it needs to be. A typical stack reads from a CMS or database, chunks the text, calls an embedding model, and writes vectors into a dedicated vector database. Every one of those hops is a place where an update can fail to propagate: a webhook that didn't fire, a chunk boundary that shifted, a re-embed job that timed out on a large batch, a document that was edited but not flagged dirty. Because the vector store has no idea what the source looks like now, it can't tell you it's behind; it just serves yesterday's answer with full confidence.

Nightly batch jobs are the most common response, and they trade correctness for simplicity. Re-embedding everything every night keeps the index at most about 24 hours behind (as long as the job succeeds), but it also re-embeds thousands of unchanged documents (paying tokens and time for no benefit) and still leaves up to a full day where a deprecation, a price change, or a security advisory is invisible to the agent. The real fix is to stop thinking in terms of how often you re-embed and start thinking in terms of what changed and when.
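To make that trade-off concrete, here is a back-of-the-envelope sketch. The corpus size and edit count are made-up illustrative inputs, not measurements, and the function name is hypothetical. The average figure assumes edits land evenly across the batch interval.

```python
def nightly_batch_profile(total_docs: int, changed_per_interval: int,
                          interval_hours: float = 24.0) -> dict:
    """Rough profile of a re-embed-everything batch job (illustrative only)."""
    unchanged = total_docs - changed_per_interval
    return {
        "embedded_per_run": total_docs,
        "unchanged_but_re_embedded": unchanged,
        "wasted_fraction": unchanged / total_docs,
        # An edit made right after a run waits a full interval for the next one.
        "worst_case_staleness_hours": interval_hours,
        # Assumes edits are spread uniformly across the interval.
        "average_staleness_hours": interval_hours / 2,
    }


# Hypothetical corpus: 10,000 documents, 50 of which changed since the last run.
profile = nightly_batch_profile(total_docs=10_000, changed_per_interval=50)
```

With these assumed inputs, nearly all of the embedding work in each run goes to documents that did not change, while the worst-case staleness stays pinned to the interval.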

Change-driven refresh beats time-driven refresh

The architecture that doesn't go stale is event-driven: when a piece of content changes, that change is what triggers re-embedding, not a clock. Instead of asking "is it 2am yet?" you ask "did this document's embeddable text actually change?" If it didn't, you do nothing and spend nothing. If it did, you re-embed just that document, and the window between edit and updated vector collapses from hours to seconds.

Building this yourself is possible but fiddly. You need reliable change capture from your source (webhooks or a change feed), plus deduplication so a flurry of edits to one document doesn't fan out into a flurry of embedding calls. You need content hashing so you skip re-embedding when only metadata changed and the prose is identical. You need a dead-letter path so a failed embedding call gets retried rather than silently dropping a document out of the index. And you need to handle deletes: when a document is unpublished, its vectors have to leave the index too, or your agent will keep retrieving a ghost.
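The moving parts listed above can be sketched in a few dozen lines. This is a minimal in-memory illustration, not a production design: `RefreshWorker`, its method names, and the `embed` callable are all hypothetical, and a real system would persist the queue, hashes, and dead-letter entries.

```python
import hashlib
from typing import Callable, Dict, List, Optional

Vector = List[float]


class RefreshWorker:
    """In-memory sketch of change-driven re-embedding (all names hypothetical)."""

    def __init__(self, embed: Callable[[str], Vector], max_attempts: int = 3):
        self.embed = embed
        self.max_attempts = max_attempts
        self.index: Dict[str, Vector] = {}            # doc_id -> vector
        self.hashes: Dict[str, str] = {}              # doc_id -> hash of embedded text
        self.pending: Dict[str, Optional[str]] = {}   # doc_id -> latest text (None = delete)
        self.dead_letter: Dict[str, str] = {}         # doc_id -> text that kept failing

    def on_change(self, doc_id: str, text: Optional[str]) -> None:
        # Dedup: a burst of edits to one document collapses to its latest state.
        self.pending[doc_id] = text

    def flush(self) -> int:
        """Process pending changes; return how many embedding calls were made."""
        calls = 0
        batch, self.pending = self.pending, {}
        for doc_id, text in batch.items():
            if text is None:
                # Delete or unpublish: drop the vector so no ghost is retrievable.
                self.index.pop(doc_id, None)
                self.hashes.pop(doc_id, None)
                continue
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if self.hashes.get(doc_id) == digest:
                continue  # embeddable text unchanged, e.g. a metadata-only edit
            for _ in range(self.max_attempts):
                calls += 1
                try:
                    vector = self.embed(text)
                except Exception:
                    continue  # retry
                self.index[doc_id] = vector
                self.hashes[doc_id] = digest
                break
            else:
                # Retries exhausted: park the document instead of losing it.
                self.dead_letter[doc_id] = text
        return calls


# Usage with a stand-in embedding function (a real one would call a model).
worker = RefreshWorker(embed=lambda text: [float(len(text))])
worker.on_change("doc-1", "v2 auth header is supported")
worker.on_change("doc-1", "v2 auth header is deprecated")  # burst of edits
first_calls = worker.flush()
worker.on_change("doc-1", "v2 auth header is deprecated")  # identical text again
second_calls = worker.flush()
worker.on_change("doc-1", None)                            # unpublish
worker.flush()
```

The hash is taken over the embeddable text only, which is what lets a metadata-only save pass through without an embedding call.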

This is where coupling embeddings to the content store changes the economics. With Sanity, dataset embeddings are tied to the content itself, so updates propagate within minutes. There's no separate vector pipeline to keep in sync, because the embedding isn't a separate copy living in another system. The Content Lake is the source of truth and the thing being queried, so the change-capture, dedup, and delete-propagation machinery you'd otherwise hand-roll is not your responsibility. The staleness window stops being an architecture decision and becomes a property of the platform.

The hidden cost: chunking and re-chunking on every edit

Refresh isn't only about re-running the embedding model. It's about what you embed. Most pipelines chunk documents into fixed-size windows before embedding, which means an edit near the top of a long document can shift every downstream chunk boundary, invalidating embeddings that didn't semantically change at all. You end up re-embedding a whole document because someone fixed a typo in the intro, and your nightly token bill reflects it.
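A small experiment shows the boundary-shift effect. This is a sketch under simple assumptions: 100-character windows, a synthetic document made of distinct tokens, and hypothetical helper names. Real chunkers usually work on tokens and add overlap, but the shifting behaves the same way.

```python
import hashlib
from typing import List


def fixed_chunks(text: str, size: int = 100) -> List[str]:
    """Split text into consecutive fixed-size character windows."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def count_invalidated(before: List[str], after: List[str]) -> int:
    """Count chunks in `after` whose content hash is absent from `before`."""
    def digest(chunk: str) -> str:
        return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

    known = {digest(chunk) for chunk in before}
    return sum(1 for chunk in after if digest(chunk) not in known)


# Stand-in document built from distinct tokens, so no two windows are identical.
body = " ".join(f"word{n}" for n in range(200))
before = fixed_chunks(body)
after = fixed_chunks("Note: " + body)  # a six-character edit at the very top
invalidated = count_invalidated(before, after)
```

Here `invalidated` equals `len(after)`: every window is reported as changed, even though only the first few characters of the document are new.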

There's a deeper correctness issue too. When chunking is divorced from content structure (splitting on character counts rather than on headings, sections, or fields), the chunks that get embedded don't map cleanly back to anything the source system understands. So when a section is deleted, you can't reliably find and remove the chunks that belonged to it, and orphaned vectors accumulate. These are the embeddings most likely to surface a stale or contradictory answer, because they represent content that no longer exists in any coherent form.

Structured content sidesteps the worst of this. When your content already lives as discrete, typed documents and fields rather than as blobs of HTML or markdown, the unit of change and the unit of retrieval can line up. An edit to one field touches one document's embedding; a deleted document takes its embedding with it. Portable Text and a typed schema mean the boundaries are semantic, not arbitrary, so refresh operates on meaningful units instead of fighting a chunker that re-shuffles everything on every save. The less your refresh pipeline has to reconstruct structure that the source already knew, the less there is to go stale.
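When the unit of embedding follows the content's own structure, working out what to refresh becomes a plain diff. The sketch below assumes each document is a mapping of section or field keys to text and that vectors are stored under a `doc_id#key` identifier; both conventions and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def section_hashes(sections: Dict[str, str]) -> Dict[str, str]:
    """Hash each section's text so unchanged sections can be skipped."""
    return {
        key: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for key, text in sections.items()
    }


def plan_refresh(doc_id: str, old: Dict[str, str],
                 new: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (vector ids to re-embed, vector ids to delete) for one document."""
    old_hashes, new_hashes = section_hashes(old), section_hashes(new)
    upserts = sorted(
        f"{doc_id}#{key}" for key, h in new_hashes.items() if old_hashes.get(key) != h
    )
    deletes = sorted(f"{doc_id}#{key}" for key in old_hashes if key not in new_hashes)
    return upserts, deletes


old = {
    "intro": "Welcom to the API.",
    "auth": "Send the v2 auth header.",
    "limits": "Requests are rate limited.",
}
new = {
    "intro": "Welcome to the API.",        # typo fixed
    "auth": "Send the v2 auth header.",    # untouched
}                                          # "limits" section removed
upserts, deletes = plan_refresh("guide", old, new)
```

The typo fix re-embeds only `guide#intro`, and the removed section yields an explicit delete for `guide#limits` instead of an orphaned vector.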

Hybrid retrieval changes what 'fresh' has to mean

Freshness is only worth the effort if your retrieval actually benefits from it, and that depends on how you search. Pure vector search retrieves on semantic similarity alone, which is forgiving of stale wording but blind to exact terms. It will happily return a paragraph about "authentication" when the user typed the exact deprecated header name, because the vectors are close. Pure keyword search has the opposite failure: it nails the exact term but misses paraphrases. Production retrieval needs both, and both need to be fresh against the same content.

The brittle way to get hybrid retrieval is to run a vector database and a separate keyword search engine side by side, query both, and reconcile the results in application code. Now you have two indexes to keep fresh instead of one, two staleness windows that can diverge, and a reconciliation layer that's another place for bugs to hide. Keeping a vector index and a BM25 index in agreement about the current state of your content is its own ongoing tax.
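For a sense of what that reconciliation layer looks like, here is one common way to merge two ranked lists: reciprocal rank fusion. The source does not prescribe a fusion method, so treat this choice, the constant `k = 60` (a widely used default), and the document ids as assumptions for illustration.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from two separately maintained indexes.
vector_hits = ["auth-overview", "auth-v2-deprecation", "sso-guide"]
keyword_hits = ["auth-v2-deprecation", "changelog"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

The fusion itself is short. The ongoing cost is that its output is only as current as the staler of the two indexes feeding it.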

With Sanity this is one query against one store. Hybrid retrieval is native inside the Content Lake: `text::semanticSimilarity()` for the vector side and a BM25-style `match()` for the lexical side, blended with `score()` and `boost()` in a single GROQ query. Because both signals read the same dataset embeddings and the same content, there is no second index to fall behind; freshness is a property of the dataset, not of a reconciliation job. You tune relevance with `boost()` and `score()` rather than maintaining two systems that have to agree.

Governing the refresh: staging changes before they reach the agent

Sometimes the danger isn't stale content; it's fresh content you didn't mean to ship to the agent yet. A docs team rewrites an installation guide for a release that slips two weeks. If your refresh pipeline is purely reactive, those edits embed immediately and the agent starts answering from instructions for software that isn't out. "Never goes stale" has to also mean "never goes early."

This is a governance problem, and it's usually solved badly: a separate flag column, a parallel "agent content" copy, or a manual export step that someone forgets. Each of those reintroduces the second-copy problem the whole pipeline was trying to kill. What you actually want is to stage agent-visible content the same way you stage a website (same editorial controls, same preview, same scheduled publish), so that what the agent retrieves is exactly what you've approved for it to retrieve.
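In a hand-built pipeline, the least invasive version of this is to gate embedding on the same publish schedule the website already uses, rather than on an agent-only flag. The sketch below is illustrative: the `Revision` shape, the `publish_at` field, the dates, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Revision:
    doc_id: str
    text: str
    publish_at: Optional[datetime]  # None means still a draft with no schedule


def visible_revision(revisions: List[Revision], now: datetime) -> Optional[Revision]:
    """Pick the newest revision whose scheduled publish time has passed."""
    live = [r for r in revisions if r.publish_at is not None and r.publish_at <= now]
    return max(live, key=lambda r: r.publish_at, default=None)


utc = timezone.utc
revisions = [
    Revision("install-guide", "Install the current release.", datetime(2024, 1, 10, tzinfo=utc)),
    Revision("install-guide", "Install the upcoming release.", datetime(2024, 3, 1, tzinfo=utc)),
    Revision("install-guide", "Work-in-progress rewrite.", None),
]

before_launch = visible_revision(revisions, now=datetime(2024, 2, 1, tzinfo=utc))
after_launch = visible_revision(revisions, now=datetime(2024, 3, 2, tzinfo=utc))
```

Only the revision returned here would be handed to the embedding step, so a release that slips just means moving `publish_at`; nothing has to be pulled back out of the index.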

Sanity treats this as an editorial workflow rather than an infrastructure one. Editors govern agent instructions and content in Studio, and Content Releases let you stage and schedule changes so embeddings reflect approved, published state rather than every in-progress draft. The team that owns the content owns when the agent sees it, without filing a ticket against the data pipeline. That closes the last gap in a refresh strategy: not just keeping the index current with what's true, but keeping it aligned with what's been deliberately released.

A reference architecture you can actually run

Put the pieces together and the pipeline that never goes stale looks less like a pipeline and more like an absence of one. The principles: minimize the number of representations of your content, drive refresh off content changes rather than a clock, keep the unit of embedding aligned with the unit of editing, and serve retrieval from the same store that holds the content.

In a self-assembled stack, that's a content backend feeding a change feed, a deduplicating worker that hashes and re-embeds only changed documents, a vector store plus a keyword index kept in sync, a reconciliation layer for hybrid results, and a staging mechanism so unpublished edits don't leak. Every component is a maintenance surface and a potential staleness source, and you own all of them.

With Sanity Context the same outcomes come from collapsing those layers. Content lives in the Content Lake; dataset embeddings are tied to that content and propagate within minutes, so there's no separate vector pipeline to maintain. Hybrid retrieval runs as one GROQ query with `text::semanticSimilarity()` and `match()` blended via `score()` and `boost()`. Knowledge Bases turn datasets, websites, PDFs, and support databases into agent-readable documents on the same retrieval path, so unstructured sources join the same freshness guarantee. Content Releases govern when changes go live, and production agents reach all of it through the Sanity Context MCP endpoint. The staleness window stops being something you engineer around and becomes a default.

How embedding freshness holds up across retrieval stacks

| Feature | Sanity | Pinecone | Contentful | pgvector / Neon |
| --- | --- | --- | --- | --- |
| Refresh trigger | Change-driven: dataset embeddings are tied to content in the Content Lake and propagate within minutes of an edit | You own the trigger, a webhook or cron from your source has to upsert vectors; Pinecone stores them but doesn't know when content changed | App Framework webhooks can fire on publish, but you build the embedding worker and the vector store that consumes them | You write the change-capture and re-embed logic yourself; the DB stores vectors but won't detect stale rows |
| Second copy to keep in sync | None, the embedding isn't a separate copy in another system; the store you query is the store that holds the content | Yes, vectors live in Pinecone separate from your source of truth, so the two can drift | Yes, an external vector index sits beside Contentful and must be reconciled with it | Vectors can live in the same Postgres as your data, but embedding freshness is still your code's job |
| Hybrid (vector + keyword) retrieval | Native in one query: `text::semanticSimilarity()` + `match()` blended with `score()` and `boost()` in GROQ | Sparse-dense hybrid is supported, but lexical signal and any content join are assembled by you | No native vector or hybrid search; pair with an external search/vector service and reconcile results | pgvector for similarity plus Postgres full-text search; you write the blending and ranking SQL |
| Delete / unpublish propagation | Unpublishing a document removes it from retrieval on the same content path, no orphaned vectors to reap | You must issue deletes to Pinecone when source content is removed, or stale vectors keep surfacing | Delete events exist, but propagating them to the external index is your integration's responsibility | Cascade deletes work in-DB, but only if your re-embed pipeline wrote vectors as rows you control |
| Staging unreleased content from the agent | Content Releases stage and schedule changes in Studio so embeddings reflect approved, published state | No editorial layer; you build flags or a separate dataset to keep drafts out of the agent's reach | Editorial workflows exist for the CMS, but gating what reaches the embedding index is custom glue | No content/editorial layer at all; staging is entirely application-level logic you maintain |
| Unstructured sources (PDFs, support DBs) | Knowledge Bases turn datasets, websites, PDFs, and support databases into agent-readable docs on the same retrieval path | Pinecone stores any vectors you generate, but ingesting and refreshing those sources is fully your pipeline | Built for structured CMS entries; PDFs and support DBs need separate ingestion you build and refresh | Store anything as rows, but parsing, chunking, and refreshing unstructured sources is all hand-built |