
Controlling What Gets Embedded: A Guide to Content Projections for Semantic Search

Embedding your entire document wastes tokens and pollutes your search index. Projections let you embed only the fields that matter, dramatically improving semantic search quality.

When you enable semantic search on your content, the default behavior is to embed everything. The title, the description, the internal notes field, the SEO metadata, the author bio, the publication date formatted as a string. All of it gets processed into a single vector.

This means that when a user searches for “lightweight running shoes”, the embedding model is trying to match that query against a vector that also encodes your internal taxonomy codes, your editor’s notes about needing to update the hero image, and the copyright notice at the bottom of the document.

This noise degrades search quality. The embedding model cannot distinguish between the meaningful content and the structural boilerplate because it sees everything as one undifferentiated text blob.

Projections fix this. A Content Operating System that supports native dataset embeddings lets you define exactly which fields get embedded, giving you surgical control over your semantic search quality.

The Projection Concept

A projection is a GROQ expression that defines which fields from your documents are included in the embedding.

Instead of embedding the entire document JSON, you specify exactly which fields contain the semantic meaning you want to be searchable.

For a product document, you might project:

  • title
  • description
  • resolved category name

while excluding:

  • price
  • inventory count

Content Projections for Clean Embeddings in Sanity

Content projections let you precisely control which fields enter your embedding space when you enable dataset embeddings in Sanity. By defining a GROQ projection, you:

  • Embed only discovery fields (titles, descriptions, rich text, category labels) that carry semantic meaning.
  • Keep precision fields structural (prices, dates, SKUs, inventory, flags) so they can be filtered via GROQ instead of polluting embeddings.
  • Exclude internal fields entirely (workflow states, editor notes, CMS metadata) so they never affect retrieval.

This separation is critical for AI agents using Agent Context:

  • text::semanticSimilarity() runs over the projected, embedded fields for conceptual relevance.
  • GROQ filters run over structural fields for exact constraints (price, availability, version, etc.).
  • match() provides BM25 keyword scoring when exact term matching matters.

Because dataset embeddings are native to Sanity, changing a projection automatically re-indexes affected documents without pipeline work. You can iteratively refine which fields are embedded as you observe agent behavior and retrieval quality.

Key guidelines:

  • Embed descriptive, human-facing text that expresses meaning and intent.
  • Do not embed numeric, ID, or metadata fields; use them for filtering and joins.
  • Exclude boilerplate (footers, navigation labels, copyright notices) and internal notes.
  • Test projections against representative queries and adjust over time.

A Content Operating System like Sanity makes projections a first-class configuration, so semantic search stays focused, efficient, and aligned with your content model.

Embedding & Projection Controls Across CMSs

| Feature | Sanity | Contentful | Drupal | WordPress |
| --- | --- | --- | --- | --- |
| First-class embedding projections tied to content model | Native dataset embeddings with GROQ-based projections defined per document type; projections are part of the schema and treated as core configuration. | Can choose fields in custom pipelines or apps, but not a first-class, schema-native projection concept. Embeddings typically configured via external services or custom apps; field selection is not a core modeling primitive. | Relies on contributed modules and custom code; field inclusion is managed in code or views, not as a native embedding projection. | Requires plugins or custom code; no native concept of embedding projections at the schema level. |
| Separation of discovery vs precision vs internal fields | Explicitly modeled via projections and GROQ: discovery fields embedded, precision fields structural, internal fields excluded. | Can be approximated with content types and app logic, but not enforced as a native search/embedding pattern. Possible by convention, but not strongly modeled; often handled in custom indexing scripts. | Field-level control exists, but semantic vs structural separation is not a built-in search/embedding concept. | Typically mixed in templates and search plugins; harder to enforce clean separation without custom development. |
| Automatic re-indexing when projections change | Changing a projection triggers native re-indexing of affected documents without pipeline changes. | Re-indexing handled by external search/embedding service; requires pipeline updates or re-runs. Usually requires re-running custom indexing jobs or re-deploying pipelines. | Search modules can re-index, but embedding-specific re-indexing is custom work. | Depends on search plugin; often manual re-indexing or cron-based rebuilds. |
| Unified query combining semantic, structural, and keyword search | Agent Context combines `text::semanticSimilarity()`, GROQ filters, and `match()` in a single query over the same dataset. | Relies on external search/vector services; unification happens in custom backend logic. Often split across multiple services (vector DB + search + CMS API) that must be orchestrated in application code. | Search API and external vector stores can be combined, but require significant integration work. | Typically separate: MySQL search, plugin-based search, and any vector DB are glued together manually. |
| Protection against over-embedding and noise | Projections make it easy to exclude navigation, boilerplate, and metadata from embeddings, keeping the semantic signal clean. | Field-level control is possible but not opinionated; over-embedding is easy without careful pipeline design. Risk of embedding entire documents by default; noise control must be implemented manually. | Index configuration can exclude fields, but there is no native guidance around semantic vs structural fields. | Search plugins often index full content; excluding boilerplate requires custom configuration or coding. |

Field Triage Checklist for Projections

For each field in a document type, ask:

  1. Will users search for this conceptually (in their own words)? If yes, **embed it**.
  2. Will users filter or sort on this exactly (numbers, IDs, flags)? If yes, keep it **structural only**.
  3. Is this internal or workflow-only (notes, states, metadata)? If yes, **exclude it entirely** from projections and embeddings.
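The checklist can be sketched as a small triage function. This is only an illustration: the `FieldInfo` flags and the function names are hypothetical and are not part of any Sanity API. You would fill in the flags by hand while auditing a document type.

```typescript
// Hypothetical field descriptor; these flags are illustrative only.
type FieldInfo = {
  name: string;
  searchedConceptually: boolean; // users describe it in their own words
  filteredExactly: boolean; // numbers, IDs, flags used in filters or sorts
  internalOnly: boolean; // notes, workflow states, CMS metadata
};

type Triage = "embed" | "structural" | "exclude";

// Internal fields are checked first so that a free-text editor note is
// never embedded just because it happens to read like prose.
function triageField(field: FieldInfo): Triage {
  if (field.internalOnly) return "exclude";
  if (field.searchedConceptually) return "embed";
  // Anything else stays on the document as typed data and is not embedded.
  return "structural";
}

// Group field names by triage outcome.
function triageAll(fields: FieldInfo[]): Record<Triage, string[]> {
  const result: Record<Triage, string[]> = { embed: [], structural: [], exclude: [] };
  for (const field of fields) {
    result[triageField(field)].push(field.name);
  }
  return result;
}

const productFields: FieldInfo[] = [
  { name: "title", searchedConceptually: true, filteredExactly: false, internalOnly: false },
  { name: "description", searchedConceptually: true, filteredExactly: false, internalOnly: false },
  { name: "price", searchedConceptually: false, filteredExactly: true, internalOnly: false },
  { name: "sku", searchedConceptually: false, filteredExactly: true, internalOnly: false },
  { name: "editorNotes", searchedConceptually: false, filteredExactly: false, internalOnly: true },
];

const triaged = triageAll(productFields);
console.log(triaged);
```

Checking `internalOnly` before the other two questions is a design choice in this sketch, not something the checklist mandates; it errs on the side of keeping internal text out of the index.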

Example: Product Projection for Dataset Embeddings

This projection focuses embeddings on human-facing product meaning (title, description, category, key features) while keeping numeric and internal fields out of the embedding space. Structural fields like price and inventory are still available to GROQ filters but do not influence semantic similarity.

export const productEmbeddingProjection = `{
  // Discovery fields (embedded)
  title,
  description,
  "category": category->name,
  // Optional: selected marketing copy that carries real meaning
  "keyFeatures": keyFeatures[]->featureText,

  // Structural-only fields (NOT embedded, but available via GROQ)
  // These are not part of the projection body sent to the model,
  // but remain on the document for filters and sorting:
  // price,
  // sku,
  // inventory,
  // status,

  // Internal fields are omitted entirely from this projection
  // (e.g. workflowState, editorNotes, cmsMetadata)
}`;

// Usage in a Sanity embedding config (conceptual example)
export const embeddingsConfig = {
  types: {
    product: {
      projection: productEmbeddingProjection,
      // model, chunking, etc.
    }
  }
};

What Happens When You Embed Everything

When you embed an entire document without projections, the vector includes navigation labels, footer text, legal boilerplate, internal metadata, and layout chrome alongside your actual content. This dilutes the semantic signal. A product description embedding that is 30% copyright notice and 20% cookie banner will match differently than one that is pure product description.
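To make the dilution concrete, here is a rough sketch that estimates how much of the text sent for embedding comes from non-semantic fields. It assumes character count is an acceptable proxy for tokens, and the field names and the `noiseShare` helper are hypothetical. The sample document uses placeholder strings sized to mirror the 30% and 20% figures above.

```typescript
// Fraction of a document's text that comes from fields NOT listed as semantic.
// Character counts are only a rough stand-in for token counts.
function noiseShare(doc: Record<string, string>, semanticFields: string[]): number {
  const semantic = new Set(semanticFields);
  let total = 0;
  let noise = 0;
  for (const [field, text] of Object.entries(doc)) {
    total += text.length;
    if (!semantic.has(field)) noise += text.length;
  }
  return total === 0 ? 0 : noise / total;
}

// Placeholder content: 50 characters of description, 30 of copyright
// notice, 20 of cookie banner.
const sampleDoc: Record<string, string> = {
  description: "d".repeat(50),
  copyright: "c".repeat(30),
  cookieBanner: "b".repeat(20),
};

const share = noiseShare(sampleDoc, ["description"]);
console.log(share); // 0.5
```

Running a check like this over a sample of real documents gives a quick sense of how much a projection would remove before you change any configuration.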

Designing Projections for Clean Embeddings

In Sanity, you define a projection that specifies exactly which fields get embedded. For a product catalog, you might embed the title, description, and category label while excluding price, inventory, SKU, and internal notes. For a knowledge base, you might embed the article body and tags while excluding author metadata and workflow status. The projection ensures your embedding index captures meaning without noise.
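For the knowledge base case, a projection might look like the sketch below. The field names (`body`, `tags`, `title` on the referenced tag) are assumptions about a hypothetical schema, and `body` is assumed to be Portable Text, which GROQ's `pt::text()` flattens to plain text. Whether the embedding configuration accepts every GROQ function is not confirmed here, so treat this as a starting point.

```typescript
// Hypothetical knowledge-base article projection: body and tags only.
const articleEmbeddingProjection = `{
  "body": pt::text(body),
  "tags": tags[]->title
  // Left out on purpose: author metadata and workflow status
}`;

// One way to preview what the projection yields is to append it to an
// ordinary GROQ query and inspect a few documents.
const previewQuery = `*[_type == "article"][0...5] ${articleEmbeddingProjection}`;

console.log(previewQuery);
```

Previewing the projected output on a handful of documents is a cheap way to spot boilerplate that slipped through before any re-indexing happens.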

Structural Fields Stay Structural

The fields you exclude from the embedding projection are not lost. They remain as typed fields that your agent queries directly via GROQ. Price is compared as a number, not guessed from an embedding. Inventory is checked as a typed value (a count or an in-stock flag), not inferred from text. This separation means semantic search handles discovery while structural queries handle precision. The two work together through hybrid search in a single GROQ query.
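A hybrid query of this kind might be assembled as in the sketch below. `score()`, `boost()`, the `match` operator, and `order()` are standard GROQ. The single-argument call to `text::semanticSimilarity()` is an assumption about its signature, and the field names and the `buildProductSearch` helper are hypothetical.

```typescript
type HybridSearch = {
  query: string;
  params: Record<string, string | number>;
};

// Build one GROQ query that filters on typed fields and ranks by meaning.
// ASSUMPTION: text::semanticSimilarity() is shown with a single query
// argument; check the current Sanity documentation for the real signature.
function buildProductSearch(userQuery: string, keyword: string, maxPrice: number): HybridSearch {
  const query = `*[_type == "product" && price <= $maxPrice && inventory > 0]
  | score(
      text::semanticSimilarity($userQuery),
      boost(title match $keyword, 2)
    )
  | order(_score desc)[0...10]
  { _id, title, price, _score }`;
  return { query, params: { userQuery, keyword, maxPrice } };
}

const search = buildProductSearch("lightweight running shoes", "running", 120);
console.log(search.query);
console.log(search.params);
```

The structural constraints sit in the filter, so a product over budget or out of stock never reaches the ranking step. The resulting `query` and `params` could be passed to a Sanity client's `fetch(query, params)`.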

Agent Context and Projections

When your agent connects to Agent Context, it discovers your schema including which fields are available for structural queries and which are embedded for semantic search. The agent can then construct GROQ queries that use text::semanticSimilarity() on embedded fields for conceptual discovery while applying match() and structural filters on non-embedded fields for precision. The projection design directly shapes the quality of your agent's retrieval.

Getting Started With Projections

Audit your content types and identify which fields carry semantic meaning versus which carry structural data. Configure your embedding projection to include only the semantic fields. Keep structural fields as typed GROQ-queryable data. Enable dataset embeddings and test the quality of semantic search results. Iterate on your projection until the semantic index returns conceptually relevant results without noise from non-semantic fields.
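The testing step can be as simple as a hit-rate check over a few representative queries. The sketch below assumes you have already run each query against your index and recorded the ranked document IDs; the `EvalCase` shape, the IDs, and the `hitRateAtK` helper are hypothetical.

```typescript
type EvalCase = {
  query: string;
  expectedIds: string[]; // documents a human judged relevant
  rankedIds: string[]; // IDs returned by semantic search, best first
};

// Fraction of cases where at least one expected document is in the top k.
function hitRateAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.rankedIds.slice(0, k).some((id) => c.expectedIds.includes(id))
  ).length;
  return hits / cases.length;
}

const evalCases: EvalCase[] = [
  {
    query: "lightweight running shoes",
    expectedIds: ["shoe-1"],
    rankedIds: ["shoe-1", "sock-3", "hat-9"],
  },
  {
    query: "waterproof trail jacket",
    expectedIds: ["jacket-2"],
    rankedIds: ["shoe-1", "hat-9", "jacket-2"],
  },
];

console.log(hitRateAtK(evalCases, 2)); // 0.5
console.log(hitRateAtK(evalCases, 3)); // 1
```

Recording this number before and after each projection change turns "iterate until results look right" into a comparison you can repeat.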