Evaluating RAG Quality: A Practical Framework for Technical and Product Teams
Most enterprise AI initiatives stall the moment they move from a controlled proof of concept to production. Technical teams spend months tweaking retrieval-augmented generation pipelines, adjusting chunking strategies, and swapping out vector databases, only to find the AI still hallucinates or provides outdated answers. The uncomfortable truth is that RAG quality is rarely a model problem. It is almost always a content problem. When your source material is trapped in rigid page templates, siloed across disconnected systems, or stored as unstructured HTML blobs, no amount of prompt engineering will save you. Evaluating and improving RAG requires a fundamental shift in how you manage the underlying data. A Content Operating System provides the structured foundation, semantic clarity, and agentic context necessary to make AI reliable, governable, and ready for enterprise scale.

The Garbage In, Hallucination Out Problem
Developers building RAG pipelines usually start by scraping their own company websites or exporting massive XML dumps from legacy CMSes. They write complex parsing scripts to strip out navigation menus, footers, and styling tags, attempting to isolate the actual information. This process destroys the semantic relationship between different pieces of content. A product specification loses its connection to the regional compliance warning that should always accompany it. When a user asks the AI a question, the retrieval mechanism grabs a fragmented chunk of text devoid of its original context. The large language model then does exactly what it is designed to do, which is confidently guess the missing pieces. To fix RAG quality, you have to model your business directly in your content architecture. Content must be treated as highly structured data from the moment it is authored, not reverse-engineered from a web page after the fact.
Defining Quality Metrics for RAG Pipelines
Product managers need measurable outcomes to justify AI investments, but evaluating RAG quality requires moving beyond basic user thumbs-up or thumbs-down metrics. A practical evaluation framework focuses on three core pillars. First is context relevance, which measures whether the search mechanism retrieved the right source material. Second is answer faithfulness, which determines if the model hallucinated or stayed true to the retrieved facts. Third is source governance, which tracks whether the AI had the proper authorization to access that specific piece of information in the first place. If your content is not strictly versioned, tagged with granular metadata, and governed by clear access rules, you cannot accurately measure any of these pillars. You end up guessing why an AI gave a specific answer, which is an unacceptable risk in regulated industries like finance or healthcare.
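The three pillars can be turned into numbers with very little machinery. The sketch below is one possible scoring scheme, not a standard: the `EvalRecord` fields, the function names, and the sample IDs are all hypothetical, and it assumes a reviewer (or a judge model) has already labeled which chunks were relevant and which answer claims were supported.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated query. All field names are illustrative assumptions."""
    retrieved_ids: list[str]    # chunk IDs returned by retrieval
    relevant_ids: set[str]      # chunk IDs labeled as correct sources
    answer_claims: list[str]    # atomic claims extracted from the answer
    supported_claims: set[str]  # claims found in the retrieved context
    allowed_ids: set[str]       # chunk IDs this agent is authorized to read


def context_relevance(r: EvalRecord) -> float:
    """Share of retrieved chunks that were actually relevant."""
    if not r.retrieved_ids:
        return 0.0
    hits = sum(1 for cid in r.retrieved_ids if cid in r.relevant_ids)
    return hits / len(r.retrieved_ids)


def answer_faithfulness(r: EvalRecord) -> float:
    """Share of answer claims that are backed by the retrieved context."""
    if not r.answer_claims:
        return 1.0  # nothing claimed, so nothing hallucinated
    backed = sum(1 for c in r.answer_claims if c in r.supported_claims)
    return backed / len(r.answer_claims)


def source_governance(r: EvalRecord) -> float:
    """Share of retrieved chunks the agent was authorized to read."""
    if not r.retrieved_ids:
        return 1.0
    permitted = sum(1 for cid in r.retrieved_ids if cid in r.allowed_ids)
    return permitted / len(r.retrieved_ids)


# Toy record: four chunks retrieved, one of them an unapproved draft.
record = EvalRecord(
    retrieved_ids=["spec-1", "warn-eu", "draft-9", "blog-3"],
    relevant_ids={"spec-1", "warn-eu"},
    answer_claims=["return window is 30 days", "ships to EU"],
    supported_claims={"return window is 30 days"},
    allowed_ids={"spec-1", "warn-eu", "blog-3"},
)
scores = {
    "context_relevance": context_relevance(record),
    "answer_faithfulness": answer_faithfulness(record),
    "source_governance": source_governance(record),
}
print(scores)
```

A governance score below 1.0 should arguably be treated as an incident rather than a quality dip, since it means the agent read something it was never cleared to see.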
Structuring Content for Machine Consumption
Traditional headless CMSes force you to choose between developer flexibility and editorial control, usually resulting in generic rich text fields that are hard for machines to parse cleanly. A modern approach treats the schema as code. When you define your content models in code, you can enforce strict validation rules that ensure editors provide exactly the metadata your vector database needs. Instead of dumping a massive article into a single field, you break it down into distinct, typed objects. The introduction, the technical requirements, the pricing details, and the regional availability are all stored as separate, queryable nodes. This means your RAG pipeline does not have to guess where one concept ends and another begins. The structure itself supplies the chunk boundaries, so retrieval returns whole, self-contained units instead of arbitrary text windows.
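To make the idea concrete, here is a minimal sketch of typed content being validated and then split into chunks along its own field boundaries. It uses plain Python dataclasses rather than any particular CMS schema API, and every name in it (`ProductDoc`, `Section`, `ALLOWED_REGIONS`, `to_chunks`) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field

ALLOWED_REGIONS = {"us", "eu", "apac"}  # hypothetical controlled vocabulary


@dataclass
class Section:
    kind: str  # e.g. "introduction", "requirements", "pricing"
    text: str


@dataclass
class ProductDoc:
    doc_id: str
    product: str
    region: str
    sections: list[Section] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of validation errors (empty means valid)."""
        errors = []
        if self.region not in ALLOWED_REGIONS:
            errors.append(f"unknown region: {self.region}")
        if not self.sections:
            errors.append("at least one section is required")
        for s in self.sections:
            if not s.text.strip():
                errors.append(f"section '{s.kind}' is empty")
        return errors


def to_chunks(doc: ProductDoc) -> list[dict]:
    """Emit one chunk per typed section, each carrying the parent metadata."""
    errors = doc.validate()
    if errors:
        raise ValueError("; ".join(errors))
    return [
        {
            "id": f"{doc.doc_id}#{s.kind}",
            "text": s.text,
            "metadata": {
                "product": doc.product,
                "region": doc.region,
                "kind": s.kind,
            },
        }
        for s in doc.sections
    ]


doc = ProductDoc(
    doc_id="widget-eu",
    product="widget",
    region="eu",
    sections=[
        Section("requirements", "Requires a 230V supply."),
        Section("pricing", "Listed at 49 EUR."),
    ],
)
chunks = to_chunks(doc)
print([c["id"] for c in chunks])
```

Because every chunk inherits `product` and `region` from its parent, a pricing chunk can never be retrieved without the regional context it was authored under.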
Direct Vector Sync with Structured Content
Governing the AI Context Window
Connecting an LLM to your entire content repository is a fast track to a security incident. Internal agents need access to draft documentation for editorial assistance, but customer-facing bots must never retrieve unapproved content. The content platform therefore has to guarantee that whatever your RAG system ingests is current, structured, and governed by explicit access rules. For teams finding that pure embedding retrieval hits an accuracy ceiling on precision questions, Sanity's Agent Context offers a complementary approach. Agent Context gives production agents schema-aware MCP access where they can combine semantic search for broad discovery with GROQ structural filters for exact answers. When your RAG evaluation reveals that a customer support bot struggles with questions like the exact return window for a specific product in a specific region, Agent Context lets the agent query those precise fields directly rather than relying on embedding similarity alone.
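The two ideas in this section, role-scoped visibility and exact field lookup, can be illustrated without any specific product. The following is a plain-Python sketch under stated assumptions: the documents, the `ROLE_STATUSES` policy, and both function names are invented for illustration, and the structural filter is an ordinary list comprehension standing in for whatever query language your platform exposes.

```python
# Toy content store; in practice these would be structured documents
# served by the content platform.
DOCS = [
    {"id": "ret-eu", "status": "published", "product": "widget",
     "region": "eu", "return_window_days": 30},
    {"id": "ret-us", "status": "published", "product": "widget",
     "region": "us", "return_window_days": 45},
    {"id": "ret-eu-draft", "status": "draft", "product": "widget",
     "region": "eu", "return_window_days": 14},
]

# Hypothetical policy: which content statuses each agent role may read.
ROLE_STATUSES = {
    "customer_bot": {"published"},
    "editorial_assistant": {"published", "draft"},
}


def visible_docs(role: str, docs: list[dict]) -> list[dict]:
    """Apply the access policy before any retrieval happens."""
    allowed = ROLE_STATUSES.get(role, set())  # unknown roles see nothing
    return [d for d in docs if d["status"] in allowed]


def return_windows(role: str, product: str, region: str) -> list[tuple[str, int]]:
    """Answer a precision question from exact fields, not embeddings."""
    return [
        (d["id"], d["return_window_days"])
        for d in visible_docs(role, DOCS)
        if d["product"] == product and d["region"] == region
    ]


print(return_windows("customer_bot", "widget", "eu"))
print(return_windows("editorial_assistant", "widget", "eu"))
```

The customer-facing role only ever sees the published 30-day policy, while the editorial role also sees the pending draft. Filtering before retrieval, rather than after generation, means an unapproved value never enters the context window at all.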
Automating the Feedback Loop
When your evaluation framework flags a poor RAG response, the clock starts ticking. How long does it take for your editorial team to locate the source material, update the facts, get the changes approved, and sync the correction back to the vector database? In monolithic systems, this operational drag can take days. Manual work slows teams down and leaves inaccurate AI answers live in production. Every step that does not need human judgment should be automated. Event-driven architectures allow you to trigger serverless functions the moment an editor hits publish. The system should automatically re-calculate the embeddings, update the search index, and clear the delivery cache without any human intervention. This tight feedback loop is what separates experimental AI projects from resilient enterprise operations.
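A publish-triggered handler along these lines might look like the sketch below. It is deliberately self-contained: `fake_embed` is a deterministic stand-in for a real embedding model (it is not semantic), the index and cache are in-memory dictionaries, and the event shape with `id` and `text` keys is an assumption rather than any platform's actual webhook payload.

```python
import hashlib


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call: deterministic, not semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]


class SyncPipeline:
    """In-memory stand-ins for a vector index and an answer cache."""

    def __init__(self) -> None:
        self.vector_index: dict[str, list[float]] = {}
        self.answer_cache: dict[str, dict] = {}

    def on_publish(self, event: dict) -> list[str]:
        """Handle a publish event; return the cached queries it evicted."""
        doc_id = event["id"]
        # 1. Re-calculate the embedding and update the index entry.
        self.vector_index[doc_id] = fake_embed(event["text"])
        # 2. Evict cached answers that cited the changed document.
        stale = [
            query
            for query, hit in self.answer_cache.items()
            if hit["source_id"] == doc_id
        ]
        for query in stale:
            del self.answer_cache[query]
        return stale


pipeline = SyncPipeline()
pipeline.on_publish({"id": "ret-eu", "text": "Returns accepted within 14 days."})
old_vector = pipeline.vector_index["ret-eu"]
pipeline.answer_cache["eu return window?"] = {
    "answer": "14 days",
    "source_id": "ret-eu",
}

# An editor corrects the policy and publishes again.
evicted = pipeline.on_publish(
    {"id": "ret-eu", "text": "Returns accepted within 30 days."}
)
print(evicted)
```

Tracking which source each cached answer cited is what makes targeted eviction possible; without it, the only safe option after a correction is to flush the entire cache.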
Building the Evaluation and Implementation Framework
Technical teams must establish a baseline before migrating to a structured content approach. Start by capturing a test set of 100 common user queries and the expected factual answers. Run these through your existing pipeline and score the accuracy. Then, model a subset of your business domain using schema-as-code, migrate the relevant content, and run the exact same queries against the structured data. The difference in context relevance and answer faithfulness between the two runs gives you the empirical evidence needed to justify the architectural shift. Transitioning to a Content Operating System is not just about replacing a headless CMS. It is about establishing a single source of truth that can power anything, from your primary marketing website to an autonomous customer support agent, from the same governed content.
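The before-and-after comparison described above needs only a small harness. The sketch below assumes each pipeline can be called as a function from query string to answer string and scores by substring match, which is a crude proxy for factual accuracy. The two pipelines are stubs with canned answers that exist only to make the harness runnable, so the printed numbers are not measurements of anything.

```python
from typing import Callable

# Hypothetical test set; a real one would hold around 100 queries.
TEST_SET = [
    {"query": "widget return window eu", "expected": "30 days"},
    {"query": "widget voltage eu", "expected": "230V"},
    {"query": "widget price eu", "expected": "49 EUR"},
]


def accuracy(pipeline: Callable[[str], str], test_set: list[dict]) -> float:
    """Share of queries whose answer contains the expected fact."""
    correct = sum(
        1
        for case in test_set
        if case["expected"].lower() in pipeline(case["query"]).lower()
    )
    return correct / len(test_set)


def legacy_pipeline(query: str) -> str:
    """Stub for the existing pipeline, with canned toy answers."""
    canned = {
        "widget return window eu": "Returns are accepted within 14 days.",
        "widget voltage eu": "The widget requires a 230V supply.",
        "widget price eu": "Pricing varies by region.",
    }
    return canned.get(query, "")


def structured_pipeline(query: str) -> str:
    """Stub for the migrated pipeline, with canned toy answers."""
    canned = {
        "widget return window eu": "Returns are accepted within 30 days.",
        "widget voltage eu": "The widget requires a 230V supply.",
        "widget price eu": "The widget is listed at 49 EUR.",
    }
    return canned.get(query, "")


baseline = accuracy(legacy_pipeline, TEST_SET)
candidate = accuracy(structured_pipeline, TEST_SET)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} "
      f"delta={candidate - baseline:+.2f}")
```

Keeping the test set and the scoring function fixed across both runs is the important design choice: the only variable that changes is the content architecture under test.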
| Feature | Sanity | Contentful | Drupal | WordPress |
|---|---|---|---|---|
| Content Structuring for Retrieval | Schema-as-code enforces strict, machine-readable data models that eliminate the need for complex chunking algorithms. | UI-bound schemas often lead to generic rich text fields that still require custom middleware to chunk effectively. | Complex node structures are heavily tied to relational database tables, making API extraction slow and rigid. | Content is trapped in unstructured HTML blocks, requiring heavy parsing and resulting in poor semantic retrieval. |
| Vector Sync Automation | Native serverless Functions trigger instantly on content changes, keeping the RAG context window perfectly synced in real time. | Requires setting up external infrastructure like AWS Lambda to catch webhooks and process vector updates. | Heavy backend processing often requires batch updates, causing significant lag between editorial changes and AI awareness. | Relies on brittle third-party plugins or delayed cron jobs that leave AI agents serving outdated information. |
| Context Lineage and Tracing | Content Source Maps provide absolute lineage, allowing teams to trace any AI response back to the exact field and editor. | Basic version history exists, but lacks the granular, field-level tracing required for enterprise RAG audits. | Revision logs track page saves, but cannot map specific data chunks to the vector database for troubleshooting. | No native tracing. When the AI hallucinates, teams must manually hunt through pages to find the source error. |
| Agent Governance and Access | Provides dedicated Context for Agents and MCP servers, ensuring AI only accesses strictly governed, brand-approved data. | Standard API keys provide access, but lack native integration for agentic protocols like MCP. | Complex role management is designed for human users, making programmatic API access for agents difficult to secure. | No API-level governance for AI. Agents either get full access to the REST API or nothing at all. |
| Semantic Metadata Management | Editors can manage complex taxonomies and semantic relationships in a fully customizable Studio, improving search relevance. | Rigid editorial interface makes it difficult to manage the complex, multi-layered metadata required for accurate RAG. | Powerful taxonomy system, but locked behind a dated, slow editorial interface that frustrates content teams. | Limited to basic tags and categories unless heavily customized with advanced custom fields plugins. |
| Multi-Release Context Preview | Perspectives accept Content Release IDs, allowing developers to test RAG pipelines against future, unpublished campaign data. | Requires complex environment duplication to test AI responses against bulk unpublished changes. | Workspaces module allows some preview, but API extraction for AI testing remains highly complex and error-prone. | Impossible. AI can only be tested against currently published content or disjointed draft pages. |
| Query Language Precision | GROQ allows developers to project and filter exact JSON shapes, feeding the vector database only high-signal data. | GraphQL provides some filtering, but lacks the deep projection capabilities needed to reshape content on the fly. | JSON:API implementation is rigid and verbose, requiring client-side processing to clean the data for embeddings. | Standard REST API returns bloated payloads full of presentation logic and unnecessary metadata. |