
Ground Groq-powered agents in Sanity Context

Groq

Groq's LPU inference runs Llama, Mixtral, and other open models at hundreds of tokens per second, so your agent loop replies in well under a second.


You moved your agent to Groq because the latency was embarrassing everywhere else. Now a tool call returns in 300ms, the model streams a full answer before the user finishes reading the question, and the demo lands. Then someone asks "what changed in the Q3 pricing doc?" and your blazing-fast agent confidently invents a number that was never in any document. The speed didn't fix anything. It just got you to the wrong answer faster.

That's the trap with Groq. The LPU makes the model cheap to call, so you call it more, and every extra turn that runs on stale or missing context compounds the error. A 200ms response built on the wrong retrieved chunk is worse than a 2-second response built on the right one, because the user trusts the fast one.

Sanity Context gives the loop somewhere real to look: its Context MCP endpoint takes a GROQ query and returns actual schema-bound content, not a guess interpolated from training data. That changes what Groq's speed is actually worth.

This article is about the half of a Groq agent that isn't the model: how you feed it. We'll cover keeping your token budget honest so you actually use the throughput, structuring retrieval so the speed buys you more reasoning turns instead of more hallucinations, and wiring a governed structured-content source under the loop so the agent stops guessing at facts it should be able to look up.

Why Groq's speed makes bad retrieval more dangerous, not less

Groq's API is OpenAI-compatible and the Groq SDK mirrors the OpenAI client, which is the whole point. You either point your existing OpenAI client at Groq's base URL or swap in the Groq SDK, set the model to something like `llama-3.3-70b-versatile`, and your existing agent code runs at LPU speed. Tool calls that used to feel like a deliberate round trip now feel free.

That changes how you build. When a model call costs you 4 seconds, you minimize calls: one big prompt, one answer. When it costs 300ms, you start chaining. Plan the task, call a tool, reflect, call another tool, summarize. Multi-turn agent loops that were too slow to ship on GPT-4 become viable on Groq. This is good, right up until you notice that every one of those turns reads from the same context you assembled at the top.

Here's the failure mode in practice. Your agent retrieves three documents at the start, stuffs them into the system prompt, then runs eight reasoning turns. If the retrieval was wrong, all eight turns are wrong, and they run fast enough that nobody watches them happen. On a slow model you'd see the bad answer forming and kill the request. On Groq the whole thing completes before you can intervene, and the confident tone of a 70B model makes the hallucination read as fact.

Speed amplifies whatever you feed it. If your retrieval layer is a vector search that returns the closest-looking chunk regardless of whether it's the current pricing doc or last year's, Groq will turn that wrong chunk into a fluent, fast, wrong answer. The fix is never a faster model. It's making sure the context that goes into the loop is correct before the first token streams.

⚠️

Fast wrong answers read as true

Users trust fast replies more than they should. A 200ms reply gets less scrutiny than a 3s one, so retrieval bugs that a slow model would have surfaced sail straight through on Groq. Validate retrieval quality before you optimize for throughput, not after.

Spend your token budget on the right context, not more of it

Groq bills per token and rate-limits on tokens per minute (TPM), and the LPU's throughput advantage shrinks as your prompt grows. A 2,000-token prompt starts streaming a reply almost instantly. A 30,000-token prompt where you dumped your entire knowledge base into the system message eats your TPM limit, slows the time-to-first-token, and buries the relevant fact in noise the model has to read past.
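
To put rough numbers on that, here is a back-of-envelope calculation. The 60,000 TPM limit and the 500-token reply size are illustrative assumptions, not Groq's published limits, and the sketch assumes prompt and completion tokens both count toward the limit. Substitute the figures from your own account.

```python
def max_requests_per_minute(tpm_limit: int, prompt_tokens: int, completion_tokens: int) -> int:
    """How many whole requests fit inside a tokens-per-minute budget."""
    return tpm_limit // (prompt_tokens + completion_tokens)


TPM_LIMIT = 60_000        # hypothetical limit, check your own account
COMPLETION_TOKENS = 500   # hypothetical average reply size

lean = max_requests_per_minute(TPM_LIMIT, 2_000, COMPLETION_TOKENS)
bloated = max_requests_per_minute(TPM_LIMIT, 30_000, COMPLETION_TOKENS)

print(lean, bloated)  # prints "24 1"
```

Under those assumptions the lean prompt leaves room for 24 turns a minute and the bloated one leaves room for a single turn, which is the opposite of what you moved to Groq for.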

The instinct on a fast model is to throw more context at it because you can. Resist that. The job is to put the *right* 2,000 tokens in front of the model, not the most. That means retrieval has to do real filtering before the prompt is assembled, not after.
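
One way to enforce that is a hard budget at prompt-assembly time. The sketch below is a minimal version under stated assumptions: documents arrive already filtered and ranked best first, each has a `body` field, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer. The names `estimate_tokens` and `fit_to_budget` are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate of ~4 characters per token. An assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(docs: list[dict], budget_tokens: int = 2_000) -> list[dict]:
    """Keep docs in ranked order until the next one would exceed the budget."""
    kept, used = [], 0
    for doc in docs:  # assumed already filtered and ranked, best first
        cost = estimate_tokens(doc["body"])
        if used + cost > budget_tokens:
            break
        kept.append(doc)
        used += cost
    return kept


docs = [
    {"_id": "pricing-q3", "body": "x" * 4_000},   # ~1,000 tokens
    {"_id": "pricing-faq", "body": "x" * 3_200},  # ~800 tokens
    {"_id": "old-blog", "body": "x" * 2_000},     # ~500 tokens, would overflow
]

selected = fit_to_budget(docs)
print([d["_id"] for d in selected])  # prints ['pricing-q3', 'pricing-faq']
```

The budget only helps if the ranking in front of it is good. It cuts from the bottom, so whatever retrieval put first is what the model reads.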

Here's a Groq tool-calling loop in TypeScript. The agent declares a `search_docs` tool, the model decides when to call it, and you run the actual retrieval. The point is that the retrieval function is where correctness lives. Groq just orchestrates the calls.

Groq tool-calling loop with a retrieval tool

The model picks the tool and args; your searchDocs function decides what's correct.

import Groq from "groq-sdk";

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

const tools = [
  {
    type: "function" as const,
    function: {
      name: "search_docs",
      description: "Retrieve current docs for a question",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string" },
          productLine: { type: "string" },
        },
        required: ["query"],
      },
    },
  },
];

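// searchDocs is your own retrieval function, not part of groq-sdk.
// Stubbed here so the example type-checks; swap in real retrieval.
async function searchDocs(query: string, productLine?: string): Promise<unknown[]> {
  return [];
}

const question = { role: "user" as const, content: "What changed in Q3 pricing?" };

const res = await groq.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [question],
  tools,
  tool_choice: "auto",
});

const reply = res.choices[0].message;
const call = reply.tool_calls?.[0];
if (call?.function.name === "search_docs") {
  const args = JSON.parse(call.function.arguments);
  const docs = await searchDocs(args.query, args.productLine);

  // Feed the docs back as a tool message, then let the model answer from them.
  const final = await groq.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [
      question,
      reply,
      { role: "tool", tool_call_id: call.id, content: JSON.stringify(docs) },
    ],
  });
  console.log(final.choices[0].message.content);
}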

Structured retrieval beats semantic search for most agent questions

Look at the questions a real agent actually gets. "What changed in Q3 pricing?" has a structural component: it's about the *current published* pricing doc, for *Q3*, in a specific *product line*. Pure embedding similarity doesn't resolve any of that. It returns whatever chunk reads closest to the words in the question, which might be the Q3 doc, might be last year's Q3 doc, might be a blog post that quotes old pricing.

This is the part most RAG tutorials get backwards. They reach for a vector database first and treat semantic search as the default. In production, the large majority of useful agent retrieval is structured: filter by status equals published, by date range, by author, by product variant, then return the matching records. Semantic similarity is one ingredient you add when keyword and structure aren't enough to disambiguate, not the foundation.

The practical version of this inside a `searchDocs` function: resolve the structural predicates first. If the user asked about Q3 pricing, your query should constrain to documents where the type is pricing, the quarter is Q3, and the publication state is live, before any similarity ranking runs. That alone eliminates most of the wrong-chunk failures, and it's deterministic, so it's testable. You can write an assertion that the Q3 question never returns the Q2 doc. You cannot write that assertion against cosine distance.
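
Here is what that assertion can look like. This is a sketch over an in-memory list, not a real content store: the field names (`type`, `quarter`, `status`), the document IDs, and the `structural_filter` helper are all hypothetical stand-ins for whatever your retrieval layer exposes.

```python
DOCS = [
    {"_id": "pricing-q2", "type": "pricing", "quarter": "Q2", "status": "published"},
    {"_id": "pricing-q3", "type": "pricing", "quarter": "Q3", "status": "published"},
    {"_id": "pricing-q3-draft", "type": "pricing", "quarter": "Q3", "status": "draft"},
    {"_id": "blog-q3-recap", "type": "post", "quarter": "Q3", "status": "published"},
]


def structural_filter(doc_type: str, quarter: str, status: str = "published") -> list[dict]:
    """Exact-match predicates only, so the same input always returns the same docs."""
    return [
        d for d in DOCS
        if d["type"] == doc_type and d["quarter"] == quarter and d["status"] == status
    ]


def test_q3_question_never_returns_q2() -> None:
    ids = {d["_id"] for d in structural_filter("pricing", "Q3")}
    assert "pricing-q2" not in ids   # wrong quarter
    assert ids == {"pricing-q3"}     # the draft and the blog post are excluded too


test_q3_question_never_returns_q2()
```

Because the filter is deterministic, this test either passes or fails for a reason you can read. It belongs in CI next to your other unit tests.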

Reach for embeddings when the query is genuinely fuzzy and the structure has already narrowed the candidate set: "find the doc that explains the thing about overage billing" where the user doesn't know the exact term. Run the structural filter, then rank what survives by semantic similarity. That ordering, structure first and similarity second, is what keeps a fast Groq loop pointed at the right document.
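
The ordering can be sketched in a few lines. To keep the example self-contained, a bag-of-words cosine stands in for embedding similarity; that is an assumption for illustration, and a real system would compare embedding vectors instead. The `retrieve` helper and the sample documents are hypothetical.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, a stand-in for embedding similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[dict], **predicates) -> list[dict]:
    # 1. Structure first: drop everything that fails an exact predicate.
    candidates = [d for d in docs if all(d.get(k) == v for k, v in predicates.items())]
    # 2. Similarity second: rank only the survivors.
    return sorted(candidates, key=lambda d: bow_cosine(query, d["body"]), reverse=True)


DOCS = [
    {"_id": "billing-overage", "status": "published",
     "body": "how overage billing is charged when usage exceeds the plan"},
    {"_id": "billing-invoices", "status": "published",
     "body": "how to download invoices and receipts"},
    {"_id": "billing-overage-2022", "status": "archived",
     "body": "overage billing overage billing explained"},
]

ranked = retrieve("explain overage billing", DOCS, status="published")
print([d["_id"] for d in ranked])  # prints ['billing-overage', 'billing-invoices']
```

The archived doc is the closest textual match to the query, and it never gets ranked because the status predicate removed it first. Flip the order and it wins.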

ℹ️

Vector search is one ingredient, not the recipe

Most agent queries carry a structural signal (a date, a status, an author, a product variant) that embeddings can't resolve. Filter on those predicates first, then rank the survivors by similarity if you still need to. Leading with vector search is why the wrong-but-similar chunk keeps winning.

Group failed turns by what the agent retrieved, not what it said

When a Groq agent gives a bad answer, the debugging instinct is to look at the model output and tweak the prompt. That's usually the wrong layer. The model said something fluent and wrong because it was handed something wrong. The bug is in retrieval far more often than in generation.

So log what the agent *saw*, not just what it *said*. For every turn, record the retrieval call: the query the model generated, the filters applied, the documents returned, and their IDs. When you wire an OpenTelemetry exporter or a tool like Helicone around the Groq client, attach those retrieval traces as spans alongside the completion. Now a bad answer has a paper trail you can follow back to the chunk that caused it.

Do this and your eval story changes shape. Instead of grading final answers, which is noisy and subjective, you grade retrieval: did the Q3 pricing question return the Q3 pricing doc? That's a binary you can assert in CI. Group your failed traces by retrieved document ID and patterns jump out fast. If half your bad answers trace back to the same stale doc that should have been unpublished, that's a content problem, not a prompt problem, and no amount of prompt engineering fixes it.
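
The grouping itself is a few lines once the traces carry document IDs. The trace shape below (`question`, `doc_ids`, `passed`) is an assumed format for illustration; adapt it to whatever your tracing tool exports.

```python
from collections import Counter


def failures_by_doc(traces: list[dict]) -> list[tuple[str, int]]:
    """Count how many failed turns each retrieved document appeared in."""
    counts: Counter = Counter()
    for trace in traces:
        if not trace["passed"]:
            counts.update(set(trace["doc_ids"]))  # count each doc once per trace
    return counts.most_common()


traces = [
    {"question": "What changed in Q3 pricing?",
     "doc_ids": ["pricing-q3-2023", "pricing-faq"], "passed": False},
    {"question": "Is the Q3 price increase live?",
     "doc_ids": ["pricing-q3-2023"], "passed": False},
    {"question": "What is the current Q3 price?",
     "doc_ids": ["pricing-q3-2024"], "passed": True},
    {"question": "Do refunds follow Q3 pricing?",
     "doc_ids": ["pricing-q3-2023", "refund-policy"], "passed": False},
]

worst = failures_by_doc(traces)
print(worst[0])  # prints ('pricing-q3-2023', 3)
```

One document sitting behind every failed turn is a strong hint that the fix is unpublishing or correcting that document, not rewriting the prompt.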

The reason this matters more on Groq specifically: the loop runs so fast that you can't catch failures by watching them. Your only window into what went wrong is the trace. Make the trace include retrieval and the fast loop becomes debuggable. Skip it and you're left re-running prompts and hoping, which on a model this fast means burning a lot of tokens to learn very little.

Wrap the Groq call so retrieval is in the trace

Every answer carries the doc IDs that produced it, so failures group by source.

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

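def answer(question: str):
    # plan_retrieval, search_docs, render_context, and trace are your own
    # application helpers, not part of the groq package.
    query, filters = plan_retrieval(question)
    docs = search_docs(query, filters)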

    # log what the agent SAW before it speaks
    trace.log(
        "retrieval",
        query=query,
        filters=filters,
        doc_ids=[d["_id"] for d in docs],
    )

    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": render_context(docs)},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

Point the loop at a governed structured-content source

Now the part that closes the loop. Your `searchDocs` function is only as good as what it queries. If it reads from a vector index you sync nightly from a CMS, you've built two sources of truth, and the lag between them is exactly where the stale-Q3-doc bug lives. The agent needs to retrieve from a source that is structured, queryable by predicate, and governed, so that "current published pricing" is a fact the query can express and an editor can control.

This is where Sanity Context fits under a Groq agent. Sanity is the Content Operating System for the AI era: your content is modeled as typed documents in the Content Lake, and you query it with GROQ, which gives you exactly the structure-first retrieval the previous sections argued for. Status, date range, product line, and publication state are all predicates you filter on directly. The same query can layer in semantic ranking when you need it, but the structured filter does the heavy lifting, which matches what production retrieval actually looks like.

The fastest way to wire this into a Groq loop is the Context MCP endpoint, a hosted, read-only MCP server. Your agent attaches it as an MCP server and gets schema-aware retrieval tools without you hand-writing query code. Read-only is the right default for an agent: it can look things up, but writes go through Sanity's governed Agent Actions, not the agent loop, so a confused model can't mutate your pricing. For teams that want full query control, a thin `tool()` wrapper running a typed GROQ query is the second path.

For messy, unstructured sources (PDFs, support exports, scraped websites), Sanity Context has Knowledge Bases, which turn that sludge into well-ordered documents with a table of contents the agent can navigate. Structured catalog content goes through GROQ; unstructured corpora go through Knowledge Bases. Either way the agent retrieves from one governed source instead of a vector index that drifted out of date last Tuesday.

A structure-first GROQ query the agent retrieves with

Structural predicates filter first; score() ranks the survivors by keyword and semantic similarity.

*[_type == "pricingDoc"
  && quarter == $quarter
  && productLine == $productLine
  && status == "published"
] | score(
    boost(title match text::query($queryText), 2),
    text::semanticSimilarity($queryText)
  )
  | order(_score desc)[0...3]{
    _id, title, quarter, body
  }
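
If you take the hand-written path instead of the MCP endpoint, the tool body mostly comes down to sending a parameterized query over Sanity's HTTP query API. The sketch below only builds the request URL, so it runs without credentials. The project ID, dataset name, and API version are placeholders, authentication is omitted, and the query is trimmed to the structural filter from the example above; treat it as a starting point, not the official client.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Structural part of the query only; field names follow the example above.
QUERY = (
    '*[_type == "pricingDoc" && quarter == $quarter '
    '&& productLine == $productLine && status == "published"]'
    "[0...3]{_id, title, quarter, body}"
)


def build_query_url(project_id: str, dataset: str, query: str, params: dict,
                    api_version: str = "2021-10-01") -> str:
    """Build a GET URL for the query API; each param is sent as $name=<JSON value>."""
    qs = {"query": query}
    for name, value in params.items():
        qs[f"${name}"] = json.dumps(value)  # JSON-encoding keeps strings quoted
    base = f"https://{project_id}.api.sanity.io/v{api_version}/data/query/{dataset}"
    return f"{base}?{urlencode(qs)}"


url = build_query_url(
    "your-project-id",  # placeholder
    "production",       # placeholder dataset name
    QUERY,
    {"quarter": "Q3", "productLine": "teams"},
)
print(url)
```

Passing values as parameters rather than formatting them into the query string means the model's tool arguments never get spliced into GROQ directly.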

Store agent instructions where humans can version them

One more thing the speed of Groq tends to expose: prompt sprawl. When the loop is cheap to run, teams iterate on system prompts, retrieval instructions, and approved response templates constantly, and those strings end up hardcoded across a dozen files or in a config nobody reviews. There's no history, no preview, and no way for a non-engineer to fix a wrong instruction without a deploy.

Draw the line clearly. Ephemeral per-user state, the running chat history for the current session, belongs in something like Upstash or Redis. It's hot, it's disposable, and it shouldn't be versioned. But the *editorial* state, your agent's instructions, its brand voice, the approved answers it should prefer, the knowledge it retrieves from, is content. It should be modeled, versioned, edited by the people who own the messaging, and previewed before it goes live.

That's the case for keeping those artifacts in Sanity rather than in code. The same Content Lake your agent retrieves facts from also holds its instructions, and Content Releases gives you the stage-and-preview workflow: a content editor drafts a new set of approved responses, reviews them in a release, and ships them without a redeploy of your Groq app. The agent picks up the change on its next retrieval because it reads from the same governed source. Roles and permissions decide who can touch what.
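
In code, the split looks like this: the instructions arrive as a document fetched at request time, the chat history arrives from the session store, and nothing editorial is hardcoded. This is a minimal sketch; the field names `systemPrompt` and `brandVoice` and the `build_messages` helper are hypothetical, and the fetch itself is left out.

```python
def build_messages(instructions: dict, history: list[dict], question: str) -> list[dict]:
    """Assemble a chat request from editorial content plus ephemeral session state."""
    system = f'{instructions["systemPrompt"]}\n\nVoice: {instructions["brandVoice"]}'
    return [
        {"role": "system", "content": system},  # editorial: versioned content
        *history,                               # ephemeral: from the session store
        {"role": "user", "content": question},
    ]


# Stand-in for a document fetched from the content source (hypothetical shape).
instructions = {
    "_id": "agent-instructions",
    "systemPrompt": "Answer only from the retrieved documents.",
    "brandVoice": "plain and direct",
}
# Stand-in for history loaded from Redis or similar.
history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello, how can I help?"},
]

messages = build_messages(instructions, history, "What changed in Q3 pricing?")
print(messages[0]["content"])
```

When an editor publishes a new version of the instructions document, the next call to `build_messages` uses it. No string in the application changed, so nothing needs a deploy.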

The net effect is a fast loop you can stay calm about. Groq gives you throughput; the structured-content layer underneath gives you correctness and control. The model stays fast, the facts stay current, the instructions stay reviewable, and when something does go wrong your traces point at a specific document you can open and fix. That's the combination that survives contact with production, not raw tokens per second.

✨

One governed source under a fast loop

The same Content Lake feeds the agent its facts and its instructions, both queryable with GROQ, both versioned, both previewable through Content Releases before going live. Groq supplies the speed; the structured-content layer keeps every fast answer pointed at current, governed content.