Deploy Vercel-hosted AI agents on Sanity Context

Your agent works in `vercel dev`. It streams tokens, calls tools, answers correctly. Then you deploy it as an API route, real traffic hits it, and the logs fill up with `Task timed out after 10.00 seconds` and the occasional `FUNCTION_INVOCATION_TIMEOUT`. The model didn't get worse. The runtime did.

Part of the retrieval problem has a ready answer: Sanity Context, Sanity's agent-facing product, exposes a hosted Context MCP endpoint your Vercel function can query over HTTP without bundling anything into the build. The rest of the problems are yours to configure.

Vercel-hosted agents fail in ways your local machine never shows you: the function has a hard wall-clock limit, the runtime you picked (Node vs Edge) silently changes which SDKs even load, cold starts add latency right where the user is staring at a blank screen, and every retrieval round-trip to your data eats into a budget you didn't know you had. None of these are model problems. They're config problems.

This article walks through the four knobs that decide whether a Vercel agent survives production: function duration and the streaming contract, runtime selection, where retrieval happens, and how you keep the agent's knowledge fresh without redeploying. The last knob is where Sanity Context earns its place in the stack, but only after the first three are set right.

Knob 1: maxDuration and the streaming contract

The default timeout on a Vercel function is short, and the agent loop is the one workload guaranteed to blow past it. A multi-step agent that calls a tool, waits on retrieval, calls the model again, then formats a response can easily run for more than 20 seconds. On the Hobby plan you can get up to 60 seconds of wall clock; on Pro you can push `maxDuration` to 300. But the number isn't the real fix; the streaming contract is.

If you return a single JSON blob at the end, the function holds the connection open for the entire run and the user sees nothing until it completes. Stream instead, and the first token flushes in under a second while the rest of the work continues. The connection stays alive because bytes keep flowing, and the perceived latency drops to whatever your first retrieval takes.

Set `maxDuration` explicitly in the route, and use the AI SDK's streaming helpers so the response is a `ReadableStream`, not a resolved promise. The export is read at build time, so it has to be a static value; you can't compute it from a request.
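To make the buffering-versus-streaming difference concrete without the AI SDK, here is a minimal sketch using only the Web `ReadableStream` and `Response` APIs. The `fakeAgentSteps` generator and the `streamSteps` helper are hypothetical names standing in for a real agent loop; the only point is that each chunk is enqueued as soon as it exists rather than collected into one final body.

```typescript
// Hypothetical stand-in for an agent loop: yields text as each step finishes.
async function* fakeAgentSteps(): AsyncGenerator<string> {
  yield "Looking that up... ";
  // Simulate a slow retrieval or tool call between chunks.
  await new Promise((resolve) => setTimeout(resolve, 50));
  yield "here is the answer.";
}

// Wraps any async iterable of strings in a streaming Response.
// Each chunk is enqueued as soon as it is produced, so the client
// receives the first bytes before the later steps have run.
function streamSteps(steps: AsyncIterable<string>): Response {
  const encoder = new TextEncoder();
  const iterator = steps[Symbol.asyncIterator]();
  const body = new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
  });
  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

In a route handler this would be `return streamSteps(fakeAgentSteps());`. The AI SDK helpers do the equivalent work for model output, so in practice you would rarely write this by hand.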

⚠️

maxDuration is a ceiling, not a guarantee

Setting `maxDuration = 300` does nothing on a Hobby plan: it's capped at 60s and the deploy silently clamps it. If you see `FUNCTION_INVOCATION_TIMEOUT` with a 300 in your config, check the plan tier before you blame the model.

Streaming route with an explicit duration ceiling

An App Router API route that streams instead of buffering, with a 300s ceiling for the multi-step loop.

import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';

// Read at build time — must be a static literal, not computed.
export const maxDuration = 300;

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-4o'),
    messages,
    maxSteps: 5, // tool-call loop budget
  });

  // Flushes the first token immediately; keeps the
  // connection alive while later steps run.
  return result.toDataStreamResponse();
}

Knob 2: Edge vs Node runtime, pick by what your tools need

Vercel gives you two runtimes per route, and the choice changes which code even loads. The Edge runtime boots in single-digit milliseconds and streams beautifully, but it runs a constrained Web-API environment: no native Node modules, no `fs`, no long-lived TCP connections. The Node runtime is heavier on cold start but runs the full ecosystem: most database drivers, the AWS SDK, anything that opens a raw socket.

Agents trip on this constantly. Your model call and your `fetch`-based tools run fine on Edge. The moment you add a tool that uses a Postgres driver or a vector-DB client that opens a persistent connection, you get a runtime error at the edge that never showed up locally on Node. The error is usually a cryptic `Module not found: Can't resolve 'net'` or a runtime crash deep in a dependency.

The rule that holds up in production: keep the agent's request handler on Edge for fast streaming, and make sure every tool talks over plain HTTP/`fetch` rather than a native driver. If a tool genuinely needs a Node-only client, move that route to the Node runtime and accept the cold-start cost; don't try to force the driver onto Edge.
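One way to keep the streaming handler on Edge while still using a Node-only client is to reach that client over HTTP. The sketch below assumes the Node-runtime route lives at `/api/heavy-tool` and speaks JSON in both directions; `callHeavyTool`, its payload shape, and the injectable `fetchImpl` parameter are illustrative and not part of any Vercel API.

```typescript
// Minimal fetch signature so a test double can be injected.
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

// Calls a Node-runtime route from Edge code over plain HTTP.
// `baseUrl` and the `/api/heavy-tool` path are assumptions for this sketch.
async function callHeavyTool<T>(
  baseUrl: string,
  payload: unknown,
  fetchImpl: FetchLike = fetch,
): Promise<T> {
  const url = new URL("/api/heavy-tool", baseUrl).toString();
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) {
    // Surface the status so the agent loop can decide whether to retry.
    throw new Error(`heavy-tool failed with HTTP ${res.status}`);
  }
  return (await res.json()) as T;
}
```

The extra hop costs a round-trip and possibly a Node cold start, so this pattern suits tools the agent calls occasionally rather than on every step.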

Declaring the runtime per route

Runtime is a per-route export. Split the streaming handler (Edge) from any route that needs native Node modules.

// app/api/agent/route.ts
// Fast cold start + clean streaming. Every tool here must
// use fetch(), not a native socket-based driver.
export const runtime = 'edge';

// app/api/heavy-tool/route.ts
// Needs a Node-only client (e.g. a Postgres driver that
// opens a TCP socket). Pay the cold-start cost here, not
// in the streaming path.
export const runtime = 'nodejs';

Knob 3: where retrieval happens decides your latency budget

Inside a streamed agent run, the slowest step is almost never the model; it's retrieval. Every tool call that fetches context is a network round-trip from your Vercel region to wherever your data lives. Put your vector DB in `us-east-1` and your function in `iad1` and you're fine; get the regions wrong and each retrieval adds 100 to 200 ms of pure transit, multiplied by every step in the loop.

The second tax is shape. A naive RAG tool does an embedding lookup, gets back ten chunks, and the agent then needs three more tool calls to resolve the structural facts the user actually asked about: which product variant, which publication state, which date range. Pure vector similarity can't answer 'the published pricing page for the EU region' because that's a structural query wearing a semantic costume. So the loop burns extra steps, each one another round-trip against your `maxDuration` budget.
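The arithmetic behind both taxes is simple enough to sketch. The function below is illustrative only: `retrievalOverheadMs` is a hypothetical helper, and the inputs are the figures used in this section (one lookup plus three follow-up calls, at the upper end of 100 to 200 ms of cross-region transit per call).

```typescript
// Transit time spent on retrieval alone, ignoring query execution
// and model time. Both inputs are values you would measure yourself.
function retrievalOverheadMs(roundTrips: number, transitMsPerTrip: number): number {
  return roundTrips * transitMsPerTrip;
}

// Naive RAG as described above: 1 embedding lookup + 3 follow-up calls.
const naive = retrievalOverheadMs(4, 200);
// The same question answered by one combined query.
const combined = retrievalOverheadMs(1, 200);

console.log({ naive, combined, saved: naive - combined });
```

Under these assumptions the naive loop spends 800 ms in transit against 200 ms for a single call, before any model or query time is counted.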

The fix is to collapse retrieval into one call that handles both the structural predicate and the fuzzy match. That's where a GROQ query against Sanity Context fits: structural filters live as predicates inside `*[ ... ]`, and when you genuinely need fuzzy matching you add it in the same query rather than in a separate vector hop. Most production projects on Context MCP never turn embeddings on at all; the heavy majority of agent retrieval is structured GROQ and schema lookups, and that resolves in one trip.

ℹ️

Embeddings are opt-in, and usually off

If your agent's retrieval failures are 'wrong document for a structural query' (wrong region, draft instead of published), more embeddings won't fix it; a structural predicate will. Reach for semantic similarity only when the failure is genuinely about meaning, not metadata.

A single retrieval tool that does structure first, fuzzy second

Here's the practical version of Knob 3. Instead of an embedding-lookup tool plus three follow-up tools, define one tool whose query carries the structural filter as predicates and, only where needed, layers semantic scoring on top. The structural part (publication state, region, type) runs as predicates inside the `*[ ... ]` filter. The fuzzy part runs through `score()` with `text::semanticSimilarity()` against the query text, and `order(_score desc)` ranks the result.

Wiring this into a Vercel-hosted AI SDK agent is straightforward: the GROQ API is HTTP-only, so it works on the Edge runtime with a plain `fetch`-based client. There's no native driver to break at the edge, which keeps you on the fast-streaming path from Knob 2.

The `text::semanticSimilarity($queryText)` function takes the query text directly; you don't manage an embeddings column yourself. And the structural predicates do the heavy lifting: in production the majority of these calls never invoke semantic similarity at all, because the filter alone returns the right document.
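The paragraphs above describe the wiring without showing it, so here is a minimal sketch of the HTTP side of such a tool. It assumes Sanity's HTTP query endpoint (`https://<projectId>.api.sanity.io/v<apiVersion>/data/query/<dataset>`), where GROQ parameters are passed as `$name` query-string keys with JSON-encoded values, and a dataset that is readable without a token. The helper names, the `apiVersion` value, and the injectable `fetchImpl` are illustrative; whether `text::semanticSimilarity()` is available depends on your project and API version, so verify that before relying on it.

```typescript
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

interface GroqTarget {
  projectId: string;
  dataset: string;
  apiVersion: string; // e.g. "2024-10-01"; use a version your project supports
}

// Structural predicates first, relevance scoring second.
const ARTICLE_QUERY = `*[
  _type == "article" && publishedAt < now() && region == $region
] | score(
  boost(title match text::query($queryText), 3),
  text::semanticSimilarity($queryText)
) | order(_score desc) [0...5] { _id, title, region, publishedAt, body }`;

// Builds the GET URL; each GROQ param becomes `$name=<JSON value>`.
function buildQueryUrl(
  target: GroqTarget,
  query: string,
  params: Record<string, unknown>,
): string {
  const url = new URL(
    `https://${target.projectId}.api.sanity.io/v${target.apiVersion}/data/query/${target.dataset}`,
  );
  url.searchParams.set("query", query);
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(`$${name}`, JSON.stringify(value));
  }
  return url.toString();
}

// Runs the query with plain fetch, so nothing here needs a native driver.
async function searchArticles(
  target: GroqTarget,
  region: string,
  queryText: string,
  fetchImpl: FetchLike = fetch,
): Promise<unknown[]> {
  const res = await fetchImpl(buildQueryUrl(target, ARTICLE_QUERY, { region, queryText }));
  if (!res.ok) throw new Error(`GROQ query failed with HTTP ${res.status}`);
  const json = (await res.json()) as { result?: unknown[] };
  return json.result ?? [];
}
```

Wrapped in the AI SDK's `tool()` helper, a function like `searchArticles` becomes the single retrieval tool the agent calls, with `region` and `queryText` as its parameters.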

One GROQ query: structural predicate + optional semantic score

Structural filters (type, published, region) are predicates in the *[ ... ] filter; relevance is ranked by score() over a BM25 title match and semantic similarity.

*[
  _type == "article" &&
  publishedAt < now() &&
  region == $region
] | score(
  boost(title match text::query($queryText), 3),
  text::semanticSimilarity($queryText)
) | order(_score desc) [0...5] {
  _id, title, region, publishedAt, body
}

Wiring Sanity Context into the AI SDK as an MCP server

You don't have to hand-write a GROQ tool to get started. Sanity ships Context MCP, a hosted, read-only MCP endpoint, and the AI SDK can attach it as a tool source directly. The agent gets schema-aware tools out of the box: it can introspect the content model and query it without you authoring each tool by hand. This is the fastest path in; a custom `tool()` wrapper around the GROQ query from the previous section is the second path, for when you want full control over the exact query.

The read-only constraint matters on a hosted agent. Via MCP the agent can read and query your content but it cannot mutate it. Writes go through a separate, governed path, not the endpoint your edge function exposes to user traffic. That's the behavior you want when an LLM is in the loop on production data.

Mixing both works well: attach the MCP endpoint for general schema-aware querying, and add one custom GROQ `tool()` for the hot-path retrieval where you want the exact predicate-plus-score query from the last section.

Attaching the hosted Context MCP endpoint in an AI SDK route

The AI SDK attaches Sanity's hosted Context MCP endpoint as a tool source. Read-only, HTTP-based, Edge-safe.

import {
  experimental_createMCPClient as createMCPClient,
  streamText,
} from 'ai';
import { openai } from '@ai-sdk/openai';

export const runtime = 'edge';
export const maxDuration = 300;

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Hosted, read-only MCP endpoint — schema-aware tools,
  // no driver to break on Edge.
  const mcp = await createMCPClient({
    transport: {
      type: 'sse',
      url: 'https://mcp.sanity.io/<project>/<dataset>',
    },
  });

  const result = streamText({
    model: openai('gpt-4o'),
    messages,
    tools: await mcp.tools(),
    maxSteps: 5,
    onFinish: () => mcp.close(),
  });

  return result.toDataStreamResponse();
}

Knob 4: freshness without a redeploy

The fourth knob is the one that bites weeks after launch. If your agent's instructions, brand voice, approved-answer snippets, or knowledge base live in a constant in your repo or a hardcoded system prompt, every edit is a code change, a PR, and a Vercel redeploy. Content people can't touch it, and a typo in the agent's tone means a deploy at 11pm.

The split that works: ephemeral per-user state (chat history, session memory) belongs in something like Upstash or Redis, fast and disposable. But the durable, governed content the agent reads (system instructions, approved responses, the knowledge corpus) belongs somewhere versioned, editable by humans, and previewable before it goes live. In a Vercel app that's a content backend the function reads at request time, not at build time.
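The split can be expressed as two narrow interfaces, sketched below. Everything here is hypothetical: `SessionStore` stands in for Upstash or Redis (an in-memory `Map` is used so the example runs on its own), and `ContentSource` stands in for whatever governed backend serves the instructions.

```typescript
type ChatMessage = { role: "user" | "assistant"; content: string };

// Ephemeral, per-user, disposable. In production this would be Redis-like.
interface SessionStore {
  history(sessionId: string): Promise<ChatMessage[]>;
  append(sessionId: string, message: ChatMessage): Promise<void>;
}

// Durable, governed, edited by humans. Read at request time.
interface ContentSource {
  systemPrompt(): Promise<string>;
}

// In-memory stand-in so the sketch is self-contained.
class MemorySessionStore implements SessionStore {
  private sessions = new Map<string, ChatMessage[]>();

  async history(sessionId: string): Promise<ChatMessage[]> {
    return [...(this.sessions.get(sessionId) ?? [])];
  }

  async append(sessionId: string, message: ChatMessage): Promise<void> {
    const existing = this.sessions.get(sessionId) ?? [];
    this.sessions.set(sessionId, [...existing, message]);
  }
}

// Assembles one model call from both sources without mixing their storage.
async function buildModelInput(
  sessionId: string,
  userText: string,
  sessions: SessionStore,
  content: ContentSource,
): Promise<{ system: string; messages: ChatMessage[] }> {
  await sessions.append(sessionId, { role: "user", content: userText });
  const [system, messages] = await Promise.all([
    content.systemPrompt(),
    sessions.history(sessionId),
  ]);
  return { system, messages };
}
```

Keeping the two behind separate interfaces means either backend can be swapped or cached on its own schedule: session data can expire aggressively while the instructions follow the editorial publish cycle.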

This is the rest of what Sanity Context covers. Editorial content is versioned and edited in the Studio; Content Releases let a human stage and preview a change to the agent's instructions before publishing it. For messy, unstructured sources (PDFs, support exports, marketing sites), Knowledge Bases turn that corpus into ordered, queryable documents rather than a pile of chunks. The `next-sanity` client reads it over HTTP from your edge function, and the Live Content API pushes updates through without a redeploy. The agent's knowledge changes when an editor publishes, not when you ship.

💡

Don't put chat history in your CMS

The freshness split runs both ways. Ephemeral per-user state (session memory, scratchpad) should stay in Upstash/Redis. A content backend is for the durable, governed stuff: instructions, brand voice, approved answers, the knowledge corpus. Mixing them just makes both slower.

Reading governed agent instructions at request time

The system prompt is fetched at request time from a governed content backend, so editors change agent behavior without touching the deploy.

import { createClient } from 'next-sanity';
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

const sanity = createClient({
  projectId: process.env.SANITY_PROJECT_ID!,
  dataset: 'production',
  apiVersion: '2024-10-01',
  useCdn: true, // edge-friendly, HTTP-only
});

export const runtime = 'edge';

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Edited by humans, versioned, previewed via Content
  // Releases — changes without a redeploy.
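  // `[0]` yields null when no document matches, so guard
  // before reading the field instead of destructuring.
  const config = await sanity.fetch<{ systemPrompt?: string } | null>(
    `*[_type == "agentConfig" && _id == "main"][0]{ systemPrompt }`
  );
  // Placeholder fallback; replace with your own default.
  const systemPrompt =
    config?.systemPrompt ?? 'You are a helpful assistant.';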

  const result = streamText({
    model: openai('gpt-4o'),
    system: systemPrompt,
    messages,
    maxSteps: 5,
  });

  return result.toDataStreamResponse();
}

Putting the four knobs together

None of these knobs is independent. `maxDuration` only buys you time if you stream, and streaming only stays fast if retrieval is cheap, and retrieval is only cheap if you didn't force a native driver onto the Edge runtime and didn't scatter your data across regions. They form a chain, and production breaks at the weakest link.

The order to set them: pick the runtime first, because it constrains every tool you can write. Set `maxDuration` and switch to streaming second, because that's what keeps the function alive and the user unblocked. Collapse retrieval into one structured-first call third, because that's the step that actually eats your time budget. And move the agent's durable knowledge out of the repo last, because that's the one that turns a content edit into a redeploy if you skip it.

When you do reach for Sanity Context, reach for the surface that matches the problem: the Context MCP endpoint for schema-aware querying with the least wiring, a custom GROQ `tool()` for the hot-path query you want to control exactly, and Knowledge Bases for the unstructured corpus. All three read over HTTP, so all three stay on the fast Edge path you set in Knob 2. The agent that survives production is the one where the model is the least interesting part of the config.