
RAG in production: what actually breaks (and how to avoid it)

Building a demo RAG takes a day. Making it reliable on real business documents is another matter. A field report from the insurance world on chunking, retrieval, evaluation, and guardrails.

By Alexis Maresca · April 2, 2026 · 7 min read

A demo RAG comes together in a day: a vector store, a prompt, and it answers. A production RAG, with real business documents and real users, is another story. Here is what breaks first, as seen in the field on insurance data, where errors are unforgiving.

1. Document chunking

In the field, roughly 90% of bad answers trace back to chunking. Blindly cutting a contract or a policy document into 500-token blocks destroys context: the coverage table, for example, ends up separated from its conditions. The fix: split along the document's real structure (headings, articles, tables), keep the metadata (source, section, date), and prepend a context header to every chunk.
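The splitting strategy described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the heading pattern (lines starting with "Article N" or "Section N"), the Chunk fields, the file name and the sample policy text are all assumptions made for the example. Real contracts will need their own structure detection, especially for tables.

```python
import re
from dataclasses import dataclass

# Assumed heading convention: lines such as "Article 3 - Theft" or "Section 2: Exclusions".
HEADING = re.compile(r"^(?:Article|Section)[ \t]+\d+\b.*$", re.MULTILINE)


@dataclass
class Chunk:
    source: str
    section: str
    date: str
    text: str  # context header followed by the section body


def chunk_by_structure(document: str, source: str, date: str) -> list[Chunk]:
    """Split at headings so each article stays whole, then prepend a context header."""
    matches = list(HEADING.finditer(document))
    chunks = []
    for i, match in enumerate(matches):
        # A section runs from its heading to the next heading (or the end of the document).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        section = match.group(0).strip()
        body = document[match.end():end].strip()
        header = f"[source: {source} | section: {section} | date: {date}]"
        chunks.append(Chunk(source, section, date, f"{header}\n{body}"))
    return chunks


# Made-up policy text, for illustration only.
policy = """Article 1 - Water damage
Covered up to 5,000 EUR.
Condition: the leak must be reported within 5 days.
Article 2 - Theft
Covered up to 2,000 EUR.
"""

chunks = chunk_by_structure(policy, source="home_policy.pdf", date="2026-01-01")
for chunk in chunks:
    print(chunk.text, end="\n\n")
```

Each chunk keeps a coverage limit together with its condition, and the header travels with the text into the index. Note that this sketch ignores any text that appears before the first heading, and a very long article would still need a secondary split.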

2. Retrieval, not generation

When the answer is wrong, the reflex is to blame the model. In most cases the problem is upstream: the right passages never reach the model. A reranker, hybrid search (vectors + keywords), and metadata filters improve answers far more than a better prompt.
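To make the combination of hybrid search and metadata filters concrete, here is a small sketch. It merges a keyword ranking and a vector ranking with reciprocal rank fusion, which is one common way to combine them and not necessarily the method used on the project described here. The keyword scorer is a toy stand-in for BM25, the vector ranking is assumed to come from your vector store, and the "product" metadata field and sample documents are hypothetical. A reranker, omitted here, would typically rescore the top fused results.

```python
from collections import defaultdict


def keyword_rank(query: str, docs: dict[str, dict[str, str]]) -> list[str]:
    """Rank doc ids by how many query terms they contain (toy stand-in for BM25)."""
    terms = set(query.lower().split())
    scores = {
        doc_id: len(terms & set(doc["text"].lower().split()))
        for doc_id, doc in docs.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [doc_id for doc_id, score in ranked if score > 0]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc earns 1 / (k + rank) per list it appears in."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


def hybrid_search(query, docs, vector_ranking, product=None):
    """Apply the metadata filter, then fuse keyword and vector rankings."""
    allowed = {
        doc_id for doc_id, doc in docs.items()
        if product is None or doc["product"] == product
    }
    keyword_hits = [d for d in keyword_rank(query, docs) if d in allowed]
    vector_hits = [d for d in vector_ranking if d in allowed]
    return reciprocal_rank_fusion([keyword_hits, vector_hits])


# Made-up documents; "product" is a hypothetical metadata field.
docs = {
    "a": {"text": "water damage coverage limit", "product": "home"},
    "b": {"text": "theft coverage limit", "product": "home"},
    "c": {"text": "water damage on vehicles", "product": "auto"},
}
# Assumed output of a vector store query, best match first.
vector_ranking = ["c", "a", "b"]

results = hybrid_search("water damage limit", docs, vector_ranking, product="home")
print(results)
```

Here the vector ranking alone would put the auto policy passage first. The metadata filter removes it before fusion, so a question about a home policy cannot be answered from an auto policy.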

3. Evaluation, or nothing

Without an evaluation set, every "improvement" is a gamble. The minimum viable setup: 50 to 100 real questions with expected answers, replayed on every change (chunking, model, prompt). That is what lets you say "this version is better" on the basis of measurements rather than intuition, and sleep at night after a production release.
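A replayable evaluation set does not need much machinery. The sketch below assumes a deliberately crude scoring rule: an answer passes if it contains every expected keyword. That rule is an assumption for illustration; human grading or a stricter comparison may be more appropriate on real data. The EvalCase type, the two stub pipelines and the questions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_keywords: list[str]  # facts the answer must mention


def run_eval(answer_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Replay every question and return the share of answers that pass."""
    passed = 0
    for case in cases:
        answer = answer_fn(case.question).lower()
        if all(keyword.lower() in answer for keyword in case.expected_keywords):
            passed += 1
    return passed / len(cases)


# Two made-up cases; a real set would hold 50 to 100 real user questions.
cases = [
    EvalCase("What is the water damage limit?", ["5,000"]),
    EvalCase("Is theft covered?", ["theft", "2,000"]),
]


# Stub pipelines standing in for two versions of the RAG system.
def pipeline_v1(question: str) -> str:
    return "Water damage is covered up to 5,000 EUR."


def pipeline_v2(question: str) -> str:
    answers = {
        "What is the water damage limit?": "Water damage is covered up to 5,000 EUR.",
        "Is theft covered?": "Yes, theft is covered up to 2,000 EUR.",
    }
    return answers[question]


score_v1 = run_eval(pipeline_v1, cases)
score_v2 = run_eval(pipeline_v2, cases)
print(f"v1: {score_v1:.0%}  v2: {score_v2:.0%}")
```

Running the same cases against both versions turns "this version is better" into a number that can be tracked from one release to the next.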

4. Guardrails

In production, you must plan for refusal: if the retrieved sources are weak, the system should say "I don't know" and route to a human rather than hallucinate coverage that doesn't exist. Fallbacks, logging every answer with its sources, and regular review of refused cases: that is where reliability is won, not in the choice of model.
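The refusal logic above can be sketched as a thin wrapper around generation. Everything named here is an assumption for illustration: the Passage type, the MIN_SCORE threshold of 0.5 (which would have to be tuned on real data), the refusal wording, and the generate callable that stands in for the model call.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("rag.audit")

MIN_SCORE = 0.5  # hypothetical threshold; tune it on your own evaluation set
REFUSAL = "I don't know. Your question has been forwarded to an advisor."


@dataclass
class Passage:
    source: str
    text: str
    score: float  # retrieval or reranker score, assumed to lie in [0, 1]


def answer_with_guardrail(
    question: str,
    passages: list[Passage],
    generate: Callable[[str, list[Passage]], str],
) -> dict:
    """Refuse when no passage is strong enough; log every answer with its sources."""
    strong = [p for p in passages if p.score >= MIN_SCORE]
    if strong:
        answer, refused = generate(question, strong), False
    else:
        answer, refused = REFUSAL, True  # fallback: hand over to a human
    result = {
        "question": question,
        "answer": answer,
        "sources": [p.source for p in strong],
        "refused": refused,
    }
    logger.info(json.dumps(result))  # audit trail for the review of refused cases
    return result


# Stub generator standing in for the model call.
def fake_generate(question: str, passages: list[Passage]) -> str:
    return f"Based on {passages[0].source}: {passages[0].text}"


weak = [Passage("home_policy.pdf#art7", "Unrelated clause.", 0.12)]
strong = [Passage("home_policy.pdf#art1", "Water damage is covered up to 5,000 EUR.", 0.91)]

print(answer_with_guardrail("Is flooding covered?", weak, fake_generate))
print(answer_with_guardrail("What is the water damage limit?", strong, fake_generate))
```

Because refusals are logged with the same structure as answers, the refused cases can be pulled from the log for regular review and fed back into the evaluation set.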

■