RAG in production: what actually breaks (and how to avoid it)
Building a demo RAG takes a day. Making it reliable on real business documents is another matter. A field report from the insurance world, covering chunking, retrieval, evaluation and guardrails.
A demo RAG (retrieval-augmented generation) system comes together in a day: a vector store, a prompt, and it answers. A production RAG, with real business documents and real users, is another story. Here is what breaks first, seen from the field on insurance data, where errors are unforgiving.
1. Document chunking
90% of bad answers come from chunking. Blindly cutting a contract or a policy document into 500-token blocks destroys context: the coverage table ends up separated from its conditions. The solution: split along the document's real structure (headings, articles, tables), keep the metadata (source, section, date) and inject a context header into every chunk.
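As an illustration of structure-aware chunking, here is a minimal sketch. It assumes section headings can be recognised by a simple pattern ("Article N ..." or lines starting with "#"); the HEADING pattern, the Chunk class and the chunk_by_structure function are hypothetical names chosen for this example, and a real contract will likely need a richer parser (tables, nested clauses).

```python
import re
from dataclasses import dataclass

# Assumed heading pattern: "Article 3 ..." or markdown-style "# ..." lines.
HEADING = re.compile(r"^(Article\s+\d+.*|#+\s+.*)$")


@dataclass
class Chunk:
    source: str
    section: str
    date: str
    text: str  # section body, prefixed with a context header


def chunk_by_structure(document: str, source: str, date: str) -> list[Chunk]:
    """Split on headings so each section stays whole, and tag every chunk."""
    chunks: list[Chunk] = []
    section = "Preamble"  # label for any text that precedes the first heading
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            header = f"[source: {source} | section: {section} | date: {date}]"
            chunks.append(Chunk(source, section, date, f"{header}\n{content}"))

    for line in document.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            flush()  # close the previous section before starting a new one
            section = stripped.lstrip("# ").strip()
            body = []
        else:
            body.append(line)
    flush()  # do not forget the last section
    return chunks
```

Because the split happens at headings rather than at a fixed token count, a coverage line and its conditions stay in the same chunk as long as they sit under the same article, and the header tells the model where the passage comes from.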
2. Retrieval, not generation
When the answer is wrong, the reflex is to blame the model. In most cases the problem is upstream: the right passages aren't being surfaced. A reranker, hybrid search (vectors + keywords) and metadata filters change the game far more than a better prompt.
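To make the hybrid-search idea concrete, here is a small sketch that merges a vector ranking and a keyword ranking with reciprocal rank fusion (RRF), then applies a metadata filter. RRF is one common way to combine rankings, not necessarily the one used in the project described here. The function names, the k=60 constant and the sample chunk ids are assumptions for illustration, and a reranker would typically run on the fused list afterwards.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids: score(d) = sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def filter_by_metadata(
    doc_ids: list[str], metadata: dict[str, dict], **required: str
) -> list[str]:
    """Keep only the chunks whose metadata matches every required field."""
    return [
        d
        for d in doc_ids
        if all(metadata[d].get(key) == value for key, value in required.items())
    ]


# Hypothetical results from a vector index and a keyword (e.g. BM25) index.
vector_hits = ["c1", "c2", "c3"]
keyword_hits = ["c3", "c1", "c4"]
metadata = {
    "c1": {"product": "home"},
    "c2": {"product": "auto"},
    "c3": {"product": "home"},
    "c4": {"product": "home"},
}

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
candidates = filter_by_metadata(fused, metadata, product="home")
```

A chunk that both retrievers rank well (c1, c3) rises above one that only a single retriever found, which is the behaviour a prompt change cannot give you.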
3. Evaluation, or nothing
Without an evaluation set, every "improvement" is a gamble. The minimum viable setup: 50 to 100 real questions with expected answers, replayed on every change (chunking, model, prompt). That's what lets you say "this version is better" with something other than intuition — and sleep at night after a production release.
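A replayable evaluation set can start very small. The sketch below assumes each question is stored with a few keywords the answer must contain; EvalCase, evaluate and the two toy pipelines are hypothetical names. Keyword matching is a deliberately crude scoring rule (short keywords can match by accident), and it could be replaced by human grading or a stricter comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_keywords: list[str]  # facts the answer must mention


def evaluate(answer_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Replay every case through a pipeline and return the pass rate (0.0 to 1.0)."""
    passed = 0
    for case in cases:
        answer = answer_fn(case.question).lower()
        if all(keyword.lower() in answer for keyword in case.expected_keywords):
            passed += 1
    return passed / len(cases) if cases else 0.0


# Two real-looking questions; a production set would hold 50 to 100 of them.
cases = [
    EvalCase("Is water damage covered?", ["yes", "5 days"]),
    EvalCase("Is flooding covered?", ["no"]),
]


# Stand-ins for two versions of the RAG pipeline (assumptions for this example).
def baseline(question: str) -> str:
    return "Yes."


def candidate(question: str) -> str:
    if "water" in question.lower():
        return "Yes, if reported within 5 days."
    return "No, flood zones are excluded."


baseline_score = evaluate(baseline, cases)
candidate_score = evaluate(candidate, cases)
```

Running the same cases against both versions turns "this version is better" into a comparison of two numbers.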
4. Guardrails
In production, you must plan for refusal: if the retrieved sources are weak, the system should say "I don't know" and route to a human rather than hallucinate coverage that doesn't exist. Fallbacks, logging every answer with its sources, and regular review of refused cases: that is where reliability is won, not in the choice of model.
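The refusal logic can be sketched in a few lines. This example assumes each retrieved passage carries a relevance score between 0 and 1 and that a fixed threshold separates weak from strong sources. The Passage class, the answer_with_guardrail function, the 0.5 threshold and the refusal wording are all illustrative assumptions to tune against your own evaluation set.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("rag.guardrail")

REFUSAL = "I don't know. Routing your question to an advisor."


@dataclass
class Passage:
    source: str
    score: float  # retrieval or reranker relevance, assumed to lie in [0, 1]
    text: str


def answer_with_guardrail(
    question: str,
    passages: list[Passage],
    generate: Callable[[str, list[Passage]], str],
    min_score: float = 0.5,
    min_passages: int = 1,
) -> dict:
    """Answer only when enough strong sources exist; otherwise refuse."""
    strong = [p for p in passages if p.score >= min_score]
    refused = len(strong) < min_passages
    answer = REFUSAL if refused else generate(question, strong)
    result = {
        "question": question,
        "answer": answer,
        "refused": refused,
        "sources": [p.source for p in strong],
    }
    # Log every answer with its sources so refused cases can be reviewed later.
    logger.info(json.dumps(result))
    return result
```

The generate callable stands in for the model call, so the guardrail stays independent of the model, and the logged records provide the material for the regular review of refused cases.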