A language model only knows what it was trained on, frozen at a cutoff date and blind to your private data. Retrieval-augmented generation (RAG) fixes that by fetching relevant documents at query time and handing them to the model as context, so answers are grounded in real sources instead of guesswork.
Why RAG exists
Three hard limits push teams toward retrieval-augmented generation. First, training data is stale the moment a model ships. Second, you cannot stuff an entire knowledge base into a prompt, even with a long context window, because cost and latency balloon and accuracy degrades when relevant facts get buried. Third, fine-tuning to inject facts is expensive, slow to update, and bad at citing where an answer came from.
RAG sidesteps all three. You keep your documents in a searchable store, retrieve only the handful of passages that matter for a given question, and let the model reason over them. When a policy changes, you re-index a document instead of retraining a model. And because the model sees the source text, it can quote and cite it.
How RAG actually works
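A RAG pipeline has two phases: an offline indexing phase, where documents are chunked, embedded, and stored, and an online query phase, where a question is embedded, matched against the index, and answered from the retrieved chunks.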
Chunking
Documents get split into smaller passages called chunks. You retrieve and feed chunks, not whole files, because a 40-page PDF is too coarse to match a specific question and too big to fit alongside other context. Common strategies:
- Fixed-size: split every 500 to 1000 tokens with a 10 to 20 percent overlap so sentences are not cut mid-thought.
- Structural: split on headings, paragraphs, or Markdown sections so each chunk is a coherent unit.
- Recursive: tools like LangChain's RecursiveCharacterTextSplitter try paragraph breaks first, then sentences, then characters, keeping related text together.
Chunk size is a real tradeoff. Small chunks retrieve precisely but lose surrounding context; large chunks carry context but dilute the match signal.
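The fixed-size strategy with overlap can be sketched in a few lines. This is a minimal illustration, not a production splitter: it counts whitespace-separated words as a stand-in for tokens (a real pipeline would use the embedding model's tokenizer), and the function name `chunk_fixed` and its parameters are hypothetical.

```python
def chunk_fixed(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to `size` words.

    Consecutive chunks share `overlap` words. Words are an assumption
    standing in for tokens; swap in a real tokenizer for accurate counts.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with 10 "words", windows of 4, overlap of 1.
sample = " ".join(str(i) for i in range(10))
chunks = chunk_fixed(sample, size=4, overlap=1)
# chunks == ["0 1 2 3", "3 4 5 6", "6 7 8 9"]
```

Each chunk repeats the last word of the previous one, which is the overlap doing its job at a small scale. Raising `size` trades match precision for context, as described above.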
Embeddings
Each chunk is passed through an embedding model, such as OpenAI's text-embedding-3-small or an open model like BGE, which turns the text into a vector, a list of numbers capturing its meaning. Passages about similar topics land close together in this vector space, so "how do I reset my password" sits near a chunk about account recovery even with zero shared words. The vectors go into a vector store such as Pinecone, Weaviate, Qdrant, or Postgres with the pgvector extension.
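To make "close together in vector space" concrete, here is a small sketch that ranks chunks by cosine similarity. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions and come from an embedding model), and the brute-force `top_k` helper is a hypothetical stand-in for what a vector database does with an index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(cid, score) for score, cid in scored[:k]]


# Hand-written toy vectors, NOT real embeddings.
index = {
    "account-recovery": [0.8, 0.2, 0.1],
    "pricing": [0.0, 0.1, 0.9],
    "login-help": [0.7, 0.3, 0.0],
}
query_vec = [0.9, 0.1, 0.0]  # pretend embedding of "how do I reset my password"

results = top_k(query_vec, index, k=2)
# The two account-related chunks rank above the pricing chunk.
```

The query must be embedded with the same model that produced the index vectors, otherwise the scores compare points from two unrelated spaces.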
Retrieval
At query time, the user's question is embedded with the same model, and the database returns the top-k nearest chunks by cosine similarity. Those chunks are pasted into the prompt template along with the question, and the LLM generates an answer grounded in them. Two upgrades matter in production:
- Hybrid search combines vector similarity with keyword search (BM25). Pure semantic search misses exact terms like error codes, SKUs, or names; keyword search catches them.
- Reranking takes the top 20 to 50 candidates and re-scores them with a cross-encoder like Cohere Rerank, pushing the truly relevant chunks to the top before they hit the model.
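One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as one reasonable option rather than the required approach. The sketch assumes each retriever has already returned an ordered list of chunk IDs; the IDs are hypothetical and `k=60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks
    ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from two retrievers for the same query.
vector_hits = ["a", "b", "c"]   # ordered by vector similarity
keyword_hits = ["c", "a", "d"]  # ordered by BM25 keyword score

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["a", "c", "b", "d"]
```

Chunks "a" and "c" appear in both lists and outrank "b" and "d", which each appear in only one. A reranker would then re-score the head of this fused list before the chunks reach the model.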
Common failure modes
Most broken RAG systems fail for a few recurring reasons, and naming them makes them fixable.
- Retrieval miss: the right chunk exists but never gets retrieved, often from bad chunking or a query that is phrased differently than the source. The model then answers from training data or hallucinates. Fix with hybrid search, better chunking, and query rewriting.
- Lost in the middle: when you cram many chunks into the prompt, models attend best to the start and end and skim the middle. Retrieve fewer, higher-quality chunks and rerank so the best ones land first.
- Chunk fragmentation: a fact split across two chunks means neither alone answers the question. Overlap and structural chunking reduce this.
- Stale index: documents change but embeddings do not get refreshed, so retrieval serves old answers. Wire re-indexing into your content pipeline.
- Embedding mismatch: indexing with one embedding model and querying with another, or mixing languages the model was not trained on, quietly wrecks similarity scores.
- No grounding guardrail: even with good context, models sometimes ignore it. Instruct the model to answer only from provided sources and to say "I don't know" when the context is thin.
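The grounding guardrail from the last item can be expressed directly in the prompt. This is a minimal sketch of one possible template; the wording, the numbered-source format, and the function name `build_grounded_prompt` are illustrative assumptions, and the instruction text usually needs tuning per model.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved chunks."""
    # Number each chunk so the model can cite it as [1], [2], ...
    sources = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources by number, like [1]. "
        "If the sources do not contain the answer, reply exactly: "
        "\"I don't know.\"\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "How long do refunds take?",
    [
        "Refunds are issued to the original payment method.",
        "Processing takes 5 to 7 business days.",
    ],
)
```

A prompt instruction lowers the odds that the model ignores its context but does not guarantee it, which is why groundedness still needs to be measured.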
What good looks like
A healthy retrieval-augmented generation system is measured, not vibed. Track retrieval quality separately from generation quality: did the right chunks come back (recall, precision), and did the model use them faithfully (groundedness, citation accuracy)? Tools like Ragas and TruLens score these. With the two tracked separately, when an answer is wrong you can tell whether retrieval failed or generation failed, and that single distinction is what turns a flaky demo into a system you can trust in production.
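The retrieval half of that measurement can be computed by hand. The sketch below assumes you have a labeled set of relevant chunk IDs for each test question; the IDs and the function name are hypothetical. It does not reproduce how Ragas or TruLens score anything.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision and recall over the top-k retrieved chunk IDs.

    precision = share of the top-k that is relevant
    recall    = share of all relevant chunks that made the top-k
    """
    top = retrieved[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical labeled example: three chunks are known to be relevant.
retrieved = ["c1", "c7", "c3", "c9"]  # retriever output, best first
relevant = {"c1", "c3", "c5"}

precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
# Two of the top three are relevant, and two of three relevant chunks were found.
```

Low recall on a question the system answered wrongly points at retrieval. High recall with a wrong answer points at generation.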