RAG (Retrieval-Augmented Generation), Explained

June 14, 2026 · 6 min read · By Roopesh LR
Why your LLM needs a memory it can search

A language model only knows what it was trained on, frozen at a cutoff date and blind to your private data. Retrieval-augmented generation (RAG) fixes that by fetching relevant documents at query time and handing them to the model as context, so answers are grounded in real sources instead of guesswork.

Why RAG exists

Three hard limits push teams toward retrieval-augmented generation. First, training data is stale the moment a model ships. Second, you cannot stuff an entire knowledge base into a prompt, even with a long context window, because cost and latency balloon and accuracy degrades when relevant facts get buried. Third, fine-tuning to inject facts is expensive, slow to update, and bad at citing where an answer came from.

RAG sidesteps all three. You keep your documents in a searchable store, retrieve only the handful of passages that matter for a given question, and let the model reason over them. When a policy changes, you re-index a document instead of retraining a model. And because the model sees the source text, it can quote and cite it.

How RAG actually works

A RAG pipeline has two phases: an offline indexing phase and an online query phase.

Chunking

Documents get split into smaller passages called chunks. You retrieve and feed chunks, not whole files, because a 40-page PDF is too coarse to match a specific question and too big to fit alongside other context. Common strategies:

Chunk size is a real tradeoff. Small chunks retrieve precisely but lose surrounding context; large chunks carry context but dilute the match signal.
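To make the tradeoff concrete, here is a minimal sketch of one simple approach: fixed-size character windows with overlap. The function name `chunk_text` and its default sizes are illustrative assumptions, not a recommendation from any particular library. Real pipelines often split on tokens, sentences, or headings instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the overlap repeats the tail of one chunk at the
    head of the next, so a sentence cut at a boundary still appears whole
    in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(piece)
```

Shrinking `chunk_size` gives tighter matches at the cost of context. Raising `overlap` recovers some of that context but stores more duplicated text.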

Embeddings

Each chunk is passed through an embedding model, such as one of OpenAI's text-embedding-3 models or an open model like BGE, which turns the text into a vector, a list of numbers capturing its meaning. Passages about similar topics land close together in this vector space, so "how do I reset my password" sits near a chunk about account recovery even with zero shared words. The vectors go into a vector store such as Pinecone, Weaviate, Qdrant, or Postgres with the pgvector extension.
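"Close together" usually means a high cosine similarity between two vectors. The sketch below computes it with the standard library only. The three-dimensional vectors are hand-made stand-ins for illustration, not output from a real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for a zero vector, which has no direction
    return dot / (norm_a * norm_b)


# Made-up vectors (assumption): pretend these came from an embedding model.
query_vec = [0.9, 0.1, 0.0]          # "how do I reset my password"
recovery_vec = [0.8, 0.2, 0.1]       # chunk about account recovery
billing_vec = [0.1, 0.1, 0.9]        # chunk about billing

sim_recovery = cosine_similarity(query_vec, recovery_vec)
sim_billing = cosine_similarity(query_vec, billing_vec)

if __name__ == "__main__":
    print(f"recovery: {sim_recovery:.3f}")
    print(f"billing:  {sim_billing:.3f}")
```

With these toy numbers the account recovery chunk scores far higher than the billing chunk, which is the behavior a real embedding model aims for on real text.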

Retrieval

At query time, the user's question is embedded with the same model, and the database returns the top-k nearest chunks, typically ranked by cosine similarity. Those chunks are inserted into the prompt template along with the question, and the LLM generates an answer grounded in them. Two upgrades matter in production:
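Before any upgrades, the basic query step described above can be sketched in a few lines. This is a brute-force illustration under stated assumptions: the names `top_k` and `build_prompt`, the in-memory `index` of (text, vector) pairs, the toy vectors, and the prompt wording are all hypothetical. A real system would call an embedding model and a vector database instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Score every chunk against the query and keep the k best (brute force)."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks so the model can cite them."""
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy index (assumption): vectors are hand-made, not real embeddings.
index = [
    ("To recover your account, use the emailed reset link.", [0.8, 0.2, 0.1]),
    ("Invoices are issued on the first of each month.", [0.1, 0.1, 0.9]),
    ("Passwords must be at least 12 characters.", [0.7, 0.3, 0.0]),
]

if __name__ == "__main__":
    question = "How do I reset my password?"
    question_vec = [0.9, 0.1, 0.0]  # stand-in for embedding the question
    print(build_prompt(question, top_k(question_vec, index, k=2)))
```

The resulting string is what gets sent to the LLM. Numbering the sources is one simple way to let the model cite which chunk supported each claim.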

Common failure modes

Most broken RAG systems fail for a few recurring reasons, and naming them makes them fixable.

What good looks like

A healthy retrieval-augmented generation system is measured, not vibed. Track retrieval quality separately from generation quality: did the right chunks come back (recall, precision), and did the model use them faithfully (groundedness, citation accuracy)? Tools like Ragas and TruLens score these. When an answer is wrong, you can tell whether retrieval failed or generation failed, and that single distinction is what turns a flaky demo into a system you can trust in production.
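The retrieval half of that measurement is simple enough to compute by hand. The sketch below is a hand-rolled illustration, not the Ragas or TruLens API. It assumes you have a small labeled set that records which chunk IDs are relevant to each question; the IDs shown are made up.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision@k and recall@k for one question.

    retrieved: chunk IDs in ranked order, best first.
    relevant:  chunk IDs a human judged relevant (assumed ground truth).
    """
    top = retrieved[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant)
    precision = hits / len(top) if top else 0.0      # share of returned chunks that were right
    recall = hits / len(relevant) if relevant else 0.0  # share of right chunks that were returned
    return precision, recall


if __name__ == "__main__":
    retrieved = ["c1", "c7", "c3", "c9"]   # hypothetical ranked results
    relevant = {"c1", "c3", "c5"}          # hypothetical labels
    precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
    print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

Low recall points at chunking, embeddings, or the value of k. High recall paired with wrong answers points at the generation step instead.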

Go deeper

AI CEO — How AI Will Replace the Tech Industry

This is the surface. The full argument, with the data, the case studies, and the playbook, is in Roopesh LR's book, AI CEO.

© 2026 Roopesh LR · AI CEO · aiceo.me