The Four Stages of RAG

A typical RAG pipeline consists of four sequential stages that transform a query into a grounded response.

1. Chunking: Large documents are segmented into smaller, self-contained pieces (paragraphs, passages) for indexing. Fine-grained chunks improve retrieval precision: a query surfaces the relevant fragment rather than an entire lengthy document. Chunk size trades context completeness against specificity: larger chunks preserve more surrounding context, while smaller chunks match a query more precisely.

2. Embedding: Each chunk is transformed into a high-dimensional vector that encodes its semantic content, typically via a transformer-based bi-encoder. These embeddings become keys in a vector index supporting efficient (often approximate) nearest-neighbor search. At query time, the query is embedded into the same space, and the chunks whose vectors lie closest to it are retrieved as candidates.

3. (Re)ranking: Initial retrieval via embedding similarity is fast but coarse. A re-ranker (often a cross-encoder) scores each retrieved chunk jointly with the query and produces refined relevance scores. This two-stage approach (fast retrieval over the whole index, then more accurate re-ranking of a small candidate set) balances speed and precision.

4. Generation: The generator (typically a seq2seq or decoder-only language model) produces output conditioned on the query and the top-ranked passages. The model attends to the retrieved content, copying or synthesizing relevant information into a coherent response.
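
The four stages above can be sketched end to end in a few dozen lines of Python. This is a toy illustration under loud assumptions, not a reference implementation: the "embedding" is a bag-of-words count vector standing in for a learned bi-encoder, the "re-ranker" is a token-overlap score standing in for a cross-encoder, the index is a brute-force list rather than a real vector index, and generation stops at assembling the prompt that a generator would receive. All function names (chunk, embed, retrieve, rerank, build_prompt, run_pipeline) are hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Stage 1: split on blank lines, then cap each piece at max_words words."""
    chunks = []
    for para in re.split(r"\n\s*\n", text.strip()):
        words = para.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks


def tokenize(text):
    """Lowercase and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Stage 2 stand-in: sparse bag-of-words counts (NOT a learned bi-encoder)."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=3):
    """Fast, coarse first pass: brute-force nearest neighbors by cosine similarity."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def rerank(query, candidates, top_n=2):
    """Stage 3 stand-in: score each (query, chunk) pair jointly.

    Here the score is the fraction of query tokens present in the chunk;
    a real system would typically call a cross-encoder instead.
    """
    q_tokens = set(tokenize(query))

    def score(text):
        if not q_tokens:
            return 0.0
        return len(q_tokens & set(tokenize(text))) / len(q_tokens)

    return sorted(candidates, key=score, reverse=True)[:top_n]


def build_prompt(query, passages):
    """Stage 4 input: the text a generator would be conditioned on."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def run_pipeline(document, query):
    """Chunk, embed, retrieve, re-rank, and assemble the generator prompt."""
    index = [(c, embed(c)) for c in chunk(document)]
    candidates = retrieve(query, index, k=3)
    passages = rerank(query, candidates, top_n=2)
    return build_prompt(query, passages)
```

Calling run_pipeline(document, query) returns a prompt string whose numbered context entries are the re-ranked chunks. Swapping in real components would mean replacing embed with a bi-encoder model, the list-based index with a vector index, and rerank with a cross-encoder, while the overall control flow stays the same.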

Related: 05-atom—rag-definition