RAG Technical Architecture

Overview

Retrieval-Augmented Generation (RAG) is a hybrid AI architecture that combines a neural retriever with a neural generator, creating a system with both parametric memory (model weights) and non-parametric memory (external corpus).

Why This Matters

Pure language models store all knowledge in parameters, which means knowledge is implicit, hard to update, and impossible to verify. RAG separates “what to know” from “how to reason,” allowing systems to be factually grounded, updateable without retraining, and capable of citing sources.

Core Components

The Retriever takes a query and returns the most relevant passages from a corpus. Modern retrievers use dense embeddings: a bi-encoder maps queries and documents into the same vector space, with relevance measured by similarity (typically dot product). Dense retrieval often outperforms traditional keyword search because it captures semantic similarity, not just lexical overlap.
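The dot-product scoring can be sketched in a few lines. This is a toy illustration, not a production retriever: in a real system the vectors come from a trained bi-encoder (e.g. DPR or a sentence-transformer), while here random vectors stand in so the snippet is self-contained.

```python
import numpy as np

# Random stand-ins for learned embeddings: 100 documents, 64 dimensions each.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 64))

def retrieve(query_embedding, doc_embeddings, k=3):
    """Return indices of the top-k documents by dot-product similarity."""
    scores = doc_embeddings @ query_embedding  # one relevance score per document
    return np.argsort(scores)[::-1][:k]        # highest-scoring first

query_embedding = rng.normal(size=64)
top_k = retrieve(query_embedding, doc_embeddings, k=3)
```

Because queries and documents live in the same space, a single matrix-vector product scores the entire corpus; at scale this is replaced by an approximate nearest-neighbor index.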

The Generator is a sequence-to-sequence model (typically transformer-based, like BART or T5) that produces output conditioned on both the query and retrieved documents. It learns to extract, synthesize, and rephrase information from retrieved context.

The Fusion Strategy determines how multiple retrieved documents are combined:

  • Early fusion: Concatenate all documents as input
  • Late fusion: Generate with each document separately, then marginalize
  • Fusion-in-Decoder: Encode separately, attend jointly during decoding
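The difference between the first two strategies is visible in how the generator is called. In this sketch, `generate` is a hypothetical placeholder for a seq2seq model call (e.g. to BART or T5), not a real API:

```python
def generate(prompt: str) -> str:
    # Placeholder: a real system would invoke a seq2seq model here.
    return f"<answer conditioned on {len(prompt)} chars of context>"

query = "When was the Eiffel Tower built?"
passages = ["Passage A ...", "Passage B ...", "Passage C ..."]

# Early fusion: a single generator call over the concatenated context.
early = generate(query + "\n\n" + "\n\n".join(passages))

# Late fusion: one generator call per passage; the per-passage outputs are
# then combined (real RAG marginalizes over per-passage token probabilities).
late = [generate(query + "\n\n" + p) for p in passages]
```

Early fusion is limited by context length; late fusion scales to more passages but pays one forward pass per document, which is the trade-off Fusion-in-Decoder splits by encoding separately and decoding jointly.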

The Knowledge Corpus is the external store of documents (Wikipedia articles, internal documents, web pages). The corpus can be updated without retraining the model, keeping the system's knowledge current.

How It Works

  1. Index Time: Chunk documents, embed each chunk, store embeddings in a vector database
  2. Query Time: Embed the query, retrieve top-k similar chunks, optionally re-rank
  3. Generation Time: Condition the generator on query + retrieved context, produce output
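The three stages above can be sketched end to end. The word-averaging `embed` function is a toy stand-in for a real encoder model, and the vector "database" is just a NumPy matrix:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy embedding: average one deterministic random vector per word."""
    vecs = [np.random.default_rng(zlib.crc32(w.encode())).normal(size=dim)
            for w in text.lower().split()]
    return np.mean(vecs, axis=0)

# 1. Index time: chunk documents, embed each chunk, store the vectors.
chunks = [
    "the eiffel tower is in paris",
    "the great wall is in china",
    "mount fuji is in japan",
]
index = np.stack([embed(c) for c in chunks])

# 2. Query time: embed the query and score every chunk by dot product.
query_vec = embed("where is the eiffel tower")
scores = index @ query_vec
best = int(np.argmax(scores))

# 3. Generation time: a real system would now prompt the generator with
# the query plus the retrieved chunk; here we just surface that context.
context = chunks[best]
```

Re-ranking, omitted here, would insert a more expensive cross-encoder between steps 2 and 3 to reorder the top-k candidates.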

The retriever and generator can be trained jointly (end-to-end) or separately. Joint training allows the retriever to learn what makes documents useful for the downstream task, not just topically relevant.
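Joint training with late fusion can be written as a marginal likelihood, as in the original RAG formulation (Lewis et al., 2020), where the retriever $p_\eta$ is treated as a latent-variable distribution over documents $z$:

```latex
p(y \mid x) \approx \sum_{z \,\in\, \mathrm{top\text{-}}k\left(p_\eta(\cdot \mid x)\right)}
    p_\eta(z \mid x) \; p_\theta(y \mid x, z)
```

Because the retrieval scores $p_\eta(z \mid x)$ appear inside the sum, gradients from the generation loss flow back into the retriever, which is what lets it learn task usefulness rather than mere topical relevance.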

When to Use RAG

RAG excels when:

  • Knowledge changes faster than models can be retrained
  • Factual accuracy matters and needs to be verifiable
  • Domain knowledge is concentrated in specific documents
  • Users need to see sources or provenance

RAG struggles when:

  • Queries require reasoning across many disparate documents
  • The relevant knowledge isn’t in document form
  • Latency constraints are extremely tight
  • The retrieval corpus is poorly maintained or irrelevant

Limitations

  • Garbage in, garbage out: If retrieval returns irrelevant or incorrect documents, generation suffers
  • Retrieval bottleneck: The best generator can’t compensate for poor retrieval
  • Context limits: Number of retrieved documents is bounded by model context length
  • Pipeline complexity: More moving parts than pure LLM inference

Variants

  • RETRO: Retrieval during both training and inference
  • Atlas: RAG optimized for few-shot learning
  • Agentic RAG: LLM agents that decide when and what to retrieve

Related: 07-molecule—vectors-vs-graphs