RAG Technical Architecture

Overview

Retrieval-Augmented Generation (RAG) is a hybrid AI architecture that combines a neural retriever with a neural generator, creating a system with both parametric memory (model weights) and non-parametric memory (external corpus).

Why This Matters

Pure language models store all knowledge in parameters, which means knowledge is implicit, hard to update, and impossible to verify. RAG separates “what to know” from “how to reason,” allowing systems to be factually grounded, updateable without retraining, and capable of citing sources.

Core Components

The Retriever takes a query and returns the most relevant passages from a corpus. Modern retrievers use dense embeddings: a bi-encoder maps queries and documents into the same vector space, with relevance measured by similarity (typically dot product). Dense retrieval often outperforms traditional keyword search because it captures semantic similarity, not just lexical overlap.
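The dot-product scoring can be sketched in a few lines. This is a toy illustration, not a production retriever: in a real system the vectors come from a trained bi-encoder (e.g. DPR or a sentence-transformer), while here random vectors stand in so the snippet is self-contained.

```python
import numpy as np

# Random stand-ins for learned embeddings: 100 documents, 64 dimensions each.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 64))

def retrieve(query_embedding, doc_embeddings, k=3):
    """Return indices of the top-k documents by dot-product similarity."""
    scores = doc_embeddings @ query_embedding  # one relevance score per document
    return np.argsort(scores)[::-1][:k]        # highest-scoring first

query_embedding = rng.normal(size=64)
top_k = retrieve(query_embedding, doc_embeddings, k=3)
```

Because queries and documents live in the same space, a single matrix-vector product scores the entire corpus; at scale this is replaced by an approximate nearest-neighbor index.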

The Generator is a sequence-to-sequence model (typically transformer-based, like BART or T5) that produces output conditioned on both the query and retrieved documents. It learns to extract, synthesize, and rephrase information from retrieved context.

The Fusion Strategy determines how multiple retrieved documents are combined:

  • Early fusion: Concatenate all documents as input
  • Late fusion: Generate with each document separately, then marginalize
  • Fusion-in-Decoder: Encode separately, attend jointly during decoding
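The difference between the first two strategies is visible in how the generator is called. In this sketch, `generate` is a hypothetical placeholder for a seq2seq model call (e.g. to BART or T5), not a real API:

```python
def generate(prompt: str) -> str:
    # Placeholder: a real system would invoke a seq2seq model here.
    return f"<answer conditioned on {len(prompt)} chars of context>"

query = "When was the Eiffel Tower built?"
passages = ["Passage A ...", "Passage B ...", "Passage C ..."]

# Early fusion: a single generator call over the concatenated context.
early = generate(query + "\n\n" + "\n\n".join(passages))

# Late fusion: one generator call per passage; the per-passage outputs are
# then combined (real RAG marginalizes over per-passage token probabilities).
late = [generate(query + "\n\n" + p) for p in passages]
```

Early fusion is limited by context length; late fusion scales to more passages but pays one forward pass per document, which is the trade-off Fusion-in-Decoder splits by encoding separately and decoding jointly.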

The Knowledge Corpus is the external store of documents (Wikipedia articles, internal documents, web pages). The corpus can be updated without retraining the model, keeping the system's knowledge current.

How It Works

  1. Index Time: Chunk documents, embed each chunk, store embeddings in a vector database
  2. Query Time: Embed the query, retrieve top-k similar chunks, optionally re-rank
  3. Generation Time: Condition the generator on query + retrieved context, produce output
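The three stages above can be sketched end to end. The word-averaging `embed` function is a toy stand-in for a real encoder model, and the vector "database" is just a NumPy matrix:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy embedding: average one deterministic random vector per word."""
    vecs = [np.random.default_rng(zlib.crc32(w.encode())).normal(size=dim)
            for w in text.lower().split()]
    return np.mean(vecs, axis=0)

# 1. Index time: chunk documents, embed each chunk, store the vectors.
chunks = [
    "the eiffel tower is in paris",
    "the great wall is in china",
    "mount fuji is in japan",
]
index = np.stack([embed(c) for c in chunks])

# 2. Query time: embed the query and score every chunk by dot product.
query_vec = embed("where is the eiffel tower")
scores = index @ query_vec
best = int(np.argmax(scores))

# 3. Generation time: a real system would now prompt the generator with
# the query plus the retrieved chunk; here we just surface that context.
context = chunks[best]
```

Re-ranking, omitted here, would insert a more expensive cross-encoder between steps 2 and 3 to reorder the top-k candidates.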

The retriever and generator can be trained jointly (end-to-end) or separately. Joint training allows the retriever to learn what makes documents useful for the downstream task, not just topically relevant.
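Joint training with late fusion can be written as a marginal likelihood, as in the original RAG formulation (Lewis et al., 2020), where the retriever $p_\eta$ is treated as a latent-variable distribution over documents $z$:

```latex
p(y \mid x) \approx \sum_{z \,\in\, \mathrm{top\text{-}}k\left(p_\eta(\cdot \mid x)\right)}
    p_\eta(z \mid x) \; p_\theta(y \mid x, z)
```

Because the retrieval scores $p_\eta(z \mid x)$ appear inside the sum, gradients from the generation loss flow back into the retriever, which is what lets it learn task usefulness rather than mere topical relevance.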

When to Use RAG

RAG excels when:

  • Knowledge changes faster than models can be retrained
  • Factual accuracy matters and needs to be verifiable
  • Domain knowledge is concentrated in specific documents
  • Users need to see sources or provenance

RAG struggles when:

  • Queries require reasoning across many disparate documents
  • The relevant knowledge isn’t in document form
  • Latency constraints are extremely tight
  • The retrieval corpus is poorly maintained or irrelevant

Limitations

  • Garbage in, garbage out: If retrieval returns irrelevant or incorrect documents, generation suffers
  • Retrieval bottleneck: The best generator can’t compensate for poor retrieval
  • Context limits: Number of retrieved documents is bounded by model context length
  • Pipeline complexity: More moving parts than pure LLM inference

Variants

  • RETRO: Retrieval during both training and inference
  • Atlas: RAG optimized for few-shot learning
  • Agentic RAG: LLM agents that decide when and what to retrieve

Related: 07-molecule—vectors-vs-graphs