RAG Fusion Strategies: Early vs. Late

RAG systems must decide when and how to integrate information from multiple retrieved documents. Two dominant strategies have emerged.

Early fusion concatenates all retrieved passages into a single extended input before generation, so the model attends to everything at once. It is simpler to implement, but it struggles with many documents because of context length limits. The generator performs its own implicit weighting of the passages.
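A minimal sketch of early fusion, to make the context-length trade-off concrete. The function name, the prompt layout, and the whitespace-based token count are all assumptions made for illustration; a real system would use the generator's own tokenizer and prompt template.

```python
def build_early_fusion_prompt(question, passages, max_passage_tokens=512):
    """Concatenate passages (best first) into one prompt under a token budget.

    Hypothetical helper: the budget covers passage text only, and tokens are
    approximated by whitespace splitting rather than a real tokenizer.
    """
    def n_tokens(text):
        return len(text.split())

    kept = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        block = f"[{i}] {passage}"
        cost = n_tokens(block)
        if used + cost > max_passage_tokens:
            break  # every remaining passage is dropped once the budget is hit
        kept.append(block)
        used += cost

    prompt = "\n".join([f"Question: {question}", "Context:", *kept, "Answer:"])
    return prompt, len(kept)


prompt, n_kept = build_early_fusion_prompt(
    "Where is the Eiffel Tower?",
    ["a b c", "d e f", "g h i"],
    max_passage_tokens=8,
)
print(n_kept)
print(prompt)
```

Passages that do not fit are simply lost to the generator, which is why retrieval order matters more as the number of documents grows.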

Late fusion considers each retrieved document separately, generating candidate outputs for each, then marginalizes over (combines) the results. This is the original RAG approach: it computes P(y|x) by summing the document-conditioned probabilities P(y|x,z) over the retrieved documents z, each weighted by its retrieval score. It is more principled probabilistically, but computationally heavier, since the generator runs once per document.
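The marginalization can be sketched in a few lines. This assumes the retrieval scores are raw similarity values turned into weights with a softmax over the retrieved set, and that each document yields a small dictionary of answer probabilities; the function name and the toy numbers are illustrative, not taken from any particular implementation.

```python
import math
from collections import defaultdict


def late_fusion(retrieval_scores, per_doc_answer_probs):
    """Return P(y|x) = sum over z of P(z|x) * P(y|x,z).

    retrieval_scores: one raw score per retrieved document (assumed logits).
    per_doc_answer_probs: one {answer: probability} dict per document.
    """
    # Softmax over the retrieved documents gives the weights P(z|x).
    top = max(retrieval_scores)
    exps = [math.exp(s - top) for s in retrieval_scores]
    total = sum(exps)
    doc_weights = [e / total for e in exps]

    marginal = defaultdict(float)
    for weight, answer_probs in zip(doc_weights, per_doc_answer_probs):
        for answer, prob in answer_probs.items():
            marginal[answer] += weight * prob
    return dict(marginal)


# Two equally scored documents that disagree about the answer.
fused = late_fusion(
    [0.0, 0.0],
    [{"Paris": 0.9, "Lyon": 0.1}, {"Paris": 0.2, "Lyon": 0.8}],
)
print(fused)
```

With equal retrieval weights, the two documents contribute half each, so "Paris" ends up with about 0.55 and "Lyon" with about 0.45: the disagreement stays visible in the output distribution instead of being resolved inside the model.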

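Fusion-in-Decoder (FiD) represents a middle path: the encoder processes each passage separately, but the decoder attends to all of the encodings jointly. Encoder self-attention cost therefore grows linearly with the number of passages rather than quadratically, so FiD scales better than pure early fusion while still allowing cross-document reasoning.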
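A rough way to see the scaling difference is to count the query-key pairs that encoder self-attention has to score. The sketch below assumes dense (full) attention and passages of equal length; the function name and the strategy labels are made up for this example, and the count ignores the decoder, which attends over all passage tokens in both cases.

```python
def encoder_attention_pairs(n_passages, passage_len, strategy):
    """Count query-key pairs scored by dense encoder self-attention."""
    if strategy == "early":
        # One long sequence: every token attends to every other token.
        total_len = n_passages * passage_len
        return total_len * total_len
    if strategy == "fid":
        # Each passage is encoded on its own: quadratic only within a passage.
        return n_passages * passage_len * passage_len
    raise ValueError(f"unknown strategy: {strategy!r}")


early = encoder_attention_pairs(10, 100, "early")
fid = encoder_attention_pairs(10, 100, "fid")
print(early, fid, early // fid)
```

For 10 passages of 100 tokens each, the concatenated input scores 1,000,000 pairs against 100,000 for separate encoding, a gap equal to the number of passages.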

The choice affects how conflicting evidence is handled: early fusion forces the model to reconcile conflicts internally; late fusion distributes probability across alternatives.

Related: 05-atom—rag-definition, 05-molecule—attention-mechanism-concept