The Upstream Evaluation Gap

RAG systems depend critically on preprocessing decisions (chunking and embedding), yet these upstream components receive far less evaluation attention than retrieval and generation.

Chunking determines what units of information are retrievable. Too large, and relevant content gets diluted by irrelevant context. Too small, and coherent ideas fragment across multiple chunks.
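The tradeoff can be seen with a minimal sketch. This is an illustrative fixed-size word chunker, not any particular library's implementation; the document text and chunk sizes are made up for demonstration.

```python
def chunk_by_words(text: str, chunk_size: int) -> list[str]:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

doc = ("The mitochondrion produces ATP through oxidative phosphorylation. "
       "This process occurs in the inner membrane. "
       "Unrelated trivia about cell walls follows here.")

small = chunk_by_words(doc, 6)   # coherent ideas fragment across chunk boundaries
large = chunk_by_words(doc, 60)  # the whole doc becomes one diluted chunk
```

With a size of 6 words, the claim about oxidative phosphorylation splits mid-sentence; with 60, the relevant sentences share a chunk with unrelated trivia.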

Embedding quality determines whether semantically related content clusters appropriately. A poor embedding model can doom retrieval regardless of the retrieval algorithm’s sophistication.
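Concretely, retrieval typically ranks chunks by cosine similarity between embeddings, so a model that places unrelated texts near the query will surface the wrong chunk even under exact nearest-neighbor search. The vectors below are hypothetical stand-ins for model outputs:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query      = [1.0, 0.1, 0.0]   # hypothetical embedding of the user query
relevant   = [0.9, 0.2, 0.1]   # a good model places the relevant chunk nearby...
irrelevant = [0.0, 0.1, 1.0]   # ...and the irrelevant chunk far away

# Retrieval succeeds only if this ordering holds for real model outputs.
good_ordering = cosine(query, relevant) > cosine(query, irrelevant)
```

If a weak model collapses these distinctions, no retrieval algorithm can recover the ordering.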

The gap exists because upstream evaluation is indirect: chunking and embedding quality manifest primarily through their downstream effects on retrieval metrics. Direct intrinsic evaluation is harder because there is no ground truth for “optimal chunk boundaries.”
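In practice, then, chunking strategies are compared indirectly: hold the rest of the pipeline fixed and score each strategy by a downstream retrieval metric such as hit rate on a QA set. The sketch below uses a toy keyword-overlap retriever and invented data purely to show the shape of such an evaluation:

```python
def retrieve(query: str, chunks: list[str]) -> str:
    """Toy retriever: return the chunk with the most query-word overlap."""
    q = set(query.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

def hit_rate(qa_pairs: list[tuple[str, str]], chunks: list[str]) -> float:
    """Fraction of questions whose gold answer appears in the retrieved chunk."""
    hits = sum(answer.lower() in retrieve(question, chunks).lower()
               for question, answer in qa_pairs)
    return hits / len(qa_pairs)

qa = [("where does oxidative phosphorylation occur", "inner membrane")]

# Strategy A keeps the answer with its context; strategy B fragments them.
chunks_a = ["Oxidative phosphorylation occurs in the inner membrane."]
chunks_b = ["Oxidative phosphorylation", "occurs in the inner membrane."]
```

Here strategy B's fragmentation makes the retriever land on a chunk that lacks the answer, so its hit rate drops to zero; the chunking flaw shows up only in the downstream metric, never as an intrinsic score.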

Benchmarks like MTEB and MMTEB address embedding evaluation, but chunking strategy evaluation remains underdeveloped. The field has standardized on evaluating what’s easy to measure rather than what’s foundational to system performance.

Related: [None yet]