The Actionable Metrics Pattern
Context
You’re evaluating a complex system with multiple interacting components. You can measure overall quality, but when quality degrades, you don’t know which component to fix.
Problem
Aggregate quality scores tell you that something is wrong without telling you where or what to do about it. This leads to trial-and-error debugging: try changing things until the score improves. Wasteful, slow, and prone to local maxima.
Solution
Design evaluation metrics that:
- Decompose by component: Each subsystem gets its own metric
- Map to interventions: A low score on metric X implies action Y
- Are independently measurable: You can evaluate each without running the full pipeline
- Distinguish failure modes: Different ways to fail produce different metric signatures
The RAGBench TRACe framework exemplifies this: low Relevance points to retriever issues, low Utilization to generator prompting, and low Adherence to missing or weak hallucination guardrails.
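A minimal sketch of what that score-to-intervention mapping can look like, assuming a hypothetical `ComponentScores` structure and an illustrative 0.7 threshold. The metric names echo the TRACe decomposition above, but the dispatch logic is an illustration, not part of RAGBench:

```python
from dataclasses import dataclass


@dataclass
class ComponentScores:
    """Hypothetical per-component scores for one evaluation run."""
    relevance: float     # did the retriever surface the right context?
    utilization: float   # did the generator actually use that context?
    adherence: float     # did the answer stay grounded in that context?


def suggest_interventions(scores: ComponentScores, threshold: float = 0.7) -> list[str]:
    """Map each low component score to the intervention it implies."""
    actions = []
    if scores.relevance < threshold:
        actions.append("tune the retriever (embeddings, chunking, top-k)")
    if scores.utilization < threshold:
        actions.append("revise generator prompting to use retrieved context")
    if scores.adherence < threshold:
        actions.append("add or strengthen hallucination guardrails")
    return actions or ["component scores look healthy; check system-level failures"]


print(suggest_interventions(ComponentScores(relevance=0.9, utilization=0.5, adherence=0.8)))
# -> ['revise generator prompting to use retrieved context']
```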
Consequences
Benefits:
- Targeted debugging instead of random experimentation
- Clear ownership when different teams own different components
- A basis for systematic improvement roadmaps
Tradeoffs:
- Requires upfront work to design decomposed metrics
- May miss emergent failures that only appear at the system level
- Metrics are only actionable if validated against real interventions: a low score on X must reliably predict that doing Y helps
Beyond RAG
This pattern applies to any multi-component system:
- ML pipelines (data quality → feature engineering → model → postprocessing)
- User journeys (discovery → engagement → conversion → retention)
- Software systems (per-service latency, throughput, and error rates)
The pattern: don’t just measure outcomes, measure components in ways that tell you what to change.
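Concretely, here is a minimal sketch of the same idea for a hypothetical ML pipeline: a registry pairing each component's metric with the intervention a low score implies. Every component name, input field, and the 0.9 threshold is an illustrative assumption:

```python
from typing import Callable

# Each component registers its own metric and the intervention a low score
# implies. All names and thresholds here are hypothetical.
METRIC_REGISTRY: dict[str, tuple[Callable[[dict], float], str]] = {
    "data_quality":   (lambda run: run["valid_rows"] / run["total_rows"],
                       "fix upstream data validation"),
    "model_fit":      (lambda run: run["holdout_accuracy"],
                       "retrain or re-tune the model"),
    "postprocessing": (lambda run: run["output_schema_pass_rate"],
                       "repair the output formatting layer"),
}


def diagnose(run: dict, threshold: float = 0.9) -> dict[str, str]:
    """Return the implied intervention for every component scoring below threshold."""
    return {
        name: action
        for name, (metric, action) in METRIC_REGISTRY.items()
        if metric(run) < threshold
    }


run = {"valid_rows": 980, "total_rows": 1000,
       "holdout_accuracy": 0.82, "output_schema_pass_rate": 0.99}
print(diagnose(run))  # -> {'model_fit': 'retrain or re-tune the model'}
```

The registry is the point: each entry answers "what do we measure?" and "what do we do when it's low?" in one place, which is what makes the metric actionable.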
Related: [None yet]