The Actionable Metrics Pattern

Context

You’re evaluating a complex system with multiple interacting components. You can measure overall quality, but when quality degrades, you don’t know which component to fix.

Problem

Aggregate quality scores tell you that something is wrong without telling you where or what to do about it. This leads to trial-and-error debugging: change things until the score improves. That approach is wasteful, slow, and prone to getting stuck at local maxima.

Solution

Design evaluation metrics that:

  1. Decompose by component: Each subsystem gets its own metric
  2. Map to interventions: A low score on metric X implies action Y
  3. Are independently measurable: You can evaluate each without running the full pipeline
  4. Distinguish failure modes: Different ways to fail produce different metric signatures

The RAGBench TRACe framework exemplifies this: low Relevance points to retriever issues, low Utilization to generator prompting, low Adherence to weak hallucination guardrails.
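
To make the mapping concrete, here is a minimal sketch of the pattern in Python. Everything in it is illustrative: the ratio-based scorers, the thresholds, and the annotation fields (relevant_chunks, used_chunks, and so on) are hypothetical stand-ins, not RAGBench's actual metric definitions.

    # Sketch: decomposed metrics that map directly to interventions.
    # Scorers and thresholds are illustrative placeholders.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    Example = Dict[str, int]  # per-example annotation counts

    def relevance_score(ex: Example) -> float:
        # share of retrieved chunks that were actually relevant
        return ex["relevant_chunks"] / max(ex["retrieved_chunks"], 1)

    def utilization_score(ex: Example) -> float:
        # share of relevant chunks the generator actually drew on
        return ex["used_chunks"] / max(ex["relevant_chunks"], 1)

    def adherence_score(ex: Example) -> float:
        # share of answer claims supported by the retrieved context
        return ex["supported_claims"] / max(ex["total_claims"], 1)

    @dataclass
    class ComponentMetric:
        name: str                          # which subsystem it covers
        score: Callable[[Example], float]  # independently computable
        threshold: float                   # below this, intervene
        intervention: str                  # the action a low score implies

    METRICS = [
        ComponentMetric("relevance", relevance_score, 0.7,
                        "tune the retriever (chunking, embeddings, top-k)"),
        ComponentMetric("utilization", utilization_score, 0.5,
                        "rework the generator prompt to use the context"),
        ComponentMetric("adherence", adherence_score, 0.9,
                        "strengthen hallucination guardrails"),
    ]

    def diagnose(ex: Example) -> List[str]:
        # each failing metric maps directly to an action (property 2)
        return [m.intervention for m in METRICS if m.score(ex) < m.threshold]

    # One annotated example: retrieval is fine, but the generator ignores
    # most of the relevant context.
    ex = {"retrieved_chunks": 10, "relevant_chunks": 7, "used_chunks": 2,
          "supported_claims": 9, "total_claims": 10}
    print(diagnose(ex))  # ['rework the generator prompt to use the context']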

Consequences

Benefits:

  • Targeted debugging instead of random experimentation
  • Clear ownership when different teams own different components
  • Systematic improvement roadmaps

Tradeoffs:

  • Requires upfront work to design decomposed metrics
  • May miss emergent failures that only appear at the system level
  • Each metric must be validated against its mapped intervention: a low score should reliably predict that the fix helps

Beyond RAG

This pattern applies to any multi-component system:

  • ML pipelines (data quality → feature engineering → model → postprocessing)
  • User journeys (discovery → engagement → conversion → retention)
  • Software systems (latency → throughput → error rate per service)

The pattern: don’t just measure outcomes; measure components in ways that tell you what to change.
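
As a sketch of the software-systems case above, the same idea for a single service, where combinations of metrics (the failure-mode signatures of property 4) distinguish causes rather than merely flagging trouble. The thresholds and the signature-to-action table are hypothetical; in practice they come from your own SLOs and incident history.

    def classify(latency_ms: float, error_rate: float) -> str:
        # Illustrative signature table: different metric combinations
        # point to different fixes, not just "something is wrong".
        slow = latency_ms > 500      # assumed SLO thresholds
        failing = error_rate > 0.01
        if slow and failing:
            return "check downstream dependency health"
        if slow:
            return "add capacity or profile hot paths"
        if failing:
            return "inspect the latest deploy"
        return "healthy"

    print(classify(800, 0.002))  # slow but correct: capacity or profiling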

Related: [None yet]