Evaluation Metric Mismatch in QA Systems
Standard evaluation metrics for question-answering systems frequently penalize outputs that are semantically correct but phrased differently from the reference answer.
Exact match scores in one study dropped to near zero not because systems produced wrong answers, but because they used conversational phrasing when benchmarks expected terse responses. F1 scores showed similar brittleness to surface-level variation.
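A minimal sketch of why this happens, using the standard SQuAD-style exact-match and token-F1 definitions (normalization here is the usual lowercase / strip-punctuation / drop-articles recipe; the example answers are illustrative):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, strip punctuation, drop articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A semantically correct but conversational answer scores 0.0 on exact match,
# and F1 is dragged down because most of its tokens are padding around the answer.
print(exact_match("Paris", "Paris"))                            # 1.0
print(exact_match("The capital of France is Paris.", "Paris"))  # 0.0
print(token_f1("The capital of France is Paris.", "Paris"))     # ~0.33
```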
LLM-based correctness scoring proves more tolerant of lexical variation but introduces inconsistencies of its own: near-verbatim answers sometimes receive less than full credit due to format sensitivity and implicit assumptions baked into the grader.
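A sketch of what such a grader looks like, assuming a generic LLM judge; the prompt wording and the `call_llm` helper are hypothetical, and the prompt itself is one of the places where format sensitivity creeps in:

```python
def build_grader_prompt(question: str, reference: str, prediction: str) -> str:
    """Construct a correctness-grading prompt for an LLM judge.

    The wording below is illustrative; judge behavior is sensitive to exactly
    this kind of phrasing, which is one source of the inconsistencies noted above.
    """
    return (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {prediction}\n"
        "Reply with exactly one word: CORRECT if the candidate conveys the "
        "same answer as the reference, otherwise INCORRECT."
    )

def llm_correct(question: str, reference: str, prediction: str, call_llm) -> bool:
    """Score one example with an LLM judge.

    `call_llm` is any function mapping a prompt string to a completion string
    (hypothetical here; plug in whichever client you use). Parsing is kept strict
    so formatting drift by the judge surfaces as an error, not a silent mis-grade.
    """
    reply = call_llm(build_grader_prompt(question, reference, prediction))
    verdict = reply.strip().upper()
    if verdict not in {"CORRECT", "INCORRECT"}:
        raise ValueError(f"Unparseable grader reply: {reply!r}")
    return verdict == "CORRECT"
```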
The measurement problem compounds the optimization problem: if your metrics don’t capture what matters, tuning to them may not improve actual utility.
Related: 05-atom—uniform-confidence-problem