Multi-Dimensional LLM Evaluation Framework

A framework built on the premise that LLM quality cannot be captured by a single metric. Effective evaluation requires assessing multiple dimensions, weighted appropriately for the use case.

Core Dimensions

  • Accuracy: Factual correctness, logical consistency
  • Relevance: Addresses the actual question/task
  • Completeness: Covers necessary aspects
  • Conciseness: Appropriate length, no padding
  • Safety: Avoids harmful outputs
  • Consistency: Stable behavior across similar inputs
  • Latency: Response time
  • Cost: Computational resources required
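
One way to make these dimensions concrete is a per-response scorecard. The sketch below is illustrative: the class name, the fields, and the choice to normalize every dimension to [0, 1] are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class DimensionScores:
    """Per-response scores, each normalized to [0, 1] (higher is better)."""
    accuracy: float = 0.0
    relevance: float = 0.0
    completeness: float = 0.0
    conciseness: float = 0.0
    safety: float = 0.0
    consistency: float = 0.0
    latency: float = 0.0   # 1.0 = fastest acceptable response
    cost: float = 0.0      # 1.0 = cheapest acceptable response

    def as_dict(self) -> dict[str, float]:
        return asdict(self)
```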

Use-Case Dependent Weighting

Different applications weight dimensions differently (a weighted-aggregate sketch follows the list):

  • Customer Service: Relevance, safety high; latency matters
  • Code Generation: Accuracy, completeness critical
  • Creative Writing: Consistency less important; voice matters
  • Research Assistance: Accuracy paramount; cost secondary
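
A minimal sketch of use-case weighting, assuming the scorecard above; the weight profiles and numbers are invented for illustration, not recommended values.

```python
USE_CASE_WEIGHTS: dict[str, dict[str, float]] = {
    "customer_service": {"relevance": 0.3, "safety": 0.3, "latency": 0.2, "accuracy": 0.2},
    "code_generation":  {"accuracy": 0.4, "completeness": 0.4, "conciseness": 0.2},
    "research":         {"accuracy": 0.6, "completeness": 0.3, "cost": 0.1},
}

def weighted_score(scores: dict[str, float], use_case: str) -> float:
    """Aggregate per-dimension scores with use-case-specific weights."""
    weights = USE_CASE_WEIGHTS[use_case]
    return sum(w * scores.get(dim, 0.0) for dim, w in weights.items())

# The same response aggregates differently per use case.
scores = {"accuracy": 0.9, "relevance": 0.8, "completeness": 0.6,
          "conciseness": 0.9, "safety": 1.0, "latency": 0.5, "cost": 0.6}
print(weighted_score(scores, "customer_service"))  # ≈ 0.82
print(weighted_score(scores, "code_generation"))   # ≈ 0.78
```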

Evaluation Methods by Dimension

| Dimension | Automated           | Human         | Hybrid                     |
|-----------|---------------------|---------------|----------------------------|
| Accuracy  | Fact checking       | Expert review | Ground truth + spot check  |
| Relevance | Semantic similarity | User rating   | Retrieval metrics + survey |
| Safety    | Classifiers         | Red team      | Automated + human review   |
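
The hybrid column can be operationalized as a routing rule: score everything automatically, then escalate low-scoring outputs plus a random sample to human reviewers. The sketch below assumes a caller-supplied `automated_score` function returning a value in [0, 1]; the thresholds are illustrative.

```python
import random

def route_for_review(outputs: list[str], automated_score,
                     score_floor: float = 0.8,
                     spot_check_rate: float = 0.05) -> tuple[list[str], list[str]]:
    """Split outputs into auto-accepted and human-review queues."""
    auto_accepted, needs_human = [], []
    for out in outputs:
        # Escalate on a failed automated check, or randomly for spot checking.
        if automated_score(out) < score_floor or random.random() < spot_check_rate:
            needs_human.append(out)
        else:
            auto_accepted.append(out)
    return auto_accepted, needs_human
```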

Implementation

Define dimensions for your use case first. Then select metrics. Don’t let available metrics define what you measure.
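
One way to enforce dimensions-first in code is to declare the dimensions as the source of truth and fail loudly when a dimension has no metric attached. A minimal sketch; the names are hypothetical.

```python
# Dimensions are declared first, independent of available tooling.
REQUIRED_DIMENSIONS = {"accuracy", "relevance", "safety"}

# Metrics are then attached to dimensions, never the reverse.
METRICS = {
    "accuracy": lambda output, reference: float(output.strip() == reference.strip()),
    "relevance": None,  # no metric selected yet; must be chosen deliberately
    "safety": None,
}

missing = {d for d in REQUIRED_DIMENSIONS if METRICS.get(d) is None}
if missing:
    # Fails here by design: two dimensions still lack metrics.
    raise ValueError(f"No metric selected for dimensions: {sorted(missing)}")
```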

Related: 05-atom--evaluation-metric-limitations, 03-atom--benchmark-ecological-validity