Multi-Dimensional LLM Evaluation Framework
A framework built on the premise that LLM quality cannot be captured by a single metric. Effective evaluation assesses multiple dimensions, weighted appropriately for the use case.
Core Dimensions
- Accuracy: factual correctness, logical consistency
- Relevance: addresses the actual question/task
- Completeness: covers the necessary aspects
- Conciseness: appropriate length, no padding
- Safety: avoids harmful outputs
- Consistency: stable behavior across similar inputs
- Latency: response time
- Cost: computational resources required
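To make these dimensions concrete, a per-response score record can hold one value per dimension. A minimal sketch in Python; the class and field names (`DimensionScores`, `latency_ms`, `cost_usd`) are illustrative, not part of the framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DimensionScores:
    """One evaluated response, scored 0.0-1.0 on each quality dimension.

    Latency and cost are stored raw, since lower is better for both;
    normalize them separately before any aggregation.
    """
    accuracy: float
    relevance: float
    completeness: float
    conciseness: float
    safety: float
    consistency: float
    latency_ms: float   # raw response time in milliseconds
    cost_usd: float     # raw compute cost in US dollars
```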
Use-Case Dependent Weighting
Different applications weight these dimensions differently (a weighting sketch follows this list):
- Customer Service: relevance and safety weighted high; latency matters
- Code Generation: accuracy and completeness are critical
- Creative Writing: consistency matters less; voice matters more
- Research Assistance: accuracy is paramount; cost is secondary
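One simple way to apply such weightings is a weighted average over normalized 0-1 dimension scores. A minimal sketch, assuming scores are already normalized (including latency, where lower raw latency maps to a higher score); the weight values are hypothetical, not recommendations:

```python
# Illustrative per-use-case weights (hypothetical values). Weights are
# relative and get normalized by their sum before aggregation.
USE_CASE_WEIGHTS: dict[str, dict[str, float]] = {
    "customer_service": {"relevance": 3, "safety": 3, "latency": 2, "accuracy": 1},
    "code_generation":  {"accuracy": 3, "completeness": 3, "relevance": 1},
    "research":         {"accuracy": 4, "completeness": 2, "relevance": 2},
}

def weighted_score(scores: dict[str, float], use_case: str) -> float:
    """Aggregate normalized 0-1 dimension scores with use-case weights."""
    weights = USE_CASE_WEIGHTS[use_case]
    total = sum(weights.values())
    # Dimensions without a weight for this use case contribute nothing.
    return sum(weights[d] * scores.get(d, 0.0) for d in weights) / total

# Example: the same response ranks differently per use case.
scores = {"accuracy": 0.9, "relevance": 0.8, "safety": 1.0,
          "completeness": 0.6, "latency": 0.7}
print(weighted_score(scores, "customer_service"))  # ~0.86: safety/relevance dominate
print(weighted_score(scores, "code_generation"))   # ~0.76: completeness drags it down
```

That the same scores produce different rankings is the point of use-case dependent weighting.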
Evaluation Methods by Dimension
| Dimension | Automated | Human | Hybrid |
|---|---|---|---|
| Accuracy | Fact checking | Expert review | Ground truth + spot check |
| Relevance | Semantic similarity | User rating | Retrieval metrics + survey |
| Safety | Classifiers | Red team | Automated + human review |
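As an example of the hybrid column, a safety pipeline might send everything below a classifier threshold to human review while also spot-checking a random sample of automated passes. A sketch under those assumptions; `classifier` stands in for any safety model returning a 0-1 safe-probability (hypothetical interface):

```python
import random

def hybrid_safety_eval(
    responses: list[str],
    classifier,              # callable: str -> float, higher = safer (assumed)
    spot_check_rate: float = 0.05,
    flag_threshold: float = 0.8,
) -> tuple[list[str], list[str]]:
    """Split responses into flagged (human review) and a spot-check sample.

    Everything below the threshold goes to human review; a random slice
    of the passing set is also sampled to audit the classifier itself.
    """
    flagged, passed = [], []
    for r in responses:
        if classifier(r) >= flag_threshold:
            passed.append(r)
        else:
            flagged.append(r)
    # Spot-check: audit a small random sample of the automated passes.
    sample_size = max(1, int(len(passed) * spot_check_rate)) if passed else 0
    spot_check = random.sample(passed, sample_size)
    return flagged, spot_check
```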
Implementation
Define the dimensions for your use case first, then select metrics to serve them. Don't let the available metrics define what you measure.
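A dimensions-first plan can be as simple as a mapping written before any metric is chosen, which makes unmeasured dimensions explicit. A minimal sketch; the metric names are placeholders for whatever your use case actually needs:

```python
# Dimensions chosen first for a hypothetical code-generation assistant;
# metrics are then picked to serve each dimension (names illustrative).
EVAL_PLAN = {
    "accuracy":     {"metric": "unit_test_pass_rate", "method": "automated"},
    "completeness": {"metric": "requirement_coverage", "method": "hybrid"},
    "safety":       {"metric": "insecure_pattern_flags", "method": "automated"},
}

def missing_dimensions(required: set[str], plan: dict) -> set[str]:
    """Dimensions you decided to measure but have no metric for yet;
    the framework says to close this gap before trusting the eval."""
    return required - plan.keys()

print(missing_dimensions({"accuracy", "completeness", "safety", "latency"},
                         EVAL_PLAN))  # {'latency'}
```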
Related: 05-atom—evaluation-metric-limitations, 03-atom—benchmark-ecological-validity