Multi-Dimensional LLM Output Evaluation
Overview
No single metric adequately captures the quality of LLM-generated structured artifacts. A multi-dimensional approach reveals failure modes that single-metric evaluation misses.
Components
1. Requirements Coverage
Does the output satisfy stated requirements?
Check whether the output contains what was asked for. In ontology terms: can you write the queries the requirements specify? In code terms: do the tests pass? This is necessary but not sufficient.
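A minimal sketch of what this looks like mechanically, assuming each requirement can be paired with an executable check. The `Requirement` and `coverage` names are illustrative, not from any particular library:

```python
# Sketch: coverage as the fraction of stated requirements whose check
# passes against the generated artifact. Names here are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Requirement:
    description: str
    check: Callable[[object], bool]  # True if the artifact satisfies it

def coverage(artifact: object, requirements: list[Requirement]) -> float:
    """Fraction of requirements the artifact satisfies (1.0 = full coverage)."""
    if not requirements:
        return 1.0
    passed = sum(1 for r in requirements if r.check(artifact))
    return passed / len(requirements)
```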
2. Structural Quality
Is the output well-formed according to domain standards?
Automated scanners (like OOPS! for ontologies, linters for code) catch common structural errors: circular references, malformed constraints, inconsistent naming. These are objective and automatable.
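As a toy illustration for code artifacts, here is a standard-library-only structural check. Real scanners like OOPS! and production linters go much further; only the shape is the point, and `structural_issues` is a made-up name:

```python
# Minimal structural check for generated Python code: does it parse, and
# does it avoid duplicate top-level definitions (a cheap proxy for
# inconsistent naming)?
import ast
from collections import Counter

def structural_issues(source: str) -> list[str]:
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]
    names = Counter(
        node.name for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    )
    return [f"duplicate top-level definition: {n}"
            for n, count in names.items() if count > 1]
```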
3. Conciseness/Precision
Does the output contain only what’s needed?
Count superfluous elements: things that exist in the output but aren’t required by any stated requirement. LLMs tend to over-generate, so high coverage + high superfluity = noise. What matters is the ratio of superfluous to total elements, not the raw count.
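Under the assumption that output elements and requirement-demanded elements can both be enumerated as sets of identifiers, that ratio is a one-liner:

```python
# Sketch of the superfluity ratio, assuming elements are enumerable as
# identifier sets. 0.0 = lean output; higher = more noise.
def superfluity(output_elements: set[str], required_elements: set[str]) -> float:
    if not output_elements:
        return 0.0
    return len(output_elements - required_elements) / len(output_elements)
```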
4. Expert Qualitative Assessment
Would a practitioner actually use this?
Human experts evaluate holistically: naming quality, organization, usability, whether the modeled solution matches how domain practitioners actually think. Expert review catches issues that automated metrics miss.
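The judgment itself can’t be automated, but its collection can be structured. A hypothetical rubric aggregation, taking the median per criterion so one outlier reviewer doesn’t dominate:

```python
# Hypothetical rubric aggregation: each expert scores named criteria 1-5;
# the per-criterion median resists a single harsh or generous reviewer.
from statistics import median

def aggregate_rubric(scores: list[dict[str, int]]) -> dict[str, float]:
    """scores: one dict per expert, e.g. {"naming": 4, "organization": 3}."""
    criteria = {c for s in scores for c in s}
    return {c: median(s[c] for s in scores if c in s) for c in criteria}
```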
Why All Four Matter
| Dimension | What It Catches |
|---|---|
| Coverage | Incomplete outputs, missing requirements |
| Structural | Technical errors, standards violations |
| Conciseness | Over-generation, noise, redundancy |
| Expert | Usability issues, naming problems, conceptual mismatches |
An output can score well on three dimensions and fail the fourth. High coverage + clean structure + high redundancy = bloated. High coverage + low redundancy + poor usability = technically correct but painful to use.
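One way to operationalize this: report each dimension against its own threshold instead of averaging into a single score, so a single-dimension failure stays visible. The thresholds below are placeholders, since (as noted under Limitations) “good enough” is context-dependent:

```python
# Illustrative composite report. Averaging would hide single-dimension
# failures; per-dimension flags keep them visible. Thresholds are placeholders.
from dataclasses import dataclass

@dataclass
class EvalReport:
    coverage: float        # 0..1, higher is better
    structural_issues: int
    superfluity: float     # 0..1, lower is better
    expert_score: float    # e.g. median rubric score, 1..5

    def failures(self) -> list[str]:
        flags = []
        if self.coverage < 0.9:
            flags.append("incomplete: coverage below 0.9")
        if self.structural_issues > 0:
            flags.append(f"{self.structural_issues} structural issue(s)")
        if self.superfluity > 0.2:
            flags.append("bloated: over 20% superfluous elements")
        if self.expert_score < 3.0:
            flags.append("experts rate it below usable")
        return flags

# High coverage + clean structure + high superfluity = bloated:
report = EvalReport(coverage=0.95, structural_issues=0,
                    superfluity=0.35, expert_score=4.0)
print(report.failures())  # ['bloated: over 20% superfluous elements']
```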
Application Beyond Ontologies
This framework transfers to other structured LLM outputs:
- Generated code: tests pass (coverage), linter clean (structural), minimal lines (conciseness), code review approval (expert); a sketch follows this list
- Data schemas: validates sample data (coverage), schema valid (structural), no unused fields (conciseness), matches domain model (expert)
- Documentation: answers stated questions (coverage), follows style guide (structural), no tangents (conciseness), actually helpful (expert)
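For the generated-code row specifically, a rough sketch that assumes pytest and ruff are installed. Exit codes stand in for the first two dimensions, a non-blank line count is a crude conciseness proxy, and expert review stays human:

```python
# Sketch for the "generated code" row, assuming pytest and ruff are on PATH.
import subprocess

def check_generated_code(path: str) -> dict[str, object]:
    tests = subprocess.run(["pytest", "-q", path], capture_output=True)
    lint = subprocess.run(["ruff", "check", path], capture_output=True)
    with open(path) as f:
        loc = sum(1 for line in f if line.strip())
    return {
        "coverage_ok": tests.returncode == 0,   # all tests pass
        "structural_ok": lint.returncode == 0,  # linter clean
        "nonblank_lines": loc,                  # compare against a baseline
    }
```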
Limitations
- Expert evaluation is expensive and subjective
- Dimensions can trade off against each other
- Thresholds for “good enough” are context-dependent
- Doesn’t capture downstream consequences of subtle errors
Related: [None yet]