Multi-Dimensional LLM Output Evaluation

Overview

When evaluating LLM-generated structured artifacts, no single metric captures quality adequately. A multi-dimensional approach reveals failure modes that single-metric evaluation misses.

Components

1. Requirements Coverage

Does the output satisfy stated requirements?

Check whether the output contains what was asked for. In ontology terms: can you write the queries the requirements specify? In code terms: do the tests pass? This is necessary but not sufficient.
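A minimal sketch of a coverage check, assuming each requirement can be paired with a hypothetical check() callable (e.g. a query that must return results, or a test that must pass); wiring up those callables is the real work:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Requirement:
        """A single stated requirement with an executable check."""
        name: str
        check: Callable[[object], bool]  # True if the artifact satisfies it

    def coverage(artifact: object, requirements: list[Requirement]) -> float:
        """Fraction of stated requirements the artifact satisfies."""
        if not requirements:
            return 1.0
        satisfied = sum(1 for r in requirements if r.check(artifact))
        return satisfied / len(requirements)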

2. Structural Quality

Is the output well-formed according to domain standards?

Automated scanners (like OOPS! for ontologies, linters for code) catch common structural errors: circular references, malformed constraints, inconsistent naming. These are objective and automatable.
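For generated Python code, a minimal structural check might combine the standard library's ast parser (well-formedness) with one naming-convention rule; OOPS! plays the analogous role for ontologies. A sketch, not a substitute for a real linter:

    import ast
    import re

    SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")

    def structural_issues(source: str) -> list[str]:
        """Return structural problems found in a generated Python module."""
        issues: list[str] = []
        try:
            tree = ast.parse(source)
        except SyntaxError as exc:
            return [f"not well-formed: {exc.msg} (line {exc.lineno})"]
        # Example rule: function names should follow snake_case.
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and not SNAKE_CASE.match(node.name):
                issues.append(f"inconsistent naming: function '{node.name}'")
        return issues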

3. Conciseness/Precision

Does the output contain only what’s needed?

Count superfluous elements: things that exist in the output but aren't required by any stated requirement. LLMs tend to over-generate, so high coverage + high superfluity = noise. The proportion of superfluous elements matters, not just their count.
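A sketch of a superfluity measure, assuming the output's elements and the requirement-implied elements can both be expressed as sets of identifiers (extracting those sets is the hard part):

    def superfluity(output_elements: set[str], required_elements: set[str]) -> tuple[int, float]:
        """Count elements present in the output but not required, and their share of the output."""
        extra = output_elements - required_elements
        ratio = len(extra) / len(output_elements) if output_elements else 0.0
        return len(extra), ratio

    # e.g. superfluity({"Person", "Order", "AuditLog"}, {"Person", "Order"}) -> (1, 0.33)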

4. Expert Qualitative Assessment

Would a practitioner actually use this?

Human experts evaluate holistically: naming quality, organization, usability, and whether the modeled solution matches how domain practitioners actually think. This catches issues that automated metrics miss.
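Expert judgment can't be automated, but its collection can be structured. A sketch assuming a hypothetical rubric scored 1-5 per criterion by each reviewer; the criteria names are illustrative:

    from statistics import mean, pstdev

    # Hypothetical rubric criteria, scored 1 (poor) to 5 (excellent) by each reviewer.
    CRITERIA = ("naming", "organization", "usability", "domain_fit")

    def aggregate_expert_scores(reviews: list[dict[str, int]]) -> dict[str, float]:
        """Average each criterion across reviewers; a high spread flags disagreement worth discussing."""
        summary: dict[str, float] = {}
        for criterion in CRITERIA:
            scores = [review[criterion] for review in reviews]
            summary[criterion] = mean(scores)
            summary[f"{criterion}_spread"] = pstdev(scores)
        return summary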

Why All Four Matter

Dimension     What It Catches
Coverage      Incomplete outputs, missing requirements
Structural    Technical errors, standards violations
Conciseness   Over-generation, noise, redundancy
Expert        Usability issues, naming problems, conceptual mismatches

An output can score well on three dimensions and fail the fourth. High coverage + clean structure + high redundancy = bloated. High coverage + low redundancy + poor usability = technically correct but painful to use.
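A sketch of how the four scores might be combined into a verdict that names the failure mode instead of averaging it away; the thresholds are illustrative placeholders, not prescribed values:

    def verdict(coverage: float, structural_issues: int,
                superfluity_ratio: float, expert_score: float) -> str:
        """Name the dominant failure mode; thresholds are illustrative, not prescribed."""
        if coverage < 0.9:
            return "incomplete: missing requirements"
        if structural_issues > 0:
            return "malformed: fix structural errors first"
        if superfluity_ratio > 0.2:
            return "bloated: high coverage but too much unrequired content"
        if expert_score < 3.5:
            return "technically correct but painful to use"
        return "acceptable on all four dimensions"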

Application Beyond Ontologies

This framework transfers to other structured LLM outputs:

  • Generated code: tests pass (coverage), linter clean (structural), minimal lines (conciseness), code review approval (expert) — see the sketch after this list
  • Data schemas: validates sample data (coverage), schema valid (structural), no unused fields (conciseness), matches domain model (expert)
  • Documentation: answers stated questions (coverage), follows style guide (structural), no tangents (conciseness), actually helpful (expert)
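For the generated-code row, the first three dimensions can be scripted; a sketch assuming pytest and ruff happen to be installed (any test runner and linter would do), with the expert dimension left to human review:

    import subprocess
    from pathlib import Path

    def evaluate_generated_code(package_dir: str) -> dict[str, object]:
        """Automate coverage, structural, and conciseness checks; expert review stays manual."""
        tests = subprocess.run(["pytest", "-q", package_dir], capture_output=True, text=True)
        lint = subprocess.run(["ruff", "check", package_dir], capture_output=True, text=True)
        line_count = sum(
            len(p.read_text().splitlines()) for p in Path(package_dir).rglob("*.py")
        )
        return {
            "coverage_ok": tests.returncode == 0,   # do the tests pass?
            "structural_ok": lint.returncode == 0,  # is the linter clean?
            "total_lines": line_count,              # rough proxy for conciseness
            "expert_review": "pending",             # human code review still required
        }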

Limitations

  • Expert evaluation is expensive and subjective
  • Dimensions can trade off against each other
  • Thresholds for “good enough” are context-dependent
  • Doesn’t capture downstream consequences of subtle errors

Related: [None yet]