Multi-Dimensional LLM Output Evaluation

Overview

When evaluating LLM-generated structured artifacts, no single metric captures quality adequately. A multi-dimensional approach reveals failure modes that single-metric evaluation misses.

Components

1. Requirements Coverage

Does the output satisfy stated requirements?

Check whether the output contains what was asked for. In ontology terms: can you write the queries the requirements specify? In code terms: do the tests pass? This is necessary but not sufficient.
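A minimal sketch of a coverage check, assuming each requirement can be paired with a hypothetical check() callable (e.g. a query that must return results, or a test that must pass); wiring up those callables is the real work:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Requirement:
        """A single stated requirement with an executable check."""
        name: str
        check: Callable[[object], bool]  # True if the artifact satisfies it

    def coverage(artifact: object, requirements: list[Requirement]) -> float:
        """Fraction of stated requirements the artifact satisfies."""
        if not requirements:
            return 1.0
        satisfied = sum(1 for r in requirements if r.check(artifact))
        return satisfied / len(requirements)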

2. Structural Quality

Is the output well-formed according to domain standards?

Automated scanners (like OOPS! for ontologies, linters for code) catch common structural errors: circular references, malformed constraints, inconsistent naming. These are objective and automatable.
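For generated Python code, a minimal structural check might combine the standard library's ast parser (well-formedness) with one naming-convention rule; OOPS! plays the analogous role for ontologies. A sketch, not a substitute for a real linter:

    import ast
    import re

    SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")

    def structural_issues(source: str) -> list[str]:
        """Return structural problems found in a generated Python module."""
        issues: list[str] = []
        try:
            tree = ast.parse(source)
        except SyntaxError as exc:
            return [f"not well-formed: {exc.msg} (line {exc.lineno})"]
        # Example rule: function names should follow snake_case.
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and not SNAKE_CASE.match(node.name):
                issues.append(f"inconsistent naming: function '{node.name}'")
        return issues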

3. Conciseness/Precision

Does the output contain only what’s needed?

Count superfluous elements: things that exist in the output but aren't required by any stated requirement. LLMs tend to over-generate, so high coverage + high superfluity = noise. The proportion of superfluous elements matters, not just their count.
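A sketch of a superfluity measure, assuming the output's elements and the requirement-implied elements can both be expressed as sets of identifiers (extracting those sets is the hard part):

    def superfluity(output_elements: set[str], required_elements: set[str]) -> tuple[int, float]:
        """Count elements present in the output but not required, and their share of the output."""
        extra = output_elements - required_elements
        ratio = len(extra) / len(output_elements) if output_elements else 0.0
        return len(extra), ratio

    # e.g. superfluity({"Person", "Order", "AuditLog"}, {"Person", "Order"}) -> (1, 0.33)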

4. Expert Qualitative Assessment

Would a practitioner actually use this?

Human experts evaluate holistically: naming quality, organization, usability, and whether the modeled solution matches how domain practitioners actually think. This catches issues that automated metrics miss.
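Expert judgment can't be automated, but its collection can be structured. A sketch assuming a hypothetical rubric scored 1-5 per criterion by each reviewer; the criteria names are illustrative:

    from statistics import mean, pstdev

    # Hypothetical rubric criteria, scored 1 (poor) to 5 (excellent) by each reviewer.
    CRITERIA = ("naming", "organization", "usability", "domain_fit")

    def aggregate_expert_scores(reviews: list[dict[str, int]]) -> dict[str, float]:
        """Average each criterion across reviewers; a high spread flags disagreement worth discussing."""
        summary: dict[str, float] = {}
        for criterion in CRITERIA:
            scores = [review[criterion] for review in reviews]
            summary[criterion] = mean(scores)
            summary[f"{criterion}_spread"] = pstdev(scores)
        return summary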

Why All Four Matter

Dimension     What It Catches
Coverage      Incomplete outputs, missing requirements
Structural    Technical errors, standards violations
Conciseness   Over-generation, noise, redundancy
Expert        Usability issues, naming problems, conceptual mismatches

An output can score well on three dimensions and fail the fourth. High coverage + clean structure + high redundancy = bloated. High coverage + low redundancy + poor usability = technically correct but painful to use.
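A sketch of how the four scores might be combined into a verdict that names the failure mode instead of averaging it away; the thresholds are illustrative placeholders, not prescribed values:

    def verdict(coverage: float, structural_issues: int,
                superfluity_ratio: float, expert_score: float) -> str:
        """Name the dominant failure mode; thresholds are illustrative, not prescribed."""
        if coverage < 0.9:
            return "incomplete: missing requirements"
        if structural_issues > 0:
            return "malformed: fix structural errors first"
        if superfluity_ratio > 0.2:
            return "bloated: high coverage but too much unrequired content"
        if expert_score < 3.5:
            return "technically correct but painful to use"
        return "acceptable on all four dimensions"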

Application Beyond Ontologies

This framework transfers to other structured LLM outputs:

  • Generated code: tests pass (coverage), linter clean (structural), minimal lines (conciseness), code review approval (expert) — see the sketch after this list
  • Data schemas: validates sample data (coverage), schema valid (structural), no unused fields (conciseness), matches domain model (expert)
  • Documentation: answers stated questions (coverage), follows style guide (structural), no tangents (conciseness), actually helpful (expert)
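For the generated-code row, the first three dimensions can be scripted; a sketch assuming pytest and ruff happen to be installed (any test runner and linter would do), with the expert dimension left to human review:

    import subprocess
    from pathlib import Path

    def evaluate_generated_code(package_dir: str) -> dict[str, object]:
        """Automate coverage, structural, and conciseness checks; expert review stays manual."""
        tests = subprocess.run(["pytest", "-q", package_dir], capture_output=True, text=True)
        lint = subprocess.run(["ruff", "check", package_dir], capture_output=True, text=True)
        line_count = sum(
            len(p.read_text().splitlines()) for p in Path(package_dir).rglob("*.py")
        )
        return {
            "coverage_ok": tests.returncode == 0,   # do the tests pass?
            "structural_ok": lint.returncode == 0,  # is the linter clean?
            "total_lines": line_count,              # rough proxy for conciseness
            "expert_review": "pending",             # human code review still required
        }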

Limitations

  • Expert evaluation is expensive and subjective
  • Dimensions can trade off against each other
  • Thresholds for “good enough” are context-dependent
  • Doesn’t capture downstream consequences of subtle errors

Related: [None yet]