Automated Ontology Derivation Pattern

Context

You have a large corpus of unstructured documents in a specialized domain. You need structured knowledge for retrieval, but manual ontology creation is prohibitively expensive and doesn’t scale.

Problem

Traditional ontology engineering requires domain experts to manually define classes, properties, and relationships. This works for stable, well-bounded domains but fails when:

The corpus is large and evolving
Domain expertise is scarce or expensive
The structure needs to emerge from the documents themselves

Meanwhile, direct clustering (as in GraphRAG) captures co-occurrence patterns but loses the categorical structure that domain experts would recognize.

Solution

Derive ontology structure through layered extraction and community detection:

Entity Extraction: Pull named entities and their properties from text using LLM-based NER
Relationship Extraction: Identify relationships between entities as subject-predicate-object triples
Initial Graph Construction: Build a graph where entities are nodes, relationships are edges
Similarity Clustering: Group similar entities by name and definition embeddings into candidate classes
Community Detection: Apply algorithms (Leiden, Louvain) to partition the graph into communities that become ontology classes
Property Synthesis: Use LLMs to generalize properties across class members
Hierarchy Derivation: Perform sub-community detection within classes to create hierarchical structure
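
The graph-construction and community-detection steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pattern's reference implementation: the triples are hypothetical sample data, and a simple deterministic label propagation stands in for Leiden or Louvain so the example runs on the standard library alone. The function names (build_graph, label_propagation, induced_subgraph) are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical triples standing in for the output of the extraction steps.
TRIPLES = [
    ("pump", "connected_to", "valve"),
    ("valve", "connected_to", "pipe"),
    ("pipe", "connected_to", "pump"),
    ("router", "links", "switch"),
    ("switch", "links", "firewall"),
    ("firewall", "links", "router"),
    ("pipe", "monitored_by", "router"),
]


def build_graph(triples):
    """Undirected adjacency map: entities are nodes, relationships are edges."""
    adj = defaultdict(set)
    for subj, _pred, obj in triples:
        adj[subj].add(obj)
        adj[obj].add(subj)
    return adj


def label_propagation(adj, max_rounds=20):
    """Stand-in for Leiden/Louvain: each node adopts its neighbours' majority label."""
    labels = {node: node for node in adj}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):  # fixed visiting order keeps the result deterministic
            counts = Counter(labels[n] for n in adj[node])
            if not counts:  # isolated node: nothing to adopt
                continue
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            # On a tie, keep the current label if possible, else take the smallest.
            new = labels[node] if labels[node] in candidates else min(candidates)
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(groups.values(), key=sorted)


def induced_subgraph(adj, nodes):
    """Restrict the graph to one community, e.g. to look for sub-communities."""
    return {n: adj[n] & nodes for n in nodes}


graph = build_graph(TRIPLES)
communities = label_propagation(graph)  # each set is a candidate ontology class
```

Re-running the detection on induced_subgraph(graph, community) is one way to approximate the hierarchy-derivation step. On this tiny sample each community is a single triangle, so no further split appears. A production pipeline would more likely use an established Leiden or Louvain implementation, which optimizes modularity rather than propagating labels.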

The key insight: use community detection to find natural groupings, then apply LLM-based synthesis so that the resulting classes have coherent, generalizable properties rather than being mere statistical clusters.
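
To show where synthesis fits after clustering, here is a hedged sketch of turning one community into a class with shared properties. The pattern calls for an LLM at this step. The frequency filter below is only a stand-in that keeps properties held by most members, and the generalize hook marks where an LLM call could rename or merge near-duplicate properties. The function name, the min_share parameter, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def synthesize_properties(members, min_share=0.6, generalize=None):
    """Derive class-level properties from a community's members.

    members: dict mapping entity name -> set of extracted property names.
    min_share: fraction of members that must carry a property to keep it.
    generalize: optional callable(shared, members), a hypothetical hook where
        an LLM could merge or rename properties; unused by default.
    """
    counts = Counter(prop for props in members.values() for prop in props)
    threshold = min_share * len(members)
    shared = sorted(prop for prop, c in counts.items() if c >= threshold)
    if generalize is not None:
        return generalize(shared, members)
    return shared


# Hypothetical members of one detected community.
fluid_components = {
    "pump": {"flow_rate", "pressure", "manufacturer"},
    "valve": {"pressure", "manufacturer", "diameter"},
    "pipe": {"diameter", "pressure"},
}

class_properties = synthesize_properties(fluid_components)
```

Here flow_rate is dropped because only one of the three members carries it, while the other properties appear on at least two. A frequency filter cannot recognize that two differently named properties mean the same thing, which is the gap the LLM-based synthesis is meant to close.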

Consequences

Benefits:

Scales to large corpora without manual curation
Structure emerges from the documents themselves
Preserves hierarchical relationships that pure clustering loses

Costs:

Computationally expensive (300+ minutes for 1M tokens)
Quality depends on LLM extraction accuracy
May require domain-specific prompts for entity/relationship extraction

Tradeoffs:

The resulting ontology is descriptive (what’s in the corpus), not prescriptive (what should be in the domain)
Community boundaries may not align with expert intuitions
Hierarchies are typically shallower than those of hand-crafted ontologies

When This Applies

Large technical documentation corpora
Domains where expert-created ontologies don’t exist
Retrieval systems that need categorical structure for comprehensiveness
Situations where “good enough” automated structure beats waiting for perfect manual curation
