Automated Ontology Derivation Pattern
Context
You have a large corpus of unstructured documents in a specialized domain. You need structured knowledge for retrieval, but manual ontology creation is prohibitively expensive and doesn’t scale.
Problem
Traditional ontology engineering requires domain experts to manually define classes, properties, and relationships. This works for stable, well-bounded domains but fails when:
- The corpus is large and evolving
- Domain expertise is scarce or expensive
- The structure needs to emerge from the documents themselves
Meanwhile, direct clustering (as in GraphRAG) captures co-occurrence patterns but loses the categorical structure that domain experts would recognize.
Solution
Derive ontology structure through layered extraction and community detection:
- Entity Extraction: Pull named entities and their properties from text using LLM-based NER
- Relationship Extraction: Identify relationships between entities as subject-predicate-object triples
- Initial Graph Construction: Build a graph in which entities are nodes and relationships are edges
- Similarity Clustering: Group similar entities by name and definition embeddings into candidate classes
- Community Detection: Apply algorithms (Leiden, Louvain) to partition the graph into communities that become ontology classes
- Property Synthesis: Use LLMs to generalize properties across class members
- Hierarchy Derivation: Perform sub-community detection within classes to create hierarchical structure
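The graph construction and community detection steps above can be sketched with the standard library alone. This is a minimal illustration, not the pattern's reference implementation: the triples are invented toy data, the function names `build_graph` and `detect_communities` are hypothetical, and a simple deterministic label propagation stands in for Leiden or Louvain purely to avoid external dependencies.

```python
from collections import Counter, defaultdict

# Toy triples (invented for illustration); in the pattern these would come
# from the LLM-based entity and relationship extraction steps.
TRIPLES = [
    ("pump_a", "feeds", "valve_b"),
    ("valve_b", "controls", "tank_c"),
    ("tank_c", "supplies", "pump_a"),
    ("error_e1", "raised_by", "logger_l"),
    ("logger_l", "writes", "file_f"),
    ("file_f", "records", "error_e1"),
    ("tank_c", "reports_to", "logger_l"),  # single bridge between the two groups
]


def build_graph(triples):
    """Undirected adjacency map: entities are nodes, relationships are edges."""
    adj = defaultdict(set)
    for subj, _pred, obj in triples:
        adj[subj].add(obj)
        adj[obj].add(subj)
    return adj


def detect_communities(adj, max_rounds=20):
    """Label propagation (a stand-in for Leiden/Louvain, not equivalent to them).

    Each node repeatedly adopts the most common label among its neighbours.
    A node keeps its current label when that label is tied for most common;
    otherwise ties go to the alphabetically smallest label.
    """
    labels = {node: node for node in adj}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adj):
            counts = Counter(labels[n] for n in adj[node])
            best = max(counts.values())
            winners = sorted(l for l, c in counts.items() if c == best)
            new = labels[node] if labels[node] in winners else winners[0]
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    # Sort for a stable, reproducible ordering of candidate classes.
    return sorted(groups.values(), key=lambda g: sorted(g))


communities = detect_communities(build_graph(TRIPLES))
for i, members in enumerate(communities):
    print(f"candidate class {i}: {sorted(members)}")
```

On this toy input the two densely connected triangles end up as separate candidate classes despite the bridging edge. In a real pipeline a modularity-based algorithm would replace `detect_communities`, and hierarchy derivation would rerun detection on the subgraph induced by each community.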
The key insight: use community detection to find natural groupings, then apply LLM-based synthesis so that the resulting classes carry coherent, generalizable properties rather than remaining purely statistical clusters.
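One way to picture the synthesis step is as a function that takes the members of a detected community and returns class-level properties. The sketch below is an assumption about how that might look: `candidate_properties`, `synthesize_properties`, the `min_share` threshold, and the prompt wording are all hypothetical, and the `llm` argument is any callable that maps a prompt string to a list of property names. When no model is supplied, a plain frequency filter stands in for the LLM, which is exactly the "statistical only" baseline the pattern tries to improve on.

```python
from collections import Counter

# Toy community members with extracted properties (invented for illustration).
MEMBERS = {
    "pump_a": {"flow_rate": "12 L/min", "vendor": "Acme", "serial": "P-1"},
    "valve_b": {"flow_rate": "8 L/min", "serial": "V-7"},
    "tank_c": {"capacity": "400 L", "serial": "T-3"},
}


def candidate_properties(members, min_share=0.6):
    """Property names present on at least min_share of the class members."""
    counts = Counter(key for props in members.values() for key in props)
    cutoff = min_share * len(members)
    return sorted(key for key, n in counts.items() if n >= cutoff)


def synthesize_properties(class_name, members, llm=None):
    """Generalize properties across class members.

    `llm` is a hypothetical callable (prompt -> list of property names).
    Without it, the frequency-based candidates are returned unchanged.
    """
    candidates = candidate_properties(members)
    if llm is None:
        return candidates
    prompt = (
        f"Class '{class_name}' has members {sorted(members)}. "
        f"Properties seen on most members: {candidates}. "
        "Return a generalized property list for the class."
    )
    return llm(prompt)


baseline = synthesize_properties("FluidComponent", MEMBERS)
print(baseline)
```

Passing a real model client as `llm` would let it merge near-duplicates (for example `flow_rate` and `capacity` under a broader property) instead of only counting exact key matches.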
Consequences
Benefits:
- Scales to large corpora without manual curation
- Structure emerges from the documents themselves
- Preserves hierarchical relationships that pure clustering loses
Costs:
- Computationally expensive (300+ minutes for 1M tokens)
- Quality depends on LLM extraction accuracy
- May require domain-specific prompts for entity/relationship extraction
Tradeoffs:
- The resulting ontology is descriptive (what’s in the corpus), not prescriptive (what should be in the domain)
- Community boundaries may not align with expert intuitions
- Hierarchies tend to be shallower than those of hand-crafted ontologies
When This Applies
- Large technical documentation corpora
- Domains where expert-created ontologies don’t exist
- Retrieval systems that need categorical structure for comprehensiveness
- Situations where “good enough” automated structure beats waiting for perfect manual curation
Related: 06-atom—ontological-integrity, 06-molecule—knowledge-graph-construction