Dual Extraction Architecture
Overview
A knowledge graph construction pipeline that supports both lightweight (dependency parsing) and heavyweight (LLM-based) extraction paths, enabling cost-accuracy tradeoffs based on deployment requirements.
Components
Lightweight Path (Dependency Parsing)
- Uses spaCy’s dependency parser to extract entity-relation triples from syntactic structure
- Includes passive voice normalization, phrasal merging, and coreference resolution
- Domain-agnostic, requires no domain-specific training
- Achieves 94% of the LLM-based path’s performance at a fraction of the cost
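The core of the lightweight path, reading triples off a dependency parse and folding passive constructions back into active order, can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Token` class and `extract_triples` function are hypothetical, and the two sentences are parsed by hand rather than by calling spaCy, so the sketch runs without a model download. The dependency labels (`nsubj`, `dobj`, `nsubjpass`, `agent`, `pobj`) are assumed to match those emitted by spaCy's English models. Phrasal merging and coreference resolution are left out.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    lemma: str
    dep: str   # dependency label in the style of spaCy's English models (assumption)
    head: int  # index of the head token; the root points to itself


def children_of(tokens, head_idx):
    """Map dependency label -> token index for the direct children of a head."""
    return {t.dep: j for j, t in enumerate(tokens)
            if t.head == head_idx and j != head_idx}


def extract_triples(tokens):
    """Return (subject, relation, object) triples for one hand-parsed sentence."""
    triples = []
    for i, tok in enumerate(tokens):
        if tok.dep != "ROOT":
            continue
        kids = children_of(tokens, i)
        if "nsubj" in kids and "dobj" in kids:
            # Active voice: subject and direct object hang off the verb.
            triples.append((tokens[kids["nsubj"]].text, tok.lemma,
                            tokens[kids["dobj"]].text))
        elif "nsubjpass" in kids and "agent" in kids:
            # Passive voice: "Y was acquired by X" is normalized to (X, acquire, Y).
            agent_kids = children_of(tokens, kids["agent"])
            if "pobj" in agent_kids:
                triples.append((tokens[agent_kids["pobj"]].text, tok.lemma,
                                tokens[kids["nsubjpass"]].text))
    return triples


# "Google acquired DeepMind."
ACTIVE = [
    Token("Google", "Google", "nsubj", 1),
    Token("acquired", "acquire", "ROOT", 1),
    Token("DeepMind", "DeepMind", "dobj", 1),
]

# "DeepMind was acquired by Google."
PASSIVE = [
    Token("DeepMind", "DeepMind", "nsubjpass", 2),
    Token("was", "be", "auxpass", 2),
    Token("acquired", "acquire", "ROOT", 2),
    Token("by", "by", "agent", 2),
    Token("Google", "Google", "pobj", 3),
]

print(extract_triples(ACTIVE))
print(extract_triples(PASSIVE))
```

Both sentences yield the same triple, which is the point of passive voice normalization: the graph gets one edge regardless of how the sentence was phrased.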
Heavyweight Path (LLM-Based)
- Uses GPT-family models with few-shot prompting
- Higher accuracy on complex, ambiguous text
- Appropriate for critical document collections where maximum precision matters
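Few-shot prompting for triple extraction might look roughly like the sketch below. Everything here is an assumption made for illustration: the instruction wording, the pipe-delimited output format, the example sentences, and the `build_prompt` and `parse_triples` helpers are hypothetical, and the model call itself is deliberately omitted because the note does not say which client or model is used.

```python
# Hypothetical few-shot examples: (input sentence, expected triples).
FEW_SHOT_EXAMPLES = [
    ("Marie Curie discovered polonium.",
     [("Marie Curie", "discovered", "polonium")]),
    ("The Nile flows through Egypt.",
     [("The Nile", "flows through", "Egypt")]),
]


def build_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt ending where the model should continue."""
    parts = ["Extract triples as 'subject | relation | object', one per line."]
    for sentence, triples in examples:
        parts.append(f"Text: {sentence}")
        parts.append("Triples:")
        parts.extend(" | ".join(t) for t in triples)
    parts.append(f"Text: {text}")
    parts.append("Triples:")
    return "\n".join(parts)


def parse_triples(completion):
    """Keep only lines with exactly three non-empty pipe-separated fields."""
    triples = []
    for line in completion.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 3 and all(fields):
            triples.append(tuple(fields))
    return triples


prompt = build_prompt("DeepMind was acquired by Google.")
# A completion would come from whatever LLM client the deployment uses;
# this string stands in for one.
fake_completion = "Google | acquired | DeepMind\nnot a triple"
print(parse_triples(fake_completion))
```

Parsing defensively matters here because model output is free text: malformed lines are dropped rather than stored as edges.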
Unified Backend
- Both paths produce entity-relation graphs in the same format
- Stored in a single graph database for downstream retrieval
- Enables incremental construction with mixed extraction modes
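One way to picture the shared format is a single triple record that carries its provenance, so edges from either path land in the same store. The sketch below is illustrative only: `Triple`, `GraphStore`, and their fields are hypothetical names, and an in-memory dictionary stands in for the graph database, which the note does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    source_doc: str
    extractor: str  # "lightweight" or "heavyweight"


class GraphStore:
    """In-memory stand-in for the graph database (assumption)."""

    def __init__(self):
        # (subject, relation, obj) -> set of (source_doc, extractor)
        self._edges = {}

    def add(self, triple):
        key = (triple.subject, triple.relation, triple.obj)
        self._edges.setdefault(key, set()).add(
            (triple.source_doc, triple.extractor))

    def neighbors(self, entity):
        """Outgoing (relation, object) pairs for an entity."""
        return sorted((rel, obj) for (subj, rel, obj) in self._edges
                      if subj == entity)

    def provenance(self, subject, relation, obj):
        """Which documents and extraction paths produced this edge."""
        return sorted(self._edges.get((subject, relation, obj), set()))


store = GraphStore()
# The same fact extracted by both paths collapses into one edge.
store.add(Triple("Google", "acquire", "DeepMind", "doc-1", "lightweight"))
store.add(Triple("Google", "acquire", "DeepMind", "doc-2", "heavyweight"))
store.add(Triple("Google", "develop", "TensorFlow", "doc-1", "lightweight"))

print(store.neighbors("Google"))
print(store.provenance("Google", "acquire", "DeepMind"))
```

Recording the extractor on each edge keeps incremental, mixed-mode construction auditable: a later pass can find edges that only the lightweight path produced and re-extract their source documents if needed.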
When to Use
Use the lightweight path as the default. Switch to heavyweight extraction for:
- High-stakes documents where extraction errors have significant consequences
- Text with complex implicit relationships that syntactic parsing can’t capture
- Domains where the 6% performance gap matters more than the cost difference
The architecture allows mixing modes within a corpus: cheap extraction for bulk content, expensive extraction for critical documents.
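Mixing modes within a corpus can be sketched as a per-document routing step. Because the architecture has no automatic detection of which documents need heavyweight extraction, this sketch assumes a `critical` flag that a person has already set on each document; the flag name, the document dictionaries, and both functions are hypothetical.

```python
def choose_path(doc):
    """Default to the lightweight path; use heavyweight only when flagged."""
    return "heavyweight" if doc.get("critical", False) else "lightweight"


def plan_corpus(docs):
    """Group document ids by the extraction path they should take."""
    plan = {"lightweight": [], "heavyweight": []}
    for doc in docs:
        plan[choose_path(doc)].append(doc["id"])
    return plan


corpus = [
    {"id": "blog-001"},                        # bulk content, no flag
    {"id": "contract-007", "critical": True},  # manually flagged as high-stakes
    {"id": "wiki-042", "critical": False},
]

print(plan_corpus(corpus))
```

Keeping the lightweight path as the fallback for unflagged documents mirrors the guidance above: heavyweight extraction is an explicit opt-in, not the default.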
Limitations
- Dependency parsing misses context-dependent or implicit relations
- No automatic detection of which documents need heavyweight extraction
- Still requires human review for high-stakes applications
Related: 06-atom—construction-bottleneck-problem, 05-atom—the-94-percent-threshold, 07-molecule—good-enough-classical-nlp