Dual Extraction Architecture

Overview

A knowledge graph construction pipeline that supports both lightweight (dependency parsing) and heavyweight (LLM-based) extraction paths, enabling cost-accuracy tradeoffs based on deployment requirements.

Components

Lightweight Path (Dependency Parsing)

  • Uses spaCy’s dependency parser to extract entity-relation triples from syntactic structure
  • Includes passive voice normalization, phrasal merging, and coreference resolution
  • Domain-agnostic, requires no domain-specific training
  • Achieves 94% of the heavyweight (LLM-based) path’s performance at a fraction of the cost
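
As a rough illustration of the lightweight path, the sketch below pulls subject-relation-object triples out of an already-parsed sentence and rewrites passive constructions into active order. It is not the pipeline's actual code: the `Token` structure and `extract_triples` function are hypothetical, and the tokens are built by hand instead of coming from a parser. The dependency labels (`nsubj`, `dobj`, `nsubjpass`, `agent`, `pobj`) are assumed to follow the scheme used by spaCy's English models.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for one token of a dependency parse."""
    i: int       # position in the sentence
    text: str    # surface form
    lemma: str   # base form, used as the relation name
    dep: str     # dependency label relative to the head
    head: int    # index of the head token (the root points at itself)


def extract_triples(tokens):
    """Return (subject, relation, object) triples in active-voice order."""
    triples = []
    for verb in tokens:
        # Direct dependents of the candidate head, excluding the root's self-link.
        children = [t for t in tokens if t.head == verb.i and t.i != verb.i]
        by_dep = {t.dep: t for t in children}
        if "nsubj" in by_dep and "dobj" in by_dep:
            # Active voice: subject -> verb -> direct object.
            triples.append((by_dep["nsubj"].text, verb.lemma, by_dep["dobj"].text))
        elif "nsubjpass" in by_dep and "agent" in by_dep:
            # Passive voice: the real actor is the object of the "by" phrase.
            agent = by_dep["agent"]
            doer = next(
                (t for t in tokens if t.head == agent.i and t.dep == "pobj"), None
            )
            if doer is not None:
                triples.append((doer.text, verb.lemma, by_dep["nsubjpass"].text))
    return triples


# Hand-built parse of "The report was written by Alice"
passive = [
    Token(0, "The", "the", "det", 1),
    Token(1, "report", "report", "nsubjpass", 3),
    Token(2, "was", "be", "auxpass", 3),
    Token(3, "written", "write", "ROOT", 3),
    Token(4, "by", "by", "agent", 3),
    Token(5, "Alice", "Alice", "pobj", 4),
]

# Hand-built parse of "Alice wrote the report"
active = [
    Token(0, "Alice", "Alice", "nsubj", 1),
    Token(1, "wrote", "write", "ROOT", 1),
    Token(2, "the", "the", "det", 3),
    Token(3, "report", "report", "dobj", 1),
]

passive_triples = extract_triples(passive)
active_triples = extract_triples(active)
```

Both sentences yield the same triple, which is the point of passive voice normalization: the graph stores one relation regardless of how the sentence was phrased. Phrasal merging and coreference resolution are left out of this sketch.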

Heavyweight Path (LLM-Based)

  • Uses GPT-family models with few-shot prompting
  • Higher accuracy on complex, ambiguous text
  • Appropriate for critical document collections where maximum precision matters

Unified Backend

  • Both paths produce entity-relation graphs in the same format
  • Stored in a single graph database for downstream retrieval
  • Enables incremental construction with mixed extraction modes
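
One way to picture the shared format is a single triple record that carries its provenance, so that output from either path lands in the same store. The sketch below is an assumption-laden illustration, not the real backend: `Triple`, `GraphStore`, and the `extractor` field are hypothetical names, and a plain dictionary stands in for the graph database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical shared record emitted by both extraction paths."""
    subject: str
    relation: str
    obj: str
    doc_id: str
    extractor: str  # "lightweight" or "heavyweight"


class GraphStore:
    """In-memory stand-in for the graph database."""

    def __init__(self):
        # (subject, relation, obj) -> set of (doc_id, extractor) provenance pairs
        self.edges = {}

    def add(self, triple):
        key = (triple.subject, triple.relation, triple.obj)
        self.edges.setdefault(key, set()).add((triple.doc_id, triple.extractor))

    def neighbors(self, entity):
        """Outgoing (relation, object) pairs for an entity, sorted for stable output."""
        return sorted((rel, obj) for (subj, rel, obj) in self.edges if subj == entity)


store = GraphStore()
# Incremental construction: triples arrive from both paths over time.
store.add(Triple("Alice", "write", "report", "doc-1", "lightweight"))
store.add(Triple("Alice", "write", "report", "doc-2", "heavyweight"))
store.add(Triple("Alice", "review", "contract", "doc-2", "heavyweight"))
```

Because the edge key ignores which path produced it, the same fact extracted by both paths collapses into one edge with two provenance entries, and retrieval code never needs to know which mode was used.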

When to Use

Use the lightweight path as the default. Switch to heavyweight extraction for:

  • High-stakes documents where extraction errors have significant consequences
  • Text with complex implicit relationships that syntactic parsing can’t capture
  • Domains where the 6% performance gap matters more than the cost difference

The architecture allows mixing modes within a corpus: cheap extraction for bulk content and expensive extraction for critical documents.
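
The default-plus-override policy can be sketched as a small routing step. Everything named here is hypothetical: the `critical` flag is assumed to be assigned by a person or an upstream rule, and the two extractor functions are stubs standing in for the real dependency-parsing and LLM paths.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

TripleT = Tuple[str, str, str]
Extractor = Callable[[str], List[TripleT]]


@dataclass
class Document:
    doc_id: str
    text: str
    critical: bool = False  # hypothetical flag, assumed to be set manually


def choose_mode(doc: Document) -> str:
    """Lightweight by default; heavyweight only for documents flagged as critical."""
    return "heavyweight" if doc.critical else "lightweight"


def extract_corpus(
    docs: List[Document], extractors: Dict[str, Extractor]
) -> Dict[str, Tuple[str, List[TripleT]]]:
    """Run each document through the extractor selected for it."""
    results = {}
    for doc in docs:
        mode = choose_mode(doc)
        results[doc.doc_id] = (mode, extractors[mode](doc.text))
    return results


# Stub extractors; real ones would call the parser or the LLM.
def cheap_stub(text: str) -> List[TripleT]:
    return [("stub", "from", "parser")]


def costly_stub(text: str) -> List[TripleT]:
    return [("stub", "from", "llm")]


docs = [
    Document("faq-1", "bulk content"),
    Document("contract-7", "high-stakes content", critical=True),
]
results = extract_corpus(
    docs, {"lightweight": cheap_stub, "heavyweight": costly_stub}
)
```

Keeping the decision in one small function makes the policy easy to audit and to replace later, for example with a rule based on document source or type.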

Limitations

  • Dependency parsing misses context-dependent or implicit relations
  • No automatic detection of which documents need heavyweight extraction
  • Still requires human review for high-stakes applications

Related: 06-atom—construction-bottleneck-problem, 05-atom—the-94-percent-threshold, 07-molecule—good-enough-classical-nlp