Dual Extraction Architecture

Overview

A knowledge graph construction pipeline that supports both lightweight (dependency parsing) and heavyweight (LLM-based) extraction paths, enabling cost-accuracy tradeoffs based on deployment requirements.

Components

Lightweight Path (Dependency Parsing)

Uses spaCy’s dependency parser to extract entity-relation triples from syntactic structure
Includes passive voice normalization, phrasal merging, and coreference resolution
Domain-agnostic, requires no domain-specific training
Achieves 94% of the heavyweight (LLM-based) path’s performance at a fraction of the cost
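
As a rough illustration of the lightweight path, the sketch below extracts subject-relation-object triples from a dependency parse, merges compound modifiers into their head noun, and rewrites passive clauses into active order. To stay runnable without a model download, it works on a hand-built token list rather than a live spaCy Doc. The `Tok` class and all function names are assumptions made for illustration, not the pipeline's actual code; the dependency labels (`nsubj`, `dobj`, `nsubjpass`, `agent`, `pobj`, `compound`) follow those used by spaCy's English models. Coreference resolution is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tok:
    """Hypothetical stand-in for one parsed token."""
    text: str
    lemma: str
    dep: str   # dependency label
    head: int  # index of the syntactic head; the root points at itself


def children(tokens: List[Tok], head: int, dep: str) -> List[int]:
    """Indices of tokens attached to `head` with dependency label `dep`."""
    return [i for i, t in enumerate(tokens)
            if t.head == head and t.dep == dep and i != head]


def phrase(tokens: List[Tok], idx: int) -> str:
    """Phrasal merging: join compound modifiers with their head noun."""
    parts = [tokens[i].text for i in children(tokens, idx, "compound")]
    return " ".join(parts + [tokens[idx].text])


def extract_triples(tokens: List[Tok]) -> List[Tuple[str, str, str]]:
    triples = []
    for v, tok in enumerate(tokens):
        subj = children(tokens, v, "nsubj")
        obj = children(tokens, v, "dobj")
        if subj and obj:  # active clause: subject -> verb -> object
            triples.append((phrase(tokens, subj[0]), tok.lemma,
                            phrase(tokens, obj[0])))
            continue
        # Passive clause: the grammatical subject is the logical object,
        # and the logical subject hangs off the "by" agent.
        passive_subj = children(tokens, v, "nsubjpass")
        agent = children(tokens, v, "agent")
        if passive_subj and agent:
            logical_subj = children(tokens, agent[0], "pobj")
            if logical_subj:
                triples.append((phrase(tokens, logical_subj[0]), tok.lemma,
                                phrase(tokens, passive_subj[0])))
    return triples


# "Marie Curie discovered polonium"
active = [Tok("Marie", "Marie", "compound", 1), Tok("Curie", "Curie", "nsubj", 2),
          Tok("discovered", "discover", "ROOT", 2),
          Tok("polonium", "polonium", "dobj", 2)]
# "Polonium was discovered by Marie Curie"
passive = [Tok("Polonium", "polonium", "nsubjpass", 2), Tok("was", "be", "auxpass", 2),
           Tok("discovered", "discover", "ROOT", 2), Tok("by", "by", "agent", 2),
           Tok("Marie", "Marie", "compound", 5), Tok("Curie", "Curie", "pobj", 3)]

print(extract_triples(active))
print(extract_triples(passive))
```

Both sentences yield a triple with Marie Curie as subject and "discover" as relation, which is the point of passive voice normalization: word order in the source text should not change the direction of the edge.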

Heavyweight Path (LLM-Based)

Uses GPT-family models with few-shot prompting
Higher accuracy on complex, ambiguous text
Appropriate for critical document collections where maximum precision matters
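
One way the few-shot prompting step might look is sketched below. It only builds the prompt and parses a pipe-delimited reply; the model call itself is deliberately left out. The example sentences, the output format, and the names `FEW_SHOT`, `build_prompt`, and `parse_response` are all assumptions for illustration, not the prompts the pipeline actually uses.

```python
from typing import List, Tuple

# Hypothetical few-shot examples: (input sentence, expected triples).
FEW_SHOT = [
    ("Marie Curie discovered polonium.",
     [("Marie Curie", "discovered", "polonium")]),
    ("The company, founded in Oslo, later acquired its rival.",
     [("the company", "founded in", "Oslo"),
      ("the company", "acquired", "its rival")]),
]


def build_prompt(text: str) -> str:
    """Assemble instructions, worked examples, and the new text to extract from."""
    lines = ["Extract (subject | relation | object) triples, one per line.", ""]
    for sentence, triples in FEW_SHOT:
        lines.append(f"Text: {sentence}")
        lines.extend(f"{s} | {r} | {o}" for s, r, o in triples)
        lines.append("")
    lines.append(f"Text: {text}")
    return "\n".join(lines)


def parse_response(response: str) -> List[Tuple[str, str, str]]:
    """Keep only well-formed 'a | b | c' lines; skip anything else the model emits."""
    triples = []
    for line in response.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append((parts[0], parts[1], parts[2]))
    return triples
```

Parsing defensively matters here because model output is free text: malformed lines are dropped rather than written into the graph.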

Unified Backend

Both paths produce entity-relation graphs in the same format
Stored in a single graph database for downstream retrieval
Enables incremental construction with mixed extraction modes
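
The shared output format can be pictured as a single triple type that both paths emit, with the extraction mode recorded alongside each edge. The sketch below uses an in-memory dictionary as a stand-in for the graph database; `Triple`, `GraphStore`, and the mode labels are hypothetical names, and a real deployment would sit on whatever graph store the pipeline uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str


class GraphStore:
    """In-memory stand-in for the graph database (illustrative only)."""

    def __init__(self) -> None:
        # Each triple maps to the set of extraction modes that produced it.
        self.edges: Dict[Triple, Set[str]] = {}

    def add(self, triple: Triple, mode: str) -> bool:
        """Insert a triple incrementally; returns True if it was new to the graph."""
        is_new = triple not in self.edges
        self.edges.setdefault(triple, set()).add(mode)
        return is_new

    def neighbors(self, entity: str) -> List[Tuple[str, str]]:
        """Outgoing (relation, object) pairs for an entity, for downstream retrieval."""
        return sorted((t.relation, t.obj) for t in self.edges if t.subject == entity)


store = GraphStore()
store.add(Triple("Marie Curie", "discover", "polonium"), mode="dependency")
store.add(Triple("Marie Curie", "discover", "polonium"), mode="llm")  # duplicate edge
store.add(Triple("Marie Curie", "win", "Nobel Prize"), mode="llm")
print(store.neighbors("Marie Curie"))
```

Because the triple is the key, re-extracting a document with the other mode adds provenance to existing edges instead of duplicating them.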

When to Use

Use the lightweight path as the default. Switch to heavyweight extraction for:

High-stakes documents where extraction errors have significant consequences
Text with complex implicit relationships that syntactic parsing can’t capture
Domains where the 6% performance gap matters more than the cost difference

The architecture allows mixing modes within a corpus: cheap extraction for bulk content, expensive extraction for critical documents.
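
A minimal sketch of per-document routing under this policy follows. Since the pipeline has no automatic way to detect which documents need heavyweight extraction, the sketch assumes a manually set `high_stakes` flag; `Document`, `route`, `build_graph`, and the stub extractors are hypothetical names standing in for the real paths.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

TripleT = Tuple[str, str, str]
Extractor = Callable[[str], List[TripleT]]


@dataclass
class Document:
    doc_id: str
    text: str
    high_stakes: bool = False  # set by a human or an upstream rule, not detected


def route(doc: Document) -> str:
    """Lightweight by default; heavyweight only for flagged documents."""
    return "heavyweight" if doc.high_stakes else "lightweight"


def build_graph(docs: List[Document],
                extractors: Dict[str, Extractor]) -> Dict[str, Tuple[str, List[TripleT]]]:
    """Run each document through its routed extractor and record the mode used."""
    graph = {}
    for doc in docs:
        mode = route(doc)
        graph[doc.doc_id] = (mode, extractors[mode](doc.text))
    return graph


# Stub extractors standing in for the dependency-parsing and LLM paths.
extractors = {
    "lightweight": lambda text: [("stub", "parsed", text)],
    "heavyweight": lambda text: [("stub", "prompted", text)],
}
docs = [Document("memo-1", "bulk content"),
        Document("contract-7", "critical clause", high_stakes=True)]
result = build_graph(docs, extractors)
print({doc_id: mode for doc_id, (mode, _) in result.items()})
```

Recording the mode next to each document's triples keeps it possible to re-run bulk content through the heavyweight path later without guessing which extractor produced what.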

Limitations

Dependency parsing misses context-dependent or implicit relations
No automatic detection of which documents need heavyweight extraction
Still requires human review for high-stakes applications
