The EDC Framework: Extract-Define-Canonicalize

A three-stage pattern for building knowledge graphs without predefined schemas. Rather than constraining extraction with upfront ontology design, EDC lets structure emerge from data and then normalizes it.

Overview

Extract → Define → Canonicalize

The framework separates the discovery of knowledge from the imposition of structure, allowing each to be optimized independently.

The Three Stages

1. Extract

Open extraction using few-shot prompting. The LLM generates comprehensive natural-language triples from text without schema constraints. This stage prioritizes coverage and discovery over structural consistency.

Output: A raw “open knowledge graph” (messy, potentially redundant, but capturing what’s actually in the source material).
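The Extract stage can be sketched as two pieces: assembling a few-shot, schema-free prompt, and parsing the model’s free-form reply into triples. The demo text, prompt wording, and sample reply below are hypothetical; any chat-completion API would sit between the two functions.

```python
# Sketch of the Extract stage: build a few-shot prompt for open triple
# extraction, then parse the model's free-form reply into triples.
# FEW_SHOT and the sample reply are illustrative, not from the paper.

import re

FEW_SHOT = """Text: Marie Curie won the Nobel Prize in Physics in 1903.
Triples:
(Marie Curie, won, Nobel Prize in Physics)
(Marie Curie, won prize in year, 1903)"""

def build_extract_prompt(text: str) -> str:
    """Assemble a schema-free extraction prompt from few-shot demos."""
    return (
        "Extract all (subject, relation, object) triples from the text.\n\n"
        f"{FEW_SHOT}\n\nText: {text}\nTriples:"
    )

# Matches lines of the form "(subject, relation, object)".
TRIPLE_RE = re.compile(r"\(([^,]+),\s*([^,]+),\s*([^)]+)\)")

def parse_triples(llm_reply: str) -> list[tuple[str, str, str]]:
    """Turn the model's free-form reply into cleaned triples."""
    return [tuple(part.strip() for part in m.groups())
            for m in TRIPLE_RE.finditer(llm_reply)]

reply = "(Alan Turing, born in, London)\n(Alan Turing, field, computer science)"
print(parse_triples(reply))
```

Because the stage is open-ended, the parser should tolerate redundant or overlapping triples; deduplication is deferred to canonicalization.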

2. Define

Semantic definition of extracted elements. The LLM generates natural language descriptions of the entities, relations, and types it discovered. This creates an intermediate representation that bridges raw extraction and formal structure.

Output: Entity and relation definitions that capture meaning without yet committing to a formal schema.
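One way to sketch the Define stage: for each relation discovered during extraction, prompt the model for a one-sentence definition grounded in the triples that use it. The `Definition` record and the prompt wording are assumptions for illustration, not the framework’s exact format.

```python
# Sketch of the Define stage: store each discovered relation with its
# example usages, and build a prompt asking an LLM to define it.
# The Definition record and prompt text are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Definition:
    term: str                                      # relation (or type) being defined
    text: str = ""                                 # LLM-written definition, filled in later
    examples: list = field(default_factory=list)   # triples that use the term

def build_define_prompt(relation: str, examples: list) -> str:
    """Ask for a definition grounded in the triples that use the relation."""
    shown = "\n".join(f"({s}, {relation}, {o})" for s, _, o in examples)
    return (
        f"Define the relation '{relation}' in one sentence, based on how "
        f"it is used in these triples:\n{shown}\nDefinition:"
    )

triples = [("Alan Turing", "born in", "London")]
print(build_define_prompt("born in", triples))
```

Grounding the prompt in actual usages (rather than the bare relation name) is what lets this stage disambiguate relations whose surface forms are vague.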

3. Canonicalize

Schema normalization through semantic comparison. Definitions are compared via vector similarity to cluster equivalent concepts and align with existing ontologies (or create new schema elements when none exist).

Output: A normalized knowledge graph with consistent typing, deduplicated entities, and either alignment to standard ontologies or coherent novel schema.
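The core mechanic of canonicalization can be sketched as greedy clustering over definition embeddings: each new definition either joins the first existing cluster whose representative is similar enough, or founds a new one. A real pipeline would use a sentence-embedding model; a toy bag-of-words vector stands in here so the mechanics stay visible, and the 0.7 threshold is an arbitrary illustration.

```python
# Sketch of the Canonicalize stage: embed each definition, then greedily
# merge terms whose definition vectors exceed a similarity threshold.
# embed() is a toy stand-in for a real sentence-embedding model.

from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def canonicalize(definitions: dict[str, str], threshold: float = 0.7) -> dict[str, str]:
    """Map each term to a canonical term via greedy similarity clustering."""
    canon: dict[str, str] = {}             # term -> canonical term
    reps: list[tuple[str, Counter]] = []   # (canonical term, its embedding)
    for term, definition in definitions.items():
        vec = embed(definition)
        match = next((r for r, v in reps if cosine(vec, v) >= threshold), None)
        if match is None:                  # no close cluster: term founds its own
            reps.append((term, vec))
            match = term
        canon[term] = match
    return canon

defs = {
    "born in": "the place where a person was born",
    "birthplace": "the place where a person was born",
    "field": "the academic discipline a person works in",
}
print(canonicalize(defs))  # "birthplace" collapses into "born in"; "field" stays separate
```

The same comparison can be run against an existing ontology’s definitions first, falling back to new clusters only when no alignment clears the threshold.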

Why This Pattern Matters

EDC addresses a core tension in knowledge engineering: you can’t design a perfect schema without knowing what’s in the data, but traditional extraction requires a schema to operate.

By separating extraction (what’s there) from canonicalization (how it should be structured), EDC lets each stage use methods suited to it: high-recall, schema-free extraction up front, and precision-oriented semantic alignment afterward.

Limitations

  • Definition quality depends heavily on LLM capability and prompting
  • Canonicalization can merge things that shouldn’t be merged (false positives)
  • Works better for domains well-represented in LLM training data
  • Novel domain-specific distinctions may get collapsed into generic categories

When to Use

  • Open-domain knowledge graph construction
  • Bootstrapping structure from unstructured corpora
  • Situations where the schema genuinely isn’t known upfront
  • Complement to (not replacement for) expert-designed core ontologies

Related: 06-atom—schema-based-vs-schema-free-extraction, 06-molecule—top-down-vs-bottom-up-ontology, 06-atom—emergent-vs-designed-schemas