Label Distillation

A technique for constructing training data from an unlabeled corpus by progressively filtering to samples where labels can be reliably inferred.

The process:

  1. Start with a large unlabeled corpus
  2. Apply high-confidence heuristics to identify candidate samples (e.g., proper-noun detection for NER tasks)
  3. Cross-reference with known lists (external data, frequency patterns) to assign reliable labels
  4. Use the distilled subset to train models, then iterate: better models enable higher-quality distillation (see the sketch after this list)
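
A minimal sketch of steps 1-3, assuming sentences are pre-tokenized lists of strings and that an external list of known entities is available. All names here (`KNOWN_ENTITIES`, `detect_proper_nouns`, `assign_labels`, `distill`) are illustrative, not from any particular library:

```python
from typing import Iterable

KNOWN_ENTITIES = {"London", "Alice", "Acme"}  # hypothetical external entity list

def detect_proper_nouns(tokens: list[str]) -> list[int]:
    """Step 2: high-confidence heuristic -- flag capitalized, non-initial tokens."""
    return [i for i, tok in enumerate(tokens) if i > 0 and tok[:1].isupper()]

def assign_labels(tokens: list[str], candidates: list[int]) -> list[str] | None:
    """Step 3: cross-reference candidates with the known-entity list.
    Return a label sequence only if every candidate can be resolved."""
    labels = ["O"] * len(tokens)
    for i in candidates:
        if tokens[i] in KNOWN_ENTITIES:
            labels[i] = "ENT"
        else:
            return None  # unresolved candidate -> sample is not trustworthy
    return labels

def distill(corpus: Iterable[list[str]]) -> list[tuple[list[str], list[str]]]:
    """Steps 1-3: keep only sentences whose labels can be reliably inferred."""
    distilled = []
    for tokens in corpus:
        candidates = detect_proper_nouns(tokens)
        if not candidates:
            continue  # heuristic found nothing; sentence stays unlabeled for now
        labels = assign_labels(tokens, candidates)
        if labels is not None:
            distilled.append((tokens, labels))
    return distilled
```

The design choice worth noting: a sentence is kept only if every flagged candidate resolves against the known-entity list; anything ambiguous is dropped rather than guessed, which is what keeps the distilled labels trustworthy.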

Example from NER: From ~2,000,000 sentences, proper-noun detection identifies 120,000 candidates. Cross-referencing with known entity lists yields ~80,000 sentences with reliable labels. This distilled corpus trains a model that can label the remainder.
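
A sketch of the iteration in step 4, under the same assumptions as above. The "model" here is deliberately simple, a gazetteer learned from the distilled labels; in practice it would be a real NER model trained on the trusted subset:

```python
def learn_gazetteer(distilled: list[tuple[list[str], list[str]]]) -> set[str]:
    """Collect every token labeled as an entity in the trusted subset."""
    return {tok for tokens, labels in distilled
            for tok, lab in zip(tokens, labels) if lab == "ENT"}

def label_remainder(corpus: list[list[str]], gazetteer: set[str]) -> list[tuple[list[str], list[str]]]:
    """Use the learned model to label sentences the heuristics couldn't handle."""
    labeled = []
    for tokens in corpus:
        labels = ["ENT" if tok in gazetteer else "O" for tok in tokens]
        if "ENT" in labels:
            labeled.append((tokens, labels))
    return labeled
```

Each pass grows the labeled set, which in turn supports a stronger model for the next pass.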

The key insight: you don’t need labels for everything. You need a method for finding samples where labels are trustworthy.

Related: [None yet]