Label Distillation
A technique for constructing training data from an unlabeled corpus by progressively filtering to samples where labels can be reliably inferred.
The process:
- Start with a large unlabeled corpus
- Apply high-confidence heuristics to identify candidate samples (e.g., proper-noun detection for NER tasks)
- Cross-reference with known lists (external data, frequency patterns) to assign reliable labels
- Use the distilled subset to train models, then iterate: better models enable higher-quality distillation (see the sketch after this list)
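
A minimal Python sketch of that loop. The heuristic, the known-label lookup, and the model interface (`as_heuristic`, `as_labeler`, `train`) are placeholder assumptions for illustration, not a specific library API.

```python
from typing import Callable, Iterable, Optional


def distill(
    corpus: Iterable[str],
    heuristic: Callable[[str], bool],               # high-confidence candidate filter
    assign_label: Callable[[str], Optional[str]],   # cross-reference against known lists
) -> list[tuple[str, str]]:
    """Keep only samples whose labels can be inferred reliably."""
    distilled = []
    for sample in corpus:
        if not heuristic(sample):        # step 1: cheap candidate detection
            continue
        label = assign_label(sample)     # step 2: cross-reference known lists
        if label is not None:            # keep only trustworthy labels
            distilled.append((sample, label))
    return distilled


def iterate(corpus, heuristic, assign_label, train, rounds: int = 3):
    """Train on the distilled subset each round; the trained model then serves
    as a sharper heuristic/labeler for the next round (hypothetical interface)."""
    model = None
    for _ in range(rounds):
        data = distill(corpus, heuristic, assign_label)
        model = train(data)
        # a better model enables higher-quality distillation next round
        heuristic, assign_label = model.as_heuristic(), model.as_labeler()
    return model
```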
Example from NER: From ~2,000,000 sentences, proper-noun detection identifies 120,000 candidates. Cross-referencing with known entity lists yields ~80,000 sentences with reliable labels. This distilled corpus trains a model that can label the remainder.
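
To make the NER example concrete, a hypothetical sketch: a crude capitalization heuristic stands in for proper-noun detection, and a tiny hard-coded entity list stands in for the known lists. All names and values here are illustrative assumptions, not the actual pipeline.

```python
KNOWN_ENTITIES = {"Berlin": "LOC", "Siemens": "ORG", "Angela Merkel": "PER"}  # illustrative


def has_proper_noun(sentence: str) -> bool:
    # Candidate detection: any capitalized token after the first word.
    tokens = sentence.split()
    return any(t[0].isupper() for t in tokens[1:])


def label_from_known_lists(sentence: str):
    # Assign labels only if the sentence mentions a known entity; otherwise None.
    hits = [(name, tag) for name, tag in KNOWN_ENTITIES.items() if name in sentence]
    return hits or None


corpus = ["the meeting was postponed", "Siemens opened a plant near Berlin"]
candidates = [s for s in corpus if has_proper_noun(s)]      # ~120,000 of ~2M in the note's numbers
distilled = [(s, labels) for s in candidates
             if (labels := label_from_known_lists(s))]      # ~80,000 with reliable labels
```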
The key insight: you don’t need labels for everything. You need a method for finding samples where labels are trustworthy.
Related: [None yet]