The PII Independence Assumption
A privacy-preserving design principle: assume that personally identifiable information (names, locations) is statistically independent of the research-relevant information in the data.
Under this assumption, you can:
- Redact PII from training data
- Re-insert arbitrary permutations of names and locations before training
- Train models without affecting performance on relevant tasks
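This is pseudonymization by substitution. It supports k-anonymity but does not guarantee it on its own: quasi-identifiers left in the text (job titles, dates, rare events) can still single a person out. It also has an incidental benefit: because no name is tied to particular content during training, models trained this way can decode any name in their vocabulary independently of the surrounding context.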
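The redact and re-insert steps above can be sketched as follows. This is a minimal illustration, not a production pipeline: it assumes the PII terms are already known (in practice they would come from an NER or manual annotation step), and the function names, placeholder format, and surrogate pools are hypothetical.

```python
import random
import re


def redact(text, terms, label):
    """Replace each known PII term with an indexed placeholder such as [NAME_0].

    Returns the redacted text and the number of distinct terms.
    """
    # Longest terms first so "Ann Lee" is matched before "Ann".
    ordered = sorted(set(terms), key=lambda t: (-len(t), t))
    if not ordered:
        return text, 0
    index = {term: i for i, term in enumerate(ordered)}
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    redacted = pattern.sub(lambda m: f"[{label}_{index[m.group(1)]}]", text)
    return redacted, len(ordered)


def reinsert(text, label, count, pool, rng):
    """Fill indexed placeholders with distinct surrogates drawn from a pool.

    The same placeholder always gets the same surrogate, so references stay
    consistent within a document. Requires len(pool) >= count.
    """
    surrogates = rng.sample(pool, count)
    pattern = re.compile(r"\[" + re.escape(label) + r"_(\d+)\]")
    return pattern.sub(lambda m: surrogates[int(m.group(1))], text)


if __name__ == "__main__":
    rng = random.Random()  # seed it for reproducible training sets
    text = "Alice met Bob in Paris. Alice liked Paris."
    text, n_names = redact(text, ["Alice", "Bob"], "NAME")
    text, n_locs = redact(text, ["Paris"], "LOC")
    text = reinsert(text, "NAME", n_names, ["Dana", "Eli", "Farah"], rng)
    text = reinsert(text, "LOC", n_locs, ["Lima", "Oslo"], rng)
    print(text)
```

Indexed placeholders are one design choice among several: they keep repeated mentions of the same person consistent after substitution, which a single shared placeholder would lose.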
When this holds: Market research transcripts, survey responses, customer feedback, and other cases where what people say matters more than who said it.
When this breaks: Cases where identity correlates with content. Medical records where patient history matters. Legal documents where party identity is relevant. Social network analysis.
The assumption is a design choice, not a universal truth. It works when you can defend the independence claim for your specific domain.
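One way to probe the independence claim for a given dataset, sketched under assumptions: compute a Pearson chi-squared statistic between a PII-derived attribute (for example, region) and a research-relevant label. The function name and the example attribute values are hypothetical, and a small statistic only fails to reject independence; it does not prove it.

```python
from collections import Counter


def chi_squared(pairs):
    """Pearson chi-squared statistic and degrees of freedom.

    `pairs` is an iterable of (pii_value, label) tuples, one per record.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    rows = Counter(pii for pii, _ in pairs)
    cols = Counter(label for _, label in pairs)
    stat = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            # Expected count if the PII attribute and the label were independent.
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat, dof


if __name__ == "__main__":
    # Hypothetical records: region fully determines sentiment here.
    records = [("north", "pos")] * 20 + [("south", "neg")] * 20
    print(chi_squared(records))
```

The statistic is compared against the chi-squared critical value for the given degrees of freedom (3.84 for one degree of freedom at the 5% level). A value well above it suggests the PII attribute carries signal, which is the situation where the assumption breaks.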
Related: [None yet]