The PII Independence Assumption

A privacy-preserving design principle: assume that personally identifiable information (names, locations) is statistically independent of the research-relevant information in the data.

Under this assumption, you can:

  1. Redact PII from training data
  2. Re-insert arbitrary permutations of names and locations before training
  3. Train models without affecting performance on relevant tasks
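
The steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes an upstream redaction step has already replaced PII with indexed placeholders such as [NAME_1] and [LOCATION_1] (a hypothetical format), and the function name reinsert_pii and the surrogate pools are made up for the example.

```python
import random
import re

# Hypothetical placeholder format left behind by a redaction step, e.g. [NAME_1].
PLACEHOLDER = re.compile(r"\[(NAME|LOCATION)_(\d+)\]")


def reinsert_pii(redacted_text, pools, rng):
    """Replace each placeholder with a randomly chosen surrogate.

    The same placeholder always maps to the same surrogate within one text,
    so repeated mentions of one person stay consistent.
    """
    assigned = {}

    def fill(match):
        key = match.group(0)  # full placeholder, e.g. "[NAME_1]"
        if key not in assigned:
            # group(1) is the PII type, which selects the surrogate pool
            assigned[key] = rng.choice(pools[match.group(1)])
        return assigned[key]

    return PLACEHOLDER.sub(fill, redacted_text)


# Illustrative surrogate pools; in practice these would be much larger.
pools = {
    "NAME": ["Alice", "Bob", "Chen"],
    "LOCATION": ["Lyon", "Osaka", "Quito"],
}

redacted = (
    "[NAME_1] from [LOCATION_1] said the app was slow. "
    "[NAME_1] would still recommend it."
)
augmented = reinsert_pii(redacted, pools, random.Random(0))
```

Calling reinsert_pii with a different seed on each training pass gives a different assignment of names and locations to the same underlying content, which is what the independence assumption licenses.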

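This acts as pseudonymization by substitution: the training text no longer ties what was said to who said it. It does not by itself provide k-anonymity in the formal sense, which requires each record to be indistinguishable from at least k-1 others on its quasi-identifiers and is usually achieved through generalization or suppression. It also has a side benefit: because names appear in arbitrary contexts during training, models trained this way can decode any name in their vocabulary in a context-independent manner.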

When this holds: market research transcripts, survey responses, customer feedback, and other cases where what people say matters more than who said it.

When this breaks: cases where identity correlates with content, such as medical records where patient history matters, legal documents where party identity is relevant, and social network analysis.

The assumption is a design choice, not a universal truth. It works when you can defend the independence claim for your specific domain.
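
One way to start defending the independence claim is to measure how much an identity attribute tells you about a label of interest. The sketch below computes empirical mutual information between two categorical variables. The choice of mutual information, the function name, and the toy data are assumptions made for illustration; the note itself does not prescribe a test.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Empirical mutual information, in bits, between two categorical variables.

    pairs is a list of (identity_value, label) tuples. A value of 0 means the
    two are independent in this sample; larger values mean identity carries
    information about the label.
    """
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # sum over observed cells of p(a,b) * log2(p(a,b) / (p(a) * p(b)))
    return sum(
        (count / n) * log2(count * n / (left[a] * right[b]))
        for (a, b), count in joint.items()
    )


# Toy data: location vs. sentiment label.
independent = [("Lyon", "pos"), ("Lyon", "neg"), ("Osaka", "pos"), ("Osaka", "neg")]
dependent = [("Lyon", "pos"), ("Lyon", "pos"), ("Osaka", "neg"), ("Osaka", "neg")]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

A value near zero is consistent with the assumption but does not prove it: the empirical estimate is biased upward on small samples, and it only covers the attribute and label you chose to test.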

Related: [None yet]