Model Collapse

A phenomenon where model training over-relies on synthetic data produced by earlier models, causing regions of the original data distribution to disappear from the new model's outputs. Each generation of model-generated training data loses some of the variance present in the original distribution, and the losses compound into progressive homogenization.
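The variance loss shows up even in a toy setting: repeatedly fit a distribution to data, then train the next "generation" only on samples drawn from that fit. Below is a minimal sketch in plain NumPy (illustrative sample sizes and seed, not any particular training pipeline). Because each refit estimates the spread with finite-sample error, the fitted standard deviation performs a downward-biased random walk and eventually collapses toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for gen in range(1, 201):
    # "Train" on the current corpus: fit a Gaussian by maximum
    # likelihood (sample mean and standard deviation).
    mu, sigma = data.mean(), data.std()
    # The next generation sees only synthetic samples from the fit.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.4f}")
```

Once spread is lost, nothing in the loop can restore it; fresh real data is the only way back.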

Beyond threatening overall model robustness, model collapse amplifies any homogenization already present in the source model used to generate the synthetic training data. The tails of the distribution, where interesting edge cases and minority perspectives tend to live, erode first.
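Tail erosion is starkest in the discrete case: once a rare category draws zero samples in some generation, a model refit by frequency counting assigns it probability zero, and it can never reappear. A minimal sketch, with made-up numbers standing in for a long-tailed vocabulary:

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy "vocabulary": a few common tokens plus a long tail of rare
# ones, standing in for edge cases and minority perspectives.
probs = np.array([0.4, 0.3, 0.15, 0.08, 0.04, 0.02, 0.008, 0.002])

for gen in range(1, 11):
    # Draw a finite synthetic corpus from the current model...
    corpus = rng.choice(len(probs), size=500, p=probs)
    # ...and refit by plain frequency counting (no smoothing).
    counts = np.bincount(corpus, minlength=len(probs))
    probs = counts / counts.sum()
    alive = int((probs > 0).sum())
    print(f"generation {gen:2d}: {alive}/{len(probs)} tokens survive")
```

The rarest tokens are the likeliest to draw zero counts in any given round, so the tail collapses first while the head of the distribution survives largely intact.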

This creates a feedback loop: as more content online becomes AI-generated, and that content gets scraped for future training, the information ecosystem becomes increasingly self-referential.

Related: 05-atom--algorithmic-monoculture-definition