Softmax Bottleneck
The softmax function used in language model output layers limits the expressiveness of output distributions. When the true distribution over next tokens is more complex than the model can represent, generation quality suffers.
The Mechanism
Softmax converts logits to probabilities, but the logits themselves come from a linear projection of a d-dimensional hidden state onto the output embeddings. As a result, the matrix of log-probabilities across all contexts is log-linear with rank at most d plus a small constant, while the true conditional distributions of natural language may require a much higher rank. When the true data distribution exceeds this rank ceiling, softmax becomes a bottleneck, forcing the model to approximate what it cannot exactly express.
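The rank constraint can be demonstrated numerically. This is a minimal sketch (not any particular model's code): random hidden states and output embeddings stand in for a trained network, and the dimensions are arbitrary. The log-probability matrix over N contexts and V vocabulary items has rank bounded by d + 1, no matter how large N and V are.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, V = 50, 8, 100          # contexts, hidden dim, vocab size (illustrative)

H = rng.normal(size=(N, d))   # one hidden state per context
W = rng.normal(size=(V, d))   # output embedding matrix

logits = H @ W.T                                                   # (N, V)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# rank(H @ W.T) <= d, and the per-row log-normalizer is a rank-1
# correction, so the log-probability matrix has rank at most d + 1.
rank = np.linalg.matrix_rank(log_probs)
print(f"rank {rank} <= d + 1 = {d + 1}, despite {N}x{V} entries")
```

With d = 8, only 9 of the 50 rows can be linearly independent; every other context's distribution is forced to be a linear combination of them in log space.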
Why It Matters
Multimodal next-token distributions (where several very different continuations are all valid) are particularly affected. A rank-limited model may collapse probability mass onto fewer tokens than warranted, or spread it too thinly across the modes.
Mitigation Approaches
- Mixture of Softmax (MoS): combine several softmax distributions with input-dependent weights; the mixture is no longer log-linear, so it escapes the rank ceiling
- Larger embedding dimensions: raise the rank bound directly, at the cost of more parameters and compute
- Alternative output heads: different architectures for specific output types
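The Mixture of Softmax idea can be sketched as follows. This is a simplified illustration with hypothetical parameters (the projections `P`, gate weights `G`, and shapes are assumptions, not a specific published implementation): K softmax distributions are computed from K projections of the hidden state and blended by a learned gate.

```python
import numpy as np

rng = np.random.default_rng(1)
d, V, K = 8, 100, 3           # hidden dim, vocab size, mixture components

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical parameters for illustration only.
P = rng.normal(size=(K, d, d))    # per-component hidden-state projections
W = rng.normal(size=(V, d))       # shared output embeddings
G = rng.normal(size=(K, d))       # gating weights

def mixture_of_softmax(h):
    """Blend K softmax distributions; the result is not log-linear in h."""
    pi = softmax(G @ h)                          # (K,) mixture weights
    comps = softmax((P @ h) @ W.T, axis=-1)      # (K, V) component dists
    return pi @ comps                            # (V,) final distribution

p = mixture_of_softmax(rng.normal(size=d))
print(p.shape, p.sum())
```

Because the mixing happens after each component's normalization, the log of the mixture is not a linear function of the hidden state, which is what lifts the rank restriction of a single softmax head.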
This is one of several architectural constraints that shape what models can and cannot do, independent of training data quality.
Related: 05-molecule—attention-mechanism-concept, 05-atom—uniform-confidence-problem