The Sequential Bottleneck Problem
Recurrent neural networks process sequences one step at a time: each position depends on the hidden state from the previous position. This sequential dependency fundamentally limits parallelization.
You can’t compute step 100 until you’ve computed steps 1 through 99. At longer sequence lengths this becomes a critical constraint: memory limits batching across examples, and training time grows linearly with sequence length.
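The dependency is easiest to see in code. Below is a minimal sketch of a vanilla RNN forward pass (NumPy, tanh cell; the shapes, weight names, and initialization are illustrative, not any particular library's API): the hidden state at step t cannot be computed until step t-1 has finished, so the loop over the sequence is irreducibly sequential.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Vanilla RNN forward pass: h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh + b_h).

    The loop is inherently sequential: h_t cannot be computed until
    h_{t-1} exists, so the T iterations cannot run in parallel.
    """
    T = x.shape[0]
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)
    states = []
    for t in range(T):                      # T sequential steps, no way around it
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)                 # (T, hidden_size)

# Illustrative shapes: 100 tokens, 16-dim inputs, 32-dim hidden state.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 16))
W_xh = rng.normal(size=(16, 32)) * 0.1
W_hh = rng.normal(size=(32, 32)) * 0.1
b_h = np.zeros(32)
states = rnn_forward(x, W_xh, W_hh, b_h)    # step 99 depends on steps 0-98
```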
This observation motivated the Transformer’s core design decision: replace sequential recurrence with parallel attention. Self-attention connects all positions with a constant number of sequential operations, regardless of sequence length.
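Contrast that with scaled dot-product self-attention over the same sequence. In this sketch (again NumPy with illustrative weight names and shapes, not the full multi-head Transformer layer), every pairwise interaction comes out of a single (T, T) matrix multiply, so there is no loop over positions and the sequential depth stays constant as T grows.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a whole sequence at once.

    All position-to-position scores are produced by one (T, T) matrix
    multiply, so no position waits on any other.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # each (T, d_k) or (T, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T): all pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T, d_v)

rng = np.random.default_rng(0)
T, d_model, d_k = 100, 32, 32
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = [rng.normal(size=(d_model, d_k)) * 0.1 for _ in range(3)]
out = self_attention(X, W_q, W_k, W_v)               # position 99 waits on nothing
```

Because the score matrix is one big matrix multiply, the entire sequence maps onto parallel hardware in a single step; the price is the T-by-T memory and compute discussed next.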
The tradeoff: attention scales quadratically with sequence length (every position attends to every other position), while recurrence scales linearly. For typical sentence lengths, the parallelization gains dominate; for very long sequences, the quadratic cost becomes the limiting factor, hence ongoing research into sparse attention, sliding-window attention, and other efficiency mechanisms.
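A rough back-of-envelope comparison makes the crossover concrete (a sketch, not a benchmark: d = 512 and the sequence lengths are illustrative, and constant factors are ignored). Per layer, attention does roughly n²·d work against recurrence's n·d², so recurrence becomes cheaper per layer once n exceeds d, but it still needs n sequential steps where attention needs a constant number.

```python
# Back-of-envelope per-layer operation counts (constants ignored):
# attention forms n^2 pairwise scores of width d; a recurrent layer
# performs n steps of d x d matrix-vector work. d = 512 is illustrative.
d = 512
for n in (128, 512, 2048, 8192):
    attention_ops = n * n * d      # quadratic in sequence length
    recurrent_ops = n * d * d      # linear in sequence length
    print(f"n={n:5d}  attention~{attention_ops:.2e}  recurrence~{recurrent_ops:.2e}")
```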
Related: 05-atom—attention-path-length-observation, 05-atom—context-window-limitations, 05-molecule—attention-vs-recurrence-comparison