Attention vs. Recurrence
The Two Paradigms
Recurrence (RNNs, LSTMs, GRUs): Process sequences step by step. Each position depends on the hidden state from the previous position. Information accumulates through sequential state updates.
Attention (Transformers): Process sequences in parallel. Each position directly attends to all other positions. Information routes through learned compatibility weights.
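A minimal sketch in NumPy (random, untrained weights) makes the contrast concrete: the recurrent loop has to execute its n steps in order, while the attention block is a handful of matrix products over all positions at once.

```python
# Sketch only: random weights, no training, just the shape of the computation.
import numpy as np

n, d = 6, 8                       # sequence length, model dimension
x = np.random.randn(n, d)         # input sequence: one vector per position

# Recurrence: a plain RNN cell. Each step waits on the previous hidden state.
W_h = np.random.randn(d, d) * 0.1
W_x = np.random.randn(d, d) * 0.1
h = np.zeros(d)
for t in range(n):                # n strictly sequential updates
    h = np.tanh(W_h @ h + W_x @ x[t])

# Attention: every position attends to every other in one batched computation.
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                         # (n, n) compatibility scores
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights = weights / weights.sum(-1, keepdims=True)    # softmax per row
out = weights @ V                                     # all positions at once
```

The (n, n) `weights` matrix is the "learned compatibility weights" mentioned above: row i says how much position i draws from every other position in a single operation.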
Key Differences
| Dimension | Recurrence | Attention |
|---|---|---|
| Sequential operations | O(n) — must wait for prior steps | O(1) — all positions computed together |
| Path length between positions | O(n) — distant positions require many hops | O(1) — direct connection in one operation |
| Memory complexity | O(1) per step — fixed hidden state | O(n²) — every position attends to every other |
| Locality bias | Built-in — nearby positions naturally connected | None — must be learned from data |
| Long-range dependencies | Difficult — gradients vanish over long paths | Easier structurally — but still challenging in practice |
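A rough sketch of what the O(n²) memory row means in practice, counting one float32 attention-weight matrix per head; the sequence lengths are illustrative, not tied to any particular model.

```python
# Memory for one (n, n) float32 attention-weight matrix per head, versus an
# RNN hidden state whose size does not depend on n. Illustrative numbers only.
for n in (1_024, 8_192, 128_000):
    attn_mib = n * n * 4 / 2**20          # 4 bytes per float32 entry
    print(f"n={n:>7,}: attention weights ~ {attn_mib:,.0f} MiB; "
          "RNN hidden state: constant in n")
```

The quadratic term is also why path length is O(1): the same n-by-n matrix that costs memory gives every pair of positions a direct connection, where an RNN needs |i - j| state updates to move information between positions i and j.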
When Each Applies
Recurrence still makes sense when:
- Sequence length is very long and quadratic attention is prohibitive
- Online processing is required (streaming, real-time)
- Strong sequential inductive biases are desired
- Memory efficiency matters more than training speed
Attention dominates when:
- Parallel training on GPUs/TPUs is available
- Sequences fit within practical context windows
- Long-range dependencies are important
- Scale (more parameters, more data) is the strategy
The Tradeoff That Mattered
The Transformer bet was: lose the sequential inductive bias, gain massive parallelization. At scale, the gains from parallelization outweigh what is lost.
This isn't universally true: for small datasets or domains where locality matters, the inductive bias of recurrence can still help. But for language modeling at scale, the bet paid off decisively.
Practical Implications
Modern LLMs are entirely attention-based. This means:
- Context windows are hard limits (quadratic scaling)
- The model doesn’t inherently “remember” in a human sense
- Information not in the context window doesn’t exist to the model
- Position in context matters (models learn position-dependent patterns)
- No hidden state carries forward between calls
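A structural sketch of what statelessness means for application code. `generate`, `CONTEXT_WINDOW`, and `continue_conversation` are hypothetical names, not any real API: the point is that the caller, not the model, carries memory forward by resending tokens, and anything truncated out of the window is invisible to the model.

```python
# Sketch under assumed names; `generate` stands in for any decoder-only LLM call.
from typing import List

CONTEXT_WINDOW = 8_192  # hypothetical hard limit, in tokens

def generate(token_ids: List[int]) -> List[int]:
    """Placeholder for a model call: prompt tokens in, continuation tokens out."""
    raise NotImplementedError  # not a real model

def continue_conversation(history: List[int], new_turn: List[int]) -> List[int]:
    # No hidden state survives between calls, so every call must carry the
    # entire history the model is supposed to "remember".
    prompt = history + new_turn
    # Tokens beyond the window are simply dropped; to the model they never existed.
    prompt = prompt[-CONTEXT_WINDOW:]
    return generate(prompt)
```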
Related: 05-molecule—attention-mechanism-concept, 05-atom—context-window-limitations