Attention vs. Recurrence

The Two Paradigms

Recurrence (RNNs, LSTMs, GRUs): Process sequences step-by-step. Each position depends on the hidden state from the previous position. Information accumulates through sequential state updates.
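
A minimal sketch of this step-by-step update, assuming a vanilla (Elman-style) RNN in NumPy; the function and weight names (rnn_forward, W_x, W_h, b) are illustrative, not from any particular library:

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b):
    """Sequential processing: each h depends on the previous h,
    so the time loop cannot be parallelized across positions."""
    h = np.zeros(W_h.shape[0])             # fixed-size hidden state
    states = []
    for x_t in x_seq:                      # one step per position: O(n) sequential ops
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)
```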

Attention (Transformers): Process sequences in parallel. Each position directly attends to all other positions. Information routes through learned compatibility weights.
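
By contrast, a minimal sketch of scaled dot-product attention, the core Transformer operation; Q, K, and V are assumed to be precomputed query, key, and value projections of shape (n, d_k):

```python
import numpy as np

def attention(Q, K, V):
    """Parallel processing: every position attends to every other
    in one matrix product; no step waits on a previous step."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (n, n) compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # each output mixes all values
```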

Key Differences

| Dimension | Recurrence | Attention |
| --- | --- | --- |
| Sequential operations | O(n): must wait for prior steps | O(1): all positions computed together |
| Path length between positions | O(n): distant positions require many hops | O(1): direct connection in one operation |
| Memory complexity | O(1) per step: fixed hidden state | O(n²): every position attends to every other |
| Locality bias | Built-in: nearby positions naturally connected | None: must be learned from data |
| Long-range dependencies | Difficult: gradients vanish over long paths | Easier structurally, but still challenging in practice |
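
The memory row can be made concrete: recurrence keeps a fixed-size hidden vector regardless of sequence length, while attention materializes an n × n weight matrix. A quick sketch with illustrative sizes, ignoring memory-efficient variants such as FlashAttention that avoid storing the full matrix:

```python
import numpy as np

n, d = 4096, 512                 # sequence length, hidden size (illustrative)
hidden_state = np.zeros(d)       # recurrence: O(1) state per step
attn_weights = np.zeros((n, n))  # attention: O(n^2) weights
print(hidden_state.nbytes)       # 4096 bytes (~4 KB at float64)
print(attn_weights.nbytes)       # 134217728 bytes (128 MB at float64)
```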

When Each Applies

Recurrence still makes sense when:

  • Sequence length is very long and quadratic attention is prohibitive
  • Online processing is required (streaming, real-time)
  • Strong sequential inductive biases are desired
  • Memory efficiency matters more than training speed

Attention dominates when:

  • Parallel training on GPUs/TPUs is available
  • Sequences fit within practical context windows
  • Long-range dependencies are important
  • Scale (more parameters, more data) is the strategy

The Tradeoff That Mattered

The Transformer's bet was to trade the sequential inductive bias for massive parallelization. At scale, the gains from parallelization outweigh what's lost.

This isn't universally true: for small datasets or specific domains where locality matters, the inductive bias of recurrence can still help. But for language modeling at scale, the bet paid off decisively.

Practical Implications

Modern LLMs are entirely attention-based. This means:

  • Context windows are hard limits (quadratic scaling; see the sketch after this list)
  • The model doesn’t inherently “remember” in a human sense
  • Information not in the context window doesn’t exist to the model
  • Position in context matters (models learn position-dependent patterns)
  • No hidden state carries forward between calls
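
Rough arithmetic for the quadratic scaling behind the first bullet: the two matrix products in attention each cost about 2·n²·d multiply-adds per head, so doubling the context roughly quadruples compute (and the size of the score matrix). A back-of-the-envelope sketch with illustrative numbers:

```python
def attn_flops(n, d_head):
    """Approximate FLOPs for one attention head:
    Q @ K^T and weights @ V each take ~2 * n^2 * d_head operations."""
    return 4 * n * n * d_head

print(attn_flops(4096, 64) / attn_flops(2048, 64))  # 4.0: 2x context -> 4x compute
```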

Related: 05-molecule—attention-mechanism-concept, 05-atom—context-window-limitations