Attention vs. Recurrence
The Two Paradigms
Recurrence (RNNs, LSTMs, GRUs): Process sequences step by step. Each position depends on the hidden state from the previous position. Information accumulates through sequential state updates.
Attention (Transformers): Process sequences in parallel. Each position directly attends to all other positions. Information routes through learned compatibility weights.
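A minimal sketch in NumPy (random, untrained weights) makes the contrast concrete: the recurrent loop has to execute its n steps in order, while the attention block is a handful of matrix products over all positions at once.

```python
# Sketch only: random weights, no training, just the shape of the computation.
import numpy as np

n, d = 6, 8                       # sequence length, model dimension
x = np.random.randn(n, d)         # input sequence: one vector per position

# Recurrence: a plain RNN cell. Each step waits on the previous hidden state.
W_h = np.random.randn(d, d) * 0.1
W_x = np.random.randn(d, d) * 0.1
h = np.zeros(d)
for t in range(n):                # n strictly sequential updates
    h = np.tanh(W_h @ h + W_x @ x[t])

# Attention: every position attends to every other in one batched computation.
W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)                         # (n, n) compatibility scores
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights = weights / weights.sum(-1, keepdims=True)    # softmax per row
out = weights @ V                                     # all positions at once
```

The (n, n) `weights` matrix is the "learned compatibility weights" mentioned above: row i says how much position i draws from every other position in a single operation.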
Key Differences
| Dimension | Recurrence | Attention |
|---|---|---|
| Sequential operations | O(n) — must wait for prior steps | O(1) — all positions computed together |
| Path length between positions | O(n) — distant positions require many hops | O(1) — direct connection in one operation |
| Memory complexity | O(1) per step — fixed hidden state | O(n²) — every position attends to every other |
| Locality bias | Built-in — nearby positions naturally connected | None — must be learned from data |
| Long-range dependencies | Difficult — gradients vanish over long paths | Easier structurally — but still challenging in practice |
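A rough sketch of what the O(n²) memory row means in practice, counting one float32 attention-weight matrix per head; the sequence lengths are illustrative, not tied to any particular model.

```python
# Memory for one (n, n) float32 attention-weight matrix per head, versus an
# RNN hidden state whose size does not depend on n. Illustrative numbers only.
for n in (1_024, 8_192, 128_000):
    attn_mib = n * n * 4 / 2**20          # 4 bytes per float32 entry
    print(f"n={n:>7,}: attention weights ~ {attn_mib:,.0f} MiB; "
          "RNN hidden state: constant in n")
```

The quadratic term is also why path length is O(1): the same n-by-n matrix that costs memory gives every pair of positions a direct connection, where an RNN needs |i - j| state updates to move information between positions i and j.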
When Each Applies
Recurrence still makes sense when:
- Sequence length is very long and quadratic attention is prohibitive
- Online processing is required (streaming, real-time)
- Strong sequential inductive biases are desired
- Memory efficiency matters more than training speed
Attention dominates when:
- Parallel training on GPUs/TPUs is available
- Sequences fit within practical context windows
- Long-range dependencies are important
- Scale (more parameters, more data) is the strategy
The Tradeoff That Mattered
The Transformer bet was: lose the sequential inductive bias, gain massive parallelization. At scale, the gains from parallelization outweigh what is lost.
This isn't universally true: for small datasets or domains where locality matters, the inductive bias of recurrence can still help. But for language modeling at scale, the bet paid off decisively.
Practical Implications
Modern LLMs are entirely attention-based. This means:
- Context windows are hard limits (quadratic scaling)
- The model doesn’t inherently “remember” in a human sense
- Information not in the context window doesn’t exist to the model
- Position in context matters (models learn position-dependent patterns)
- No hidden state carries forward between calls
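A structural sketch of what statelessness means for application code. `generate`, `CONTEXT_WINDOW`, and `continue_conversation` are hypothetical names, not any real API: the point is that the caller, not the model, carries memory forward by resending tokens, and anything truncated out of the window is invisible to the model.

```python
# Sketch under assumed names; `generate` stands in for any decoder-only LLM call.
from typing import List

CONTEXT_WINDOW = 8_192  # hypothetical hard limit, in tokens

def generate(token_ids: List[int]) -> List[int]:
    """Placeholder for a model call: prompt tokens in, continuation tokens out."""
    raise NotImplementedError  # not a real model

def continue_conversation(history: List[int], new_turn: List[int]) -> List[int]:
    # No hidden state survives between calls, so every call must carry the
    # entire history the model is supposed to "remember".
    prompt = history + new_turn
    # Tokens beyond the window are simply dropped; to the model they never existed.
    prompt = prompt[-CONTEXT_WINDOW:]
    return generate(prompt)
```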
Related: 05-molecule—attention-mechanism-concept, 05-atom—context-window-limitations