The Attention Mechanism

What It Is

A way for neural networks to dynamically focus on different parts of their input based on learned relevance, rather than processing all information equally or in fixed sequential order.

At its core: compute compatibility scores between a query and a set of keys, then use those scores to weight the corresponding values into an output. The result is a context-aware representation where each position incorporates information from other positions proportional to their computed relevance.
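
As a minimal sketch of that core operation for a single query (plain numpy; the function name, shapes, and toy data are illustrative, not from any particular library):

    import numpy as np

    def attend(query, keys, values):
        """One query attends over a set of key/value pairs."""
        scores = keys @ query                    # compatibility of the query with each key
        weights = np.exp(scores - scores.max())  # softmax: positive weights...
        weights /= weights.sum()                 # ...that sum to 1
        return weights @ values                  # weighted blend of the values

    rng = np.random.default_rng(0)               # toy data: 4 positions, 8-dim vectors
    q = rng.normal(size=8)
    K = rng.normal(size=(4, 8))
    V = rng.normal(size=(4, 8))
    context = attend(q, K, V)                    # context vector blending the 4 values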

Why It Matters

Attention is the fundamental operation underlying all modern large language models. When people talk about “context windows,” “how the model uses your prompt,” or “what the model pays attention to,” they’re describing consequences of attention mechanisms.

Understanding attention clarifies:

  • Why models can handle variable-length inputs
  • Why context window limits exist (attention is quadratic in sequence length)
  • What “in-context learning” actually involves (patterns in attention weights)
  • Why some long-range dependencies work better than others

How It Works

Single-head attention:

  1. Transform inputs into queries (Q), keys (K), and values (V)
  2. Compute attention scores: dot product of Q with all K, divided by √d_k (the key dimension)
  3. Apply softmax to get weights that sum to 1
  4. Multiply weights by values, sum to get output
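
A minimal sketch of these four steps, assuming numpy; the projection matrices, shapes, and toy data are illustrative:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def single_head_attention(X, W_q, W_k, W_v):
        Q = X @ W_q                        # step 1: project inputs to queries,
        K = X @ W_k                        #         keys,
        V = X @ W_v                        #         and values
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)    # step 2: scaled dot-product scores
        weights = softmax(scores)          # step 3: each row sums to 1
        return weights @ V                 # step 4: weighted sum of values

    rng = np.random.default_rng(0)         # toy sizes: 5 tokens, model dim 16, head dim 8
    X = rng.normal(size=(5, 16))
    W_q, W_k, W_v = (rng.normal(size=(16, 8)) * 0.1 for _ in range(3))
    out = single_head_attention(X, W_q, W_k, W_v)   # shape (5, 8)

In causal (decoder-style) models, a mask is also added to the scores so each position can only attend to earlier positions; the structure above is otherwise unchanged.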

Multi-head attention:

  • Run the above in parallel with different learned projections
  • Each “head” can specialize in different relationship types
  • Concatenate results, project back to original dimension
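
Continuing the sketch above (this reuses single_head_attention, rng, and X from the previous snippet; the head count and dimensions are again illustrative):

    def multi_head_attention(X, heads, W_o):
        # heads: list of (W_q, W_k, W_v) projection triples, one per head
        outputs = [single_head_attention(X, W_q, W_k, W_v)
                   for W_q, W_k, W_v in heads]      # each head attends independently
        concat = np.concatenate(outputs, axis=-1)   # stitch head outputs together
        return concat @ W_o                         # project back to the model dimension

    heads = [tuple(rng.normal(size=(16, 8)) * 0.1 for _ in range(3)) for _ in range(2)]
    W_o = rng.normal(size=(16, 16)) * 0.1
    out = multi_head_attention(X, heads, W_o)       # shape (5, 16)

In practice the heads are computed as one batched matrix multiplication rather than a Python loop; the loop here just makes the per-head independence explicit.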

Positional encoding:

  • Attention is inherently order-agnostic (treats inputs as sets)
  • Position information must be explicitly added to embeddings
  • Sinusoidal functions or learned embeddings provide this
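
A minimal sketch of the sinusoidal variant (following the standard sin/cos construction; dimensions are illustrative and d_model is assumed even):

    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices
        angles = positions / (10000 ** (dims / d_model))
        enc = np.zeros((seq_len, d_model))
        enc[:, 0::2] = np.sin(angles)                  # even dims get sine
        enc[:, 1::2] = np.cos(angles)                  # odd dims get cosine
        return enc

    # Added to the token embeddings before the first attention layer, e.g.:
    # X = token_embeddings + sinusoidal_positions(seq_len, d_model)

Learned positional embeddings replace this function with a trainable lookup table of the same shape.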

Implications

The architectural decision to rely entirely on attention (no recurrence, no convolution) unlocked massive parallelization during training. All positions in a sequence can be processed simultaneously rather than each waiting on the one before it.

This came with tradeoffs: attention scales quadratically with sequence length, locality isn’t privileged, and the model must learn from scratch relationships that other architectures encode structurally.
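
A back-of-the-envelope illustration of the quadratic cost (plain arithmetic; the numbers are chosen only to show the trend):

    for n in (1_000, 2_000, 4_000, 8_000):
        print(f"{n} tokens -> {n * n:,} attention scores per head, per layer")
    # Doubling the sequence length quadruples the score matrix.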

The attention mechanism provides limited but real interpretability: we can visualize which tokens influenced which, even if we can’t always explain why.

Related: 05-atom—sequential-bottleneck-problem, 05-atom—attention-path-length-observation, 05-atom—attention-vs-understanding-distinction