Attention Heads Learn Specialized Tasks

Different attention heads in the same layer develop distinct specializations. Some track syntactic dependencies (connecting verbs to their objects across intervening phrases). Others handle anaphora resolution (connecting pronouns to their referents). Still others seem to follow positional patterns.

The original Transformer paper’s appendix visualizations show this clearly: one head consistently attends from a verb to distant words that complete its phrase; another sharply attends from pronouns back to the nouns they reference.

What this suggests: multi-head attention isn’t just redundancy or ensembling. The model learns to decompose sequence understanding into multiple parallel relationship-tracking tasks. Each head can attend to different aspects simultaneously.
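To make the "parallel relationship-tracking" point concrete, here is a minimal NumPy sketch of multi-head self-attention. Dimensions, parameter names, and the toy input are illustrative assumptions, not the paper's implementation; the point is only that each head works with its own slice of the projected queries, keys, and values, computes its own attention map, and the heads are then concatenated and mixed by an output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (seq_len, d_model). Each head attends over its own slice of the
    projected queries/keys/values, so each head can track a different
    relationship in the sequence."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each (seq_len, d_model)
    outputs, per_head_weights = [], []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        weights = softmax(scores, axis=-1)        # (seq_len, seq_len)
        per_head_weights.append(weights)          # keep for inspection
        outputs.append(weights @ V[:, sl])        # (seq_len, d_head)
    # Concatenate the heads and mix them with the output projection.
    out = np.concatenate(outputs, axis=-1) @ W_o
    return out, per_head_weights

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 8, 2, 5
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, head_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)              # (5, 8)
print(head_weights[0].shape)  # (5, 5) -- one attention map per head
```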

What it doesn’t tell us: why heads specialize the way they do, or whether the specializations are stable across training runs. We can observe that heads specialize and we can visualize their attention patterns, but inferring the “purpose” of a head remains an interpretive act. The structure of attention is more legible than most neural network internals, but still far from transparent.
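One way to probe for specialization, in the spirit of the paper's appendix visualizations, is simply to look at each head's attention map and note which token each query position attends to most strongly. A sketch continuing the toy example above (the token list is a made-up sentence; labeling the resulting patterns as "syntax" or "coreference" is still a human judgment call):

```python
tokens = ["the", "cat", "saw", "itself", "."]  # illustrative toy sentence
for h, weights in enumerate(head_weights):
    top = weights.argmax(axis=-1)  # strongest attention target per query position
    print(f"head {h}:")
    for q, k in enumerate(top):
        print(f"  {tokens[q]!r:>10} -> {tokens[k]!r}  ({weights[q, k]:.2f})")
```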

Related: 05-atom—multi-head-attention-definition, 05-molecule—attention-mechanism-concept