Multi-Head Attention
Running multiple attention operations in parallel, each with its own learned projections, then concatenating the results and passing them through a final linear projection.
Instead of one attention function attending to all information at once, multi-head attention lets the model attend to information from different representational subspaces simultaneously. Each “head” can specialize in different types of relationships.
The original Transformer used 8 parallel heads. Each operates in a reduced dimension (512/8 = 64), keeping total computation comparable to single-head attention at full dimension.
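A minimal NumPy sketch of this arrangement, assuming random placeholder matrices in place of learned projections: each of the 8 heads runs scaled dot-product attention in its own 64-dimensional subspace, and the concatenated outputs return to the 512-dimensional model width.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads=8, d_model=512, seed=0):
    rng = np.random.default_rng(seed)
    d_head = d_model // num_heads  # 512 / 8 = 64 per head
    # Per-head projections for queries, keys, values (random stand-ins
    # for learned weights), plus a final output projection.
    W_q = rng.normal(size=(num_heads, d_model, d_head)) / np.sqrt(d_model)
    W_k = rng.normal(size=(num_heads, d_model, d_head)) / np.sqrt(d_model)
    W_v = rng.normal(size=(num_heads, d_model, d_head)) / np.sqrt(d_model)
    W_o = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

    head_outputs = []
    for h in range(num_heads):
        Q, K, V = x @ W_q[h], x @ W_k[h], x @ W_v[h]  # each (seq_len, 64)
        scores = Q @ K.T / np.sqrt(d_head)            # scaled dot-product
        head_outputs.append(softmax(scores) @ V)      # (seq_len, 64)

    # Concatenate the heads back to d_model, then project.
    return np.concatenate(head_outputs, axis=-1) @ W_o  # (seq_len, 512)

x = np.random.default_rng(1).normal(size=(10, 512))  # 10 tokens, d_model = 512
print(multi_head_attention(x).shape)                 # -> (10, 512)
```

Note that total work is roughly the same as one full-width attention: 8 heads each operate on 64-dimensional projections instead of one head on all 512 dimensions.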
Why it matters: a single attention operation tends to average over competing relationships. Multiple heads can capture distinct relationship types: one head might track syntactic structure, another semantic similarity, another positional patterns. The model learns which specializations are useful.
Related: 05-atom—self-attention-definition, 05-atom—attention-heads-specialization, 05-molecule—attention-mechanism-concept