Attention Is Weighting, Not Understanding
Attention mechanisms compute weighted combinations of input representations. They determine what to focus on, not what it means.
Attention weights show which tokens influenced which other tokens during processing. This is useful, but easily misinterpreted: a high attention weight between two tokens doesn’t mean the model “understands” their relationship; it means their vectors had high compatibility scores and one influenced the other’s representation.
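For concreteness, here is a minimal NumPy sketch of single-head scaled dot-product attention (the standard Transformer form). The shapes, seed, and identity projections are illustrative assumptions, not taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: weights are softmax-normalized compatibility
    scores; the output is a weighted sum of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compatibility scores, shape (seq, seq)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    output = weights @ V                # weighted combination of value vectors
    return output, weights

# Toy input: 4 "tokens" with 8-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
# In a real Transformer, Q, K, V come from learned projections of X;
# identity projections keep the sketch minimal.
out, weights = scaled_dot_product_attention(X, X, X)
print(weights.round(2))  # the full "explanation" attention offers: a weight matrix
```

Everything the mechanism produces is in that weight matrix and the resulting weighted sum; any claim beyond "these vectors scored highly against each other" is interpretation layered on top.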
The pattern I keep encountering: attention visualizations get treated as explanations. “The model attended to the verb when processing the subject” sounds like reasoning, but it’s describing a weighted sum, not a logical inference.
What attention does: routes information selectively through the network, allowing different parts of the input to influence each other based on learned compatibility functions.
What attention doesn’t do: reason, understand, explain its own behavior, or guarantee that high-attention connections are semantically meaningful.
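A toy illustration of the last point, with vectors drawn at random so they cannot carry any semantics; each position still ends up with a clearly "preferred" token, purely from geometry (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
# Random "token" vectors carry no meaning at all, yet attention still
# produces a structured-looking weight pattern from their geometry alone.
Q = rng.normal(size=(5, 16))
K = rng.normal(size=(5, 16))
scores = Q @ K.T / np.sqrt(16)
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
for i, row in enumerate(weights):
    print(f"token {i} attends hardest to token {row.argmax()} (weight {row.max():.2f})")
```

A heatmap of this matrix would look like it reveals relationships; by construction, there are none to reveal.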
Related: 05-atom—attention-heads-specialization, 05-atom—uniform-confidence-problem