Constant-Time Attention Paths
In self-attention, any two positions in a sequence are connected by a path of length O(1): a constant number of operations, regardless of how far apart they are in the sequence.
In recurrent networks, information between position 1 and position 100 must traverse 99 sequential steps. In convolutional networks, the path length depends on kernel size and depth. In attention, every position directly attends to every other position in a single operation.
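A minimal sketch of this in NumPy (single head, no learned Q/K/V projections, which are omitted here for brevity): one matrix product scores every position against every other, so the "path" between any pair of positions is a single weighted edge.

```python
import numpy as np

def self_attention(X):
    """X: (n, d) sequence of n token vectors. Returns (n, d) outputs."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                     # (n, n): position i scores every position j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ X                                # each output mixes all positions directly

X = np.random.randn(100, 64)
out = self_attention(X)
# The weight matrix is dense: position 0 is connected to position 99 by a
# single entry, weights[0, 99], not by 99 intermediate steps.
```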
Why this matters: long-range dependencies are notoriously hard for recurrent networks to learn. The longer the path, the more chances for gradients to vanish or explode, and the harder it is to maintain relevant information. Constant-time paths make it structurally easier to learn relationships between distant positions.
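A back-of-envelope illustration (not from the note itself): the gradient reaching position 1 from a loss at position 100 in a vanilla RNN passes through roughly 99 Jacobian factors, so its magnitude scales like sigma**99, where sigma stands in for a typical singular value of the recurrent Jacobian.

```python
steps = 99
for sigma in (0.9, 1.0, 1.1):
    # norm of a product of `steps` Jacobians whose largest singular value is sigma
    print(f"sigma={sigma}: gradient scale ~ {sigma**steps:.2e}")
# sigma=0.9 -> ~3e-05 (vanishing); sigma=1.1 -> ~1e+04 (exploding).
# Over an attention path the same pair of positions is one step apart,
# so no such repeated product appears along that path.
```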
The flip side: attention doesn’t inherently encode local structure. Nearby tokens aren’t privileged over distant ones; the model must learn any locality biases from data. This is both a strength (no hardcoded assumptions) and a weakness (it must spend capacity learning things recurrence gets “for free”).
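A quick check of that claim, assuming the same bare NumPy sketch as above: without positional information, self-attention is permutation-equivariant, so shuffling the input positions just shuffles the outputs the same way. Token distance plays no role until positional encodings (or learned biases) introduce it.

```python
import numpy as np

def self_attention(X):
    """Bare self-attention, no positional encoding (same sketch as above)."""
    d = X.shape[-1]
    s = X @ X.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

X = np.random.randn(10, 16)
perm = np.random.permutation(10)

# Permuting the inputs permutes the outputs identically: the layer itself
# has no built-in notion of "nearby".
print(np.allclose(self_attention(X)[perm], self_attention(X[perm])))  # True
```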
Related: 05-atom—sequential-bottleneck-problem, 05-atom—self-attention-definition, 05-molecule—attention-vs-recurrence-comparison