Positional Encoding

A mechanism for injecting sequence order information into attention-based models. Because attention is inherently order-agnostic (treating inputs as a set, not a sequence), position must be explicitly encoded.

The original Transformer used sinusoidal functions at different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos is the token position, i indexes the dimension pair, and d is the embedding dimension.
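
A minimal NumPy sketch of how this table can be computed (function name and shapes are illustrative; assumes an even embedding dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Build a (max_len, d_model) table of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```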

These encodings are added directly to the input embeddings. The authors hypothesized that sinusoids would let the model easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos).
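
A quick numerical check of that linear-function claim, under the sketch above: each (sin, cos) pair is rotated by an angle that depends only on the offset k, not on pos (values here are illustrative):

```python
import numpy as np

d_model, pos, k = 64, 10, 7
i = np.arange(0, d_model, 2)
omega = 1.0 / np.power(10000.0, i / d_model)     # per-pair angular frequencies

def pe_pairs(p):
    # Each frequency contributes a (sin, cos) pair at position p.
    return np.stack([np.sin(p * omega), np.cos(p * omega)], axis=-1)

# One 2x2 rotation matrix per frequency, depending only on the offset k.
rot = np.stack([
    np.stack([np.cos(k * omega),  np.sin(k * omega)], axis=-1),
    np.stack([-np.sin(k * omega), np.cos(k * omega)], axis=-1),
], axis=-2)                                      # shape (d_model/2, 2, 2)

shifted = np.einsum('fij,fj->fi', rot, pe_pairs(pos))
assert np.allclose(shifted, pe_pairs(pos + k))   # PE(pos+k) = M_k · PE(pos)
```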

Learned positional embeddings work comparably but don't extrapolate to sequences longer than those seen during training. Sinusoidal encodings theoretically can, though in practice models still struggle with lengths far beyond their training distribution.
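
To illustrate the extrapolation limit, a hypothetical PyTorch sketch: a learned table has one trainable vector per position up to a fixed maximum, so positions past that limit simply have no embedding, whereas a sinusoidal encoding can be evaluated at any position.

```python
import torch
import torch.nn as nn

max_len, d_model = 512, 64
learned_pe = nn.Embedding(max_len, d_model)   # one trainable vector per position

ok = learned_pe(torch.arange(max_len))        # fine: positions 0..511
# learned_pe(torch.arange(600))               # IndexError: positions >= 512 have no row
# A sinusoidal table, by contrast, can simply be recomputed for any length.
```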

Related: 05-atom—self-attention-definition, 05-atom—context-window-limitations, 05-molecule—attention-mechanism-concept