Attention Is All You Need
Citation
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
Core Contribution
Introduced the Transformer architecture, the first sequence transduction model based entirely on attention, eliminating recurrence and convolution. This architecture became the foundation for GPT, BERT, and virtually all subsequent large language models.
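For quick reference, a minimal sketch of the paper's core operation, scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. This is plain NumPy with illustrative shapes and variable names, not any official implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q: (len_q, d_k), K: (len_k, d_k), V: (len_k, d_v)."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled so the softmax stays well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Illustrative example: 4 tokens, d_k = d_v = 8 (hypothetical sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```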
Key Framing
The dominant assumption at the time was that attention mechanisms augment recurrent networks. The authors asked whether attention could replace recurrence entirely. The bet paid off: attention alone not only matched but exceeded the performance of recurrent models, while enabling massive parallelization.
Strategic Value for heyMHK
This paper provides foundational definitions for understanding how modern AI systems process sequences. The attention mechanism concept recurs constantly when discussing how LLMs “reason,” what context windows mean, and why certain failure modes emerge.
Atoms Extracted
- 05-atom—self-attention-definition
- 05-atom—query-key-value-framework
- 05-atom—multi-head-attention-definition (see the formula sketch after this list)
- 05-atom—positional-encoding-definition
- 05-atom—attention-path-length-observation
- 05-atom—attention-heads-specialization
- 05-atom—sequential-bottleneck-problem
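For reference while fleshing out the multi-head-attention and positional-encoding atoms, the paper's definitions, reproduced in its notation (h parallel heads with learned projections W_i^Q, W_i^K, W_i^V, W^O; pos is the token position and i indexes embedding dimensions):

```latex
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})

PE_{(pos,\, 2i)} = \sin\!\left(pos / 10000^{2i / d_{\mathrm{model}}}\right),
\qquad PE_{(pos,\, 2i+1)} = \cos\!\left(pos / 10000^{2i / d_{\mathrm{model}}}\right)
```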
Molecules Derived
Notes
The appendix visualizations showing how different attention heads specialize (some track syntactic structure, others semantic relationships) are particularly valuable for explaining the limits of AI interpretability: we can see that heads specialize, but not always why they attend to what they attend to.