Implementing Self-Attention from Scratch in PyTorch
Another Medium article: The Easiest Way to Understand Self-Attention with Code.
Attention Is All You Need
Introduces the Transformer, a neural network architecture for sequence transduction based solely on attention mechanisms, improving machine translation quality, training speed, and parallelization over recurrent and convolutional models.
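The scaled dot-product self-attention these resources cover can be sketched minimally in PyTorch. This is a single-head version with illustrative dimensions, not any one article's implementation; the function name and the projection-matrix shapes are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (minimal sketch).

    x:              (seq_len, d_model) input embeddings
    w_q, w_k, w_v:  (d_model, d_k) learned projection matrices
    """
    q = x @ w_q                                   # queries:  (seq_len, d_k)
    k = x @ w_k                                   # keys:     (seq_len, d_k)
    v = x @ w_v                                   # values:   (seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5   # scaled dot products: (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)           # each row is a distribution over positions
    return weights @ v                            # weighted sum of values: (seq_len, d_k)

torch.manual_seed(0)
x = torch.randn(5, 16)                            # 5 tokens, model width 16
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)            # shape (5, 8)
```

In practice the projections are `nn.Linear` layers and multiple heads run in parallel, but the core computation is exactly these three matrix products plus a row-wise softmax.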