The Illustrated Transformer

Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a ...

Jay Alammar • The Illustrated Transformer
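
The excerpt above is cut off before it names the three vectors, so the following is only a rough sketch of that first step, under stated assumptions. It assumes the three vectors are the query, key, and value vectors conventionally used in attention, each obtained by multiplying a word's embedding by its own weight matrix. The names `w_q`, `w_k`, `w_v`, `project_qkv`, the toy dimensions, and the random weights are all illustrative and do not come from the excerpt; in a real model the weight matrices are learned during training.

```python
import random


def matvec(vector, matrix):
    """Multiply a row vector (length d_in) by a d_in x d_out matrix."""
    return [
        sum(v * row[j] for v, row in zip(vector, matrix))
        for j in range(len(matrix[0]))
    ]


def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (assumption: random values)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project_qkv(embeddings, w_q, w_k, w_v):
    """Create three vectors (query, key, value) for each input embedding."""
    return [(matvec(x, w_q), matvec(x, w_k), matvec(x, w_v)) for x in embeddings]


rng = random.Random(0)
d_model, d_head = 4, 3  # toy sizes chosen for readability, not from the source

# Hypothetical projection matrices; a trained model would supply these.
w_q = random_matrix(d_model, d_head, rng)
w_k = random_matrix(d_model, d_head, rng)
w_v = random_matrix(d_model, d_head, rng)

# Two made-up word embeddings, one per input word.
embeddings = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
]

vectors = project_qkv(embeddings, w_q, w_k, w_v)
for i, (q, k, v) in enumerate(vectors):
    print(f"word {i}: query={q} key={k} value={v}")
```

Each word ends up with three vectors of length `d_head`, which may be smaller than the embedding size. The sketch stops at this first step and does not compute attention scores.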

Self-Attention at a High Level

Don’t be fooled by me throwing around the word “self-attention” like it’s a concept everyone should be familiar with. I had personally never come across the concept until reading the “Attention Is All You Need” paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:

“The ...
