I've been trying to understand the transformer architecture for ages, but it never clicked. Recently I started learning its history, and that has helped a LOT. To cement what I learned, I'm going to write up an essay.
Undefined Behavior presents: The Evolution of Attention
I just read this new paper from Google and I’m absolutely buzzing 🤯
The core idea is almost offensively simple: ditch recurrence and convolutions, and use only attention. That’s it. And somehow…it unlocks a whole new regime of performance, scale, and simplicity.
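To make "only attention" concrete, here's a minimal sketch of the scaled dot-product attention at the heart of the paper, written in NumPy. The function names and the toy shapes are my own; this omits the multi-head projections, masking, and everything else a real Transformer layer has.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scores[i, j]: how much query position i attends to key position j
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted average of the values

# toy self-attention: 3 positions, model dimension 4, Q = K = V = x
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = attention(x, x, x)
print(out.shape)  # (3, 4)
```

The whole trick is that every position can look at every other position in one step, instead of information trickling along a recurrent chain.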
Here’s what blew my...