artificial intelligence
A look back at AlphaGo—the first AI system that beat the world champions at the game of Go, decades before it was thought possible—is useful here as well.
- In step 1, AlphaGo was trained by imitation learning on expert human Go games. This gave it a foundation.
- In step 2, AlphaGo played millions of games against itself. This let it become superhuman at Go.
SITUATIONAL AWARENESS - The Decade Ahead • I. From GPT-4 to AGI: Counting the OOMs
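Read as a recipe, the two steps are: initialize a policy by imitating expert games, then keep improving it by having it play against itself. Below is a deliberately tiny, hypothetical sketch of that recipe in Python on a miniature game (single-heap Nim); the tabular policy, the synthetic "expert", and the crude winner-reinforcement update are illustrative stand-ins, nothing like AlphaGo's actual networks or Monte Carlo tree search.

```python
# Toy illustration of the two-step AlphaGo recipe on a miniature game:
# single-heap Nim (players alternately take 1-3 stones; whoever takes the
# last stone wins). Everything here is a stand-in chosen for brevity.
import random
from collections import defaultdict

HEAP, ACTIONS = 10, (1, 2, 3)

def legal(heap):
    return [a for a in ACTIONS if a <= heap]

def expert_move(heap):
    # Known optimal play: take (heap mod 4) stones whenever that is legal.
    return heap % 4 if heap % 4 in legal(heap) else random.choice(legal(heap))

# Step 1: imitation learning -- seed the policy with the moves the "expert"
# most often plays in each state, collected from expert-vs-expert games.
counts = defaultdict(lambda: defaultdict(int))
for _ in range(2_000):
    heap = HEAP
    while heap:
        a = expert_move(heap)
        counts[heap][a] += 1
        heap -= a
policy = {h: max(c, key=c.get) for h, c in counts.items()}

# Step 2: self-play -- play the current policy against itself and reinforce
# the winner's moves (a crude Monte Carlo update, standing in for RL).
values = defaultdict(lambda: defaultdict(float))

def choose(heap, eps=0.2):
    if heap not in policy or random.random() < eps:
        return random.choice(legal(heap))   # occasional exploration
    return policy[heap]

for _ in range(20_000):
    heap, player, history = HEAP, 0, []
    while heap:
        a = choose(heap)
        history.append((player, heap, a))
        heap -= a
        player ^= 1
    winner = history[-1][0]                 # taking the last stone wins
    for p, s, a in history:
        values[s][a] += 1.0 if p == winner else -1.0
        policy[s] = max(values[s], key=values[s].get)

print("stones left -> stones to take:", {h: policy[h] for h in sorted(policy)})
```

The structure mirrors the essay's point: the imitation phase only gives the policy a starting foundation, and self-play then improves it with no further human data.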
In addition to insider bullishness, I think there’s a strong intuitive case for why it should be possible to find ways to train models with much better sample efficiency (algorithmic improvements that let them learn more from limited data). Consider how you or I would learn from a really dense math textbook:
- What a modern LLM does during training is, essentially, very quickly skim the textbook, the words just flying by, not spending much brain power on it.
SITUATIONAL AWARENESS - The Decade Ahead • I. From GPT-4 to AGI: Counting the OOMs
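For concreteness, the "skimming" in the bullet above is next-token prediction: a pass over the text in which each step nudges the model to assign higher probability to the token that actually came next. Here is a minimal, self-contained sketch; the toy model (an embedding plus a linear head, no attention) and the tiny repeated "textbook" string are stand-ins for illustration, not any lab's training setup.

```python
# Minimal sketch of LLM-style pretraining: one quick pass of next-token
# prediction over some text with a toy character-level model.
import torch
import torch.nn as nn
import torch.nn.functional as F

text = "the derivative of x squared is two x. " * 50   # stand-in "textbook"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[ch] for ch in text])

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):                  # x: (seq,) token ids
        return self.head(self.embed(x))    # (seq, vocab) next-token logits

model = TinyLM(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# One "skim" over the data: every token is seen once, briefly.
block = 64
for start in range(0, len(ids) - block - 1, block):
    x = ids[start : start + block]             # context tokens
    y = ids[start + 1 : start + block + 1]     # the tokens that actually follow
    loss = F.cross_entropy(model(x), y)        # penalize bad next-token guesses
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final next-token loss: {loss.item():.3f}")
```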
You can go somewhat further by repeating data, but academic work on this suggests that repetition only gets you so far, finding that after 16 epochs (a 16-fold repetition), returns diminish extremely fast to nil. At some point, even with more (effective) compute, making your models better can become much tougher because of the data constraint.
SITUATIONAL AWARENESS - The Decade Ahead • I. From GPT-4 to AGI: Counting the OOMs
There is a potentially important source of variance for all of this: we’re running out of internet data. That could mean that, very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.
Frontier models are already trained on much of the internet. Llama 3, for example, was trained on over 15T tokens.
SITUATIONAL AWARENESS - The Decade Ahead • I. From GPT-4 to AGI: Counting the OOMs
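A rough back-of-the-envelope makes the constraint concrete. Every constant below is an illustrative assumption: a ~15T-token web-scale corpus (the Llama 3 figure above), the ~16-epoch repetition ceiling from the academic work cited earlier, the Chinchilla-style heuristic of roughly 20 training tokens per parameter, and the standard C ≈ 6·N·D approximation for training FLOPs.

```python
# Back-of-the-envelope arithmetic for the data wall. All constants are
# illustrative assumptions, not measured facts about any particular model.
unique_tokens = 15e12       # ~15T tokens: roughly a frontier web-scale corpus
max_epochs = 16             # repetition ceiling suggested by academic work
tokens_per_param = 20       # Chinchilla-style heuristic: ~20 tokens per parameter
flops_per_param_token = 6   # standard C ~ 6*N*D approximation for training FLOPs

effective_tokens = unique_tokens * max_epochs       # ceiling on usable training data
params = effective_tokens / tokens_per_param        # largest "data-matched" model
train_flops = flops_per_param_token * params * effective_tokens

print(f"effective token budget:        {effective_tokens:.1e}")
print(f"data-matched parameter count:  {params:.1e}")
print(f"implied training compute:      {train_flops:.1e} FLOPs")
```

Under these assumptions the naive recipe tops out at an effective budget of roughly 2.4 × 10^14 tokens; beyond the training runs that budget supports, more compute alone stops translating into better models.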
We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:
- Compute: We're using much bigger computers to train these models.
- Algorithmic efficiencies: There's a continuous trend of algorithmic progress. Many of these act as "compute multipliers," and we can put them on a unified scale of growing effective compute.
- "Unhobbling": Fixing ways in which the models are hobbled by default and unlocking latent capabilities, for example by giving them tools and moving from raw chatbots toward agents, which yields step-changes in usefulness.
SITUATIONAL AWARENESS - The Decade Ahead • I. From GPT-4 to AGI: Counting the OOMs
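Because algorithmic efficiencies act as compute multipliers, they combine with physical compute multiplicatively; counting in OOMs (orders of magnitude, i.e., factors of 10) just moves this to a log scale where the contributions add. A small sketch with placeholder multipliers (chosen for illustration, not the essay's estimates):

```python
# "Counting the OOMs": compute multipliers combine multiplicatively, so on a
# log10 scale their contributions simply add. The multipliers below are
# placeholders for illustration only.
import math

scaleups = {
    "compute": 3_000,                   # e.g. a ~3,000x bigger training run
    "algorithmic efficiencies": 100,    # e.g. ~100x less compute for the same result
}

ooms = {name: math.log10(mult) for name, mult in scaleups.items()}
total = sum(ooms.values())

for name, v in ooms.items():
    print(f"{name}: ~{v:.1f} OOMs")
print(f"effective compute scaleup: ~{total:.1f} OOMs (~{10 ** total:,.0f}x)")
```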