A look back at AlphaGo—the first AI system that beat the world champions at the game of Go, decades before it was thought possible—is useful here as well.
In step 1, AlphaGo was trained by imitation learning on expert human Go games. This gave it a foundation.
In step 2, AlphaGo played millions of games against itself. This let it become superhuman.
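To make that two-step structure concrete, here is a deliberately tiny sketch in code. The "game," the policy, and the update rules are invented for illustration (a one-move guessing game with a softmax policy); this is not AlphaGo's actual method, only the shape of the recipe: imitate expert data first, then improve through self-play on outcomes.

```python
# Toy sketch of the two-step recipe: (1) imitation learning on "expert" data,
# then (2) self-play that reinforces moves which actually win.
# Everything here (the game, the policy, the learning rates) is invented for
# illustration; it only mirrors the two-phase structure, not AlphaGo itself.

import math
import random

MOVES = list(range(10))   # toy action space
TARGET = 7                # the "winning" move, unknown to the agent

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Phase 1: imitation — push the policy toward moves seen in expert games.
def imitation_step(logits, expert_move, lr=0.5):
    probs = softmax(logits)
    for a in MOVES:
        grad = (1.0 if a == expert_move else 0.0) - probs[a]  # cross-entropy gradient
        logits[a] += lr * grad
    return logits

# Phase 2: self-play — sample a move, observe the outcome, reinforce accordingly.
def self_play_step(logits, lr=0.5):
    probs = softmax(logits)
    move = random.choices(MOVES, probs)[0]
    reward = 1.0 if move == TARGET else -0.1
    for a in MOVES:
        grad = ((1.0 if a == move else 0.0) - probs[a]) * reward  # REINFORCE-style update
        logits[a] += lr * grad
    return logits

logits = [0.0] * len(MOVES)
for move in [6, 7, 8, 7, 6]:        # step 1: a foundation from noisy "expert" games
    logits = imitation_step(logits, move)
for _ in range(2000):               # step 2: self-play pushes past the experts
    logits = self_play_step(logits)
probs = softmax(logits)
print("best move after training:", max(MOVES, key=lambda a: probs[a]))
```

Run it and the policy ends up preferring the winning move even though the "expert" games were noisy; the self-play phase is what pushes past the imitation ceiling.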
In addition to insider bullishness, I think there’s a strong intuitive case for why it should be possible to find ways to train models with much better sample efficiency (algorithmic improvements that let them learn more from limited data). Consider how you or I would learn from a really dense math textbook:
You can go somewhat further by repeating data, but academic work on this suggests that repetition only gets you so far, finding that after 16 epochs (a 16-fold repetition), returns diminish extremely quickly to nil. At some point, even with more (effective) compute, making your models better can become much tougher because of the data constraint. This...
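As a rough way to see what that saturation looks like in practice, here is a toy calculation. The geometric-decay form and the 0.7 factor below are assumptions chosen to qualitatively echo the finding above; they are not the fitted curve from that academic work, and the token count is an arbitrary round number.

```python
# Toy model of diminishing returns from repeating pretraining data: each extra
# pass over the same tokens is assumed to be worth a constant factor less than
# the previous one. Illustrative only; not the published fit.

def effective_epochs(epochs, r=0.7):
    """Unique-data-equivalent value of `epochs` passes over the same tokens,
    assuming pass k is worth r**(k-1) as much as the first (geometric decay)."""
    return (1.0 - r ** epochs) / (1.0 - r)

unique_tokens = 10e12  # an arbitrary round number of unique tokens, purely for illustration
for epochs in (1, 4, 16, 64):
    eff = unique_tokens * effective_epochs(epochs)
    print(f"{epochs:>3} epochs -> worth ~{eff / 1e12:.1f}T tokens of fresh data")
```

On these made-up numbers, going from 16 to 64 epochs buys essentially nothing extra, which is the flavor of the result above.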
There is a potentially important source of variance for all of this: we’re running out of internet data. That could mean that, very soon, the naive approach to pretraining larger language models on more scraped data could start hitting serious bottlenecks.
Frontier models are already trained on much of the internet. Llama 3, for example, was...
We can decompose the progress in the four years from GPT-2 to GPT-4 into three categories of scaleups:
Compute: We’re using much bigger computers to train these models.
Algorithmic efficiencies: There’s a continuous trend of algorithmic progress. Many of these act as “compute multipliers,” and we can put them on a unified scale of growing effective compute (a toy version of that bookkeeping is sketched below).
“Unhobbling” gains: Beyond raw scale, improvements like RLHF, chain-of-thought, and tools/scaffolding unlock latent capabilities the models already have.
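To make the “unified scale” idea concrete, here is a minimal sketch of the bookkeeping: bigger training runs and algorithmic efficiencies both act multiplicatively on training compute, so their contributions add in log-space as orders of magnitude (OOMs) of effective compute. The multipliers below are placeholder numbers, not estimates from this piece.

```python
# Orders-of-magnitude bookkeeping: multipliers on training compute add in log10 space.
import math

def ooms(multiplier):
    """Orders of magnitude (base-10 log) of a raw multiplier, e.g. 100x -> 2 OOMs."""
    return math.log10(multiplier)

physical_compute_scaleup = 1_000  # hypothetical: 1,000x more training compute
algorithmic_multiplier = 100      # hypothetical: algorithms 100x more compute-efficient

total = ooms(physical_compute_scaleup) + ooms(algorithmic_multiplier)
print(f"effective compute gain: {total:.0f} OOMs (~{10 ** total:,.0f}x)")
```

With these placeholder numbers, 3 OOMs of physical compute plus 2 OOMs of algorithmic efficiency amount to 5 OOMs, i.e. roughly a 100,000x scale-up in effective compute.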
What could ambitious unhobbling over the coming years look like? The way I think about it, there are three key ingredients:
1. Solving the “onboarding problem”
GPT-4 has the raw smarts to do a decent chunk of many people’s jobs, but it’s sort of like a smart new hire that just showed up 5 minutes ago: it doesn’t have...
What a modern LLM does during training is, essentially, very, very quickly skim the textbook, the words just flying by, not spending much brain power on it.
Rather, when you or I read that math textbook, we read a couple pages slowly; then have an internal monologue about the material in our heads and talk about it with a few study-buddies; read another page or two; then try some practice problems, failing and retrying until the material finally clicks.