Large Language Models (LLMs)
Meta AI released LLaMA ... and they included a paper describing exactly what it was trained on: about 5TB of data.
Two thirds of it came from Common Crawl, and the rest included content from GitHub, Wikipedia, ArXiv, StackExchange, and something called “Books”.
What’s Books? 4.5% of the training data was books. Part of this was Project Gutenberg, which is public domain.
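For a sense of scale, the per-source proportions reported in the LLaMA paper can be tabulated against that round 5TB figure. A minimal sketch: the percentages are the sampling proportions from the paper, and the byte counts here are rough estimates rather than the exact disk sizes it reports.

```python
# Pre-training mix reported in the LLaMA paper (Touvron et al., 2023).
# Percentages are the paper's sampling proportions; the ~5TB total is the
# round figure quoted above, so per-source sizes are rough estimates.
TOTAL_TB = 5.0

mix = {
    "CommonCrawl":   67.0,
    "C4":            15.0,  # also derived from Common Crawl
    "GitHub":         4.5,
    "Wikipedia":      4.5,
    "Books":          4.5,  # Project Gutenberg plus other book sources
    "ArXiv":          2.5,
    "StackExchange":  2.0,
}

for source, pct in mix.items():
    print(f"{source:<14} {pct:>5.1f}%  ~{TOTAL_TB * pct / 100:.2f} TB")
```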
“The fact that these things model language is probably one of the biggest discoveries in history. That you [an LLM] can learn language by just predicting the next word … — that’s just shocking to me.”
- Mikhail Belkin, computer scientist at the University of California, San Diego
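To make the “just predicting the next word” idea concrete, here is a toy sketch: a bigram counter that picks the word it has most often seen following the current one. This is a hypothetical miniature of the next-word objective, nothing like a real LLM, but it shows how regularities of language start to fall out of pure next-word prediction.

```python
from collections import Counter, defaultdict

# A toy stand-in for the objective in the quote: a "model" that does
# nothing but predict the next word, using raw bigram counts.
corpus = "the cat sat on the mat and the cat slept".split()

next_word = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word[current][nxt] += 1

def predict(word):
    """Return the word most frequently observed after `word`."""
    counts = next_word.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict("the"))  # -> "cat" (seen twice after "the", vs. "mat" once)
```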