LLMs
Weird GPT token for Reddit user davidjl123: “a keen member of the /r/counting subreddit. He’s posted incremented numbers there well over 163,000 times. Presumably that subreddit ended up in the training data used to create the tokenizer used by GPT-2, and since that particular username showed up hundreds of thousands of times it ended up getting its own token.”
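You can poke at this yourself. A minimal sketch, assuming the tiktoken library and its bundled “gpt2” encoding: encode a username and count the resulting tokens. A string that appeared very often in the tokenizer’s training data can collapse into a single token, while an ordinary username of similar length splits into several byte-pair pieces; “davidjl” is reportedly one of the single-token cases.

```python
# Minimal sketch: inspect how GPT-2's BPE tokenizer splits usernames.
# Assumes the tiktoken library (pip install tiktoken); "gpt2" is one of
# its bundled encodings. "some_other_user" is just a made-up comparison string.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for name in ["davidjl", "some_other_user"]:
    token_ids = enc.encode(name)
    pieces = [enc.decode([t]) for t in token_ids]
    # Frequent strings in the tokenizer's training data can become a
    # single token; rare strings split into several byte-pair pieces.
    print(f"{name!r} -> {len(token_ids)} token(s): {pieces}")
```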
“The fact that these things model language is probably one of the biggest discoveries in history. That you [an LLM] can learn language by just predicting the next word … that’s just shocking to me.”
- Mikhail Belkin, computer scientist at the University of California, San Diego
Meta AI released LLaMA ... and they included a paper which described exactly what it was trained on. It was 5TB of data.
2/3 of it was from Common Crawl. The rest included content from GitHub, Wikipedia, ArXiv, StackExchange and something called “Books”.
What’s Books? 4.5% of the training data was books. Part of this was Project Gutenberg, which is public domain.
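For a concrete picture, here is a small sketch tallying the pre-training mixture as reported in the LLaMA paper (Touvron et al., 2023, Table 1); the figures are reproduced from memory of that table, so double-check them against the paper before relying on them:

```python
# Sketch: the LLaMA pre-training mixture, figures as reported in the paper
# (Touvron et al., 2023, Table 1). Proportions are sampling proportions;
# sizes are on-disk sizes in GB.
mixture = {
    # dataset: (sampling proportion, disk size in GB)
    "CommonCrawl":   (0.670, 3300),
    "C4":            (0.150, 783),   # C4 is itself derived from Common Crawl
    "GitHub":        (0.045, 328),
    "Wikipedia":     (0.045, 83),
    "Books":         (0.045, 85),    # per the paper: Gutenberg plus the Books3 corpus
    "ArXiv":         (0.025, 92),
    "StackExchange": (0.020, 78),
}

total_gb = sum(size for _, size in mixture.values())
print(f"total: {total_gb / 1000:.2f} TB")                      # roughly the 5TB quoted above
print(f"Common Crawl share: {mixture['CommonCrawl'][0]:.0%}")  # the "2/3" figure
```

The disk sizes sum to roughly 4.75TB, consistent with the “5TB of data” figure above, and raw Common Crawl alone accounts for the two-thirds share.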
