AI skepticism
A weird GPT token corresponds to Reddit user davidjl123, "a keen member of the /r/counting subreddit." He has posted incremented numbers there well over 163,000 times. Presumably that subreddit ended up in the training data used to create the tokenizer for GPT-2, and since that particular username showed up so many times, it ended up getting its own token.
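A toy sketch of why that happens, assuming standard BPE tokenizer training (this is a made-up miniature corpus, not the real GPT-2 training run): BPE repeatedly merges the most frequent adjacent pair of symbols, so a string repeated far more often than ordinary words keeps winning merges until it collapses into a single token.

```python
# Toy BPE merge sketch (hypothetical corpus): a wildly over-represented
# string ends up as one token after enough merges.
from collections import Counter

def bpe_merges(words, num_merges):
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        # Merge the winning pair everywhere it occurs.
        merged = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return vocab

# The username appears 1000 times; ordinary words only 5 times each.
corpus = ["davidjl"] * 1000 + ["david", "journal", "jolt"] * 5
vocab = bpe_merges(corpus, num_merges=6)
# "davidjl" is 7 characters, so 6 merges collapse it to a single symbol.
print(vocab[("davidjl",)])  # frequency of the fully-merged token
```

The same dynamic at GPT-2 scale means a tokenizer trained on a crawl that includes /r/counting can end up spending a whole vocabulary slot on one prolific username.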
Meta AI released LLaMA ... and they included a paper which described exactly what it was trained on. It was 5TB of data.
2/3 of it was from Common Crawl. It had content from GitHub, Wikipedia, ArXiv, StackExchange and something called “Books”.
What's Books? 4.5% of the training data was books. Part of this was Project Gutenberg, which is public domain.
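A small sketch of sampling training documents in proportion to a source mixture. Only the Common Crawl (roughly 2/3) and Books (4.5%) shares come from the text above; the remaining split across GitHub, Wikipedia, ArXiv, and StackExchange is a made-up placeholder, not the paper's actual figures.

```python
# Illustrative mixture-weighted sampling of training sources.
import random

mixture = {
    "CommonCrawl": 0.67,    # "2/3 of it" (from the text)
    "Books": 0.045,         # "4.5% of the training data was books"
    "GitHub": 0.10,         # placeholder share
    "Wikipedia": 0.10,      # placeholder share
    "ArXiv": 0.05,          # placeholder share
    "StackExchange": 0.035, # placeholder share
}
assert abs(sum(mixture.values()) - 1.0) < 1e-9

random.seed(0)
sources = random.choices(list(mixture), weights=mixture.values(), k=10_000)
counts = {s: sources.count(s) / len(sources) for s in mixture}
print(counts)  # empirical shares track the mixture weights
```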
"A key challenge of (LLMs) is that they do not come with a manual! They come with a “Twitter influencer manual” instead, where lots of people online loudly boast about the things they can do with a very low accuracy rate, which is really frustrating..."
Simon Willison, attempting to explain LLMs
“A more practical answer is that it’s a file. This right here is a large language model, called Vicuna 7B. It’s a 4.2 gigabyte file on my computer. If you open the file, it’s just numbers. These things are giant binary blobs of numbers…”
Simon Willison, attempting to explain LLMs
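A minimal sketch of the "it's just a file of numbers" point: weight files are essentially long runs of packed floats (real formats like GGUF wrap the tensors in headers and metadata, which this toy version skips). Here we write a tiny fake "weights" file and read the raw numbers back with the standard library.

```python
# Write four float32 "parameters" to disk, then read the bytes back:
# the "model" on disk is nothing but packed numbers.
import os
import struct
import tempfile

weights = [0.25, -1.5, 3.0, 0.0078125]  # pretend parameters, exactly
                                        # representable in float32
path = os.path.join(tempfile.mkdtemp(), "tiny.bin")

with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(weights)}f", *weights))  # little-endian float32

with open(path, "rb") as f:
    raw = f.read()
recovered = list(struct.unpack(f"<{len(raw) // 4}f", raw))
print(recovered)
```

Scale the same idea up to billions of parameters and you get Vicuna 7B's 4.2 gigabyte blob.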
