Is Data Still a Moat?


Awesome and highly useful: FineWeb-Edu 📚👏
High-quality LLM dataset that filters the original 15 trillion FineWeb tokens down to the 1.3 trillion of highest (educational) quality, as judged by Llama 3 70B. Plus a highly detailed paper.
Turns out that LLMs learn a lot better and faster from educational content as...
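
To make the idea of model-judged filtering concrete, here is a minimal sketch of score-and-threshold filtering. It is not the FineWeb-Edu pipeline: `toy_score` is a hypothetical keyword heuristic standing in for an LLM judge, and the threshold of 2.0 and the sample corpus are invented for illustration.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield only the documents whose quality score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for an LLM judge: counts 'educational' cue words."""
    cues = ("theorem", "example", "definition", "explain")
    text = doc.lower()
    return float(sum(cue in text for cue in cues))


# Invented sample corpus, for illustration only.
corpus = [
    "Buy cheap watches now!!!",
    "Definition: a prime has exactly two divisors. Example: 7.",
    "We explain the theorem with a worked example.",
]

# Assumed threshold; a real pipeline would tune this against held-out data.
kept = list(filter_by_quality(corpus, toy_score, threshold=2.0))
```

In a real setting the scoring function would be the expensive part (a large model or a smaller classifier trained on its judgments), which is why the sketch keeps it as a pluggable argument.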