A 24-trillion-token web dataset with document-level metadata just dropped on @huggingface
License: apache-2.0
ESSENTIAL-WEB v1.0 collects 24 trillion tokens from Common Crawl. Each document is labeled with a 12-field taxonomy covering topic, page type, complexity, and quality.
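To make the document-level labels concrete, here is a minimal sketch of how per-document metadata like this could be used to carve out a subset. The records and the field names (`topic`, `page_type`, `complexity`, `quality`) are hypothetical stand-ins chosen for illustration, not the dataset's actual schema, and the `select` helper is not part of any library.

```python
# Toy records standing in for labeled documents.
# Field names and values are assumptions, not the real ESSENTIAL-WEB schema.
docs = [
    {"text": "Intro to linear algebra", "topic": "mathematics",
     "page_type": "tutorial", "complexity": "intermediate", "quality": 4},
    {"text": "Buy cheap widgets now", "topic": "shopping",
     "page_type": "product_page", "complexity": "basic", "quality": 1},
    {"text": "Proof sketch, half finished", "topic": "mathematics",
     "page_type": "forum", "complexity": "advanced", "quality": 2},
]


def select(records, topic, min_quality):
    """Keep records whose topic label matches and whose quality meets the floor."""
    return [r for r in records
            if r["topic"] == topic and r["quality"] >= min_quality]


# Filter on labels only; no text classifier is run at selection time.
subset = select(docs, topic="mathematics", min_quality=3)
print([r["text"] for r in subset])
```

The point of the sketch is that selection becomes a filter over labels that already exist, rather than a fresh classification pass over the text.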