Sublime

An inspiration engine for ideas

AllPeopleCollectionsArticlesAudioBooksFilesHighlightsImagesLinksNotesTextTweetsVideosSocial

RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text

documents coming from 84 CommonCrawl snapshots and processed using

the CCNet pipeline. Out of these, there are 30B documents in the corpus

that additionally come with quality signals. In addition, we also provide the ids of duplicated documents... See more

togethercomputer/RedPajama-Data-V2 · Datasets at Hugging Face

Anthropic/EconomicIndex · Datasets at Hugging Face

huggingface.co

The #tidyplots paper is now published in @iMetaScience 🎉🥳🔥 Whenever you use tidyplots in your published work, consider citing https://t.co/n58lCO3Epv 🙏 #rstats #dataviz #phd https://t.co/oJqvscdv1M

Jan Broder Engler

x.com

一个简单易用的大模型微调数据集创建工具：Easy DataSet。提供了直观的界面，上传文件，智能分割内容，生成问题，并为模型微调生成高质量的训练数据。 GitHub：https://t.co/s36Y3sHhm9 此外，还支持导出 Alpaca、JSON 等多种格式数据集，同时兼容所有遵循 OpenAI 格式的 LLM API。提供 Windows、MacOS 和 Linux 安装包下载，开箱即用，也支持 Docker 部署和 NPM 安装。

GitHubDaily

x.com

This Github has a very wide collection of High-quality datasets, tools, and concepts for LLM fine-tuning. All the datasets listed here should be under permissive licensing (Apache 2.0, MIT, cc-by-4.0, etc.). Categorized into segments like Math & Logic, Code, Conversation & Role-Play, Agent &... See more

Rohan Paul

x.com

Our first batch of four diffusion models, along with super basic CLIP guided inference code, is out! https://t.co/OUOMleFolF The models are: Danbooru SFW 128x128 ImageNet 128x128 WikiArt 128x128 WikiArt 256x256 No Colab yet but soon!... See more

Rivers Have Wings

x.com

Turn any document into a high-quality synthetic dataset using Llama! Unsloth AI and Meta released a free notebook that transforms your documents into high-quality synthetic datasets, and then finetuning them using Llama. 100% Open Source https://t.co/EdXMYYzi7i

Sumanth

x.com

🚨New paper: Source2Synth🚨 - Generates synthetic examples grounded in real data - Curation step makes data high quality based on answerability - Improves performance on two challenging domains: Multi-hop QA and using tools: SQL for tabular QA https://t.co/4uggsiniIv 🧵(1/4)... See more

Jason Weston

x.com

Yesterday I reviewed ~30 open datasets and there was a spectrum of quality. While I don't believe all data need to be identically formatted, I do think data should meet these standards and include thorough documentation (e.g., a data dictionary and project summary). https://t.co/0Qe1ASmoEk

Crystal Lewis @cghlewis.bsky.social

x.com