togethercomputer/RedPajama-Data-V2 · Datasets at Hugging Face

Related Highlights

Datasets as Imagination

Data-Juicer: A One-Stop Data Processing System for Large Language Models

Data-Juicer is a one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs. This project is being actively updated and maintained, and we will periodically enhance and add more features and data recipes. We welcome you to join us in pro...

alibaba • GitHub - alibaba/data-juicer: A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷 Higher-quality, richer, and more “digestible” data for large language models!

Nicolay Gerold added

There are two basic approaches to creating AI datasets. In the first, which is typical of the case we have been studying, a pool of open works is purposefully chosen to ensure license compliance. The second approach creates the dataset by scraping the “raw internet” and relying on copyright exceptions. LAION, a dataset of 400 million image-text pa...

Alek Tarkowski • Filling the governance vacuum related to the use of information commons for AI training

madisen added

The dataset was constructed by pooling 60 existing robot datasets from 34 robotic research labs around the world.

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Darren LI added

Following the open data movement, often embracing a not-for-profit philosophy, many data sets are available online from fields like biodiversity, business, cartography, chemistry, genomics, and medicine. Look at one central index, www.kdnuggets.com/datasets, and you’ll see what amounts to lists of lists of data resources.

Eric Siegel • Predictive Analytics

The starting point of any RAG system is its source data, often consisting of a vast corpus of text documents, websites, or databases.

DataStax • Retrieval Augmented Generation (RAG) Explained: Understanding Key Concepts

NeMo Curator

NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model conv...

GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation