togethercomputer/RedPajama-Data-V2 · Datasets at Hugging Face
Sarah Drinkwater added
Data-Juicer: A One-Stop Data Processing System for Large Language Models
Data-Juicer is a one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs. This project is being actively updated and maintained, and we will periodically enhance and add more features and data recipes. We welcome you to join us in pro...
alibaba • GitHub - alibaba/data-juicer: A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Nicolay Gerold added
There are two basic approaches to creating AI datasets. In the first, which is typical of the case we have been studying, a pool of open works is purposefully chosen to ensure license compliance. The second approach creates the dataset by scraping the “raw internet” and relying on copyright exceptions. LAION, a dataset of 400 million image-text pa...
Alek Tarkowski • Filling the governance vacuum related to the use of information commons for AI training
madisen added
The dataset was constructed by pooling 60 existing robot datasets from 34 robotic research labs around the world.
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Darren LI added
Following the open data movement, often embracing a not-for-profit philosophy, many data sets are available online from fields like biodiversity, business, cartography, chemistry, genomics, and medicine. Look at one central index, www.kdnuggets.com/datasets, and you’ll see what amounts to lists of lists of data resources.
Eric Siegel • Predictive Analytics
The starting point of any RAG system is its source data, often consisting of a vast corpus of text documents, websites, or databases.
DataStax • Retrieval Augmented Generation (RAG) Explained: Understanding Key Concepts
NeMo Curator
NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model conv...
GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation
Nicolay Gerold added
contribute to the solution, and together they make large-scale datasets more valuable.