WebDataset

Let me explain why Hugging Face Datasets storage is faster than S3 + why today's release changes everything 🧵 https://t.co/HvCj5NvbLv
Unitxt is a python library for getting data fired up and set for utilization. In one line of code, it preps a dataset or mixtures-of-datasets into an input-output format for training and evaluation. We aspire to be simple, adaptable and transparent.
Unitxt builds on separation. Separation allows adding a dataset, without knowing anything about the... See more
Unitxt builds on separation. Separation allows adding a dataset, without knowing anything about the... See more
IBM • GitHub - IBM/unitxt: 🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
NeMo Curator
NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model... See more
NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model... See more