WebDataset
Unitxt is a Python library for getting data ready to use. In one line of code, it prepares a dataset or a mixture of datasets into an input-output format for training and evaluation. We aspire to be simple, adaptable and transparent.
Unitxt builds on separation. Separation allows adding a dataset, without knowing anything about the...
IBM • GitHub - IBM/unitxt: 🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
NeMo Curator
NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model...
GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation
TabLib
Access on Hugging Face 🤗 (Sample, Full Dataset)
Read the Paper (TabLib)
Introduction
Huge datasets have been critical for the performance of AI models for text and images. Similar advancements can be made for tabular data, which consists of tables of rows and columns, but the research community needs a bigger and more diverse...