What is Pandalyst?
Pandalyst is a general large language model specifically trained to process and analyze data using the pandas library.
How is Pandalyst?
Pandalyst has strong generalization capabilities for data tables in different fields and different data analysis needs.
Why is Pandalyst?
Pandalyst is open source and free to use, and its small parameter…
pipizhao/Pandalyst-7B-V1.2 · Hugging Face
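Pandalyst's job is to turn a natural-language question into pandas code and run it. A hypothetical illustration of the kind of code it generates for a question like "What is the average salary per department?" (the table and question are made up, not from the model card):

```python
import pandas as pd

# Hypothetical table of the kind Pandalyst operates on.
df = pd.DataFrame({
    "department": ["eng", "eng", "sales", "sales"],
    "salary": [100, 120, 80, 90],
})

# For the question above, Pandalyst would emit and execute
# pandas code along these lines:
result = df.groupby("department")["salary"].mean()
print(result.to_dict())  # {'eng': 110.0, 'sales': 85.0}
```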
Nicolay Gerold added
DataTrove
DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory…
huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
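The core idea is composable processing blocks: each block consumes a document stream and emits a transformed one, and blocks chain into a pipeline. A minimal library-free sketch of that pattern (illustrative only, not DataTrove's actual API):

```python
# Each "block" is a callable over a stream of documents; blocks compose
# into a pipeline. This mirrors the filter/dedup pattern conceptually.

def length_filter(docs, min_chars=10):
    """Drop documents shorter than min_chars."""
    return (d for d in docs if len(d) >= min_chars)

def exact_dedup(docs):
    """Drop exact duplicates, keeping first occurrences."""
    seen = set()
    for d in docs:
        if d not in seen:
            seen.add(d)
            yield d

def run_pipeline(docs, blocks):
    for block in blocks:
        docs = block(docs)
    return list(docs)

corpus = [
    "short",
    "a sufficiently long document",
    "a sufficiently long document",
    "another long enough document",
]
cleaned = run_pipeline(corpus, [length_filter, exact_dedup])
print(cleaned)  # two unique long documents survive
```

In DataTrove itself, readers, filters, and dedup stages slot into an executor the same way, which is what makes the pipelines portable between a laptop and a slurm cluster.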
RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals. In addition, we also provide the ids of duplicated documents which…
togethercomputer/RedPajama-Data-V2 · Datasets at Hugging Face
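The point of shipping quality signals alongside the text is that downstream users filter by thresholding precomputed signals instead of recomputing them. A sketch with invented signal names and thresholds (the real dataset ships many more signals under its own schema):

```python
# Each document carries precomputed quality signals; curation becomes
# a cheap thresholding pass. Signal names here are illustrative.
docs = [
    {"text": "good doc ...",  "signals": {"word_count": 500, "ccnet_perplexity": 120.0}},
    {"text": "spammy doc ...", "signals": {"word_count": 12,  "ccnet_perplexity": 950.0}},
]

def keep(doc, min_words=50, max_ppl=400.0):
    s = doc["signals"]
    return s["word_count"] >= min_words and s["ccnet_perplexity"] <= max_ppl

filtered = [d["text"] for d in docs if keep(d)]
print(filtered)  # only the high-quality document remains
```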
NeMo Curator
NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence…
GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation
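The scaling recipe is partition-parallelism: split the corpus into partitions and run each curation step on partitions concurrently (Dask + RAPIDS on GPUs in NeMo Curator's case). A CPU-only stdlib analogy of the same pattern, not NeMo Curator's API:

```python
from concurrent.futures import ThreadPoolExecutor

def curate_partition(docs):
    """One curation step: normalize whitespace and drop empty documents."""
    return [" ".join(d.split()) for d in docs if d.strip()]

# The corpus is pre-split into partitions; each is curated independently.
partitions = [["  hello   world ", ""], ["nemo  curator "]]

with ThreadPoolExecutor() as pool:
    curated = [doc for part in pool.map(curate_partition, partitions)
               for doc in part]
print(curated)  # ['hello world', 'nemo curator']
```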
Text embeddings are a critical piece of many pipelines, from search, to RAG, to vector databases and more. Most embedding models are BERT/Transformer-based and typically have short context lengths (e.g., 512). That's only about two pages of text, but documents can be very long – books, legal cases, TV screenplays, code repositories, etc. can be tens…
Long-Context Retrieval Models with Monarch Mixer
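The usual workaround for a short-context embedder is to split long documents into overlapping chunks and embed them piecewise (long-context models like Monarch Mixer aim to avoid this). A whitespace-token sketch of that chunking step, with illustrative sizes:

```python
def chunk(text, max_tokens=512, overlap=64):
    """Split text into overlapping windows of at most max_tokens words."""
    tokens = text.split()
    step = max_tokens - overlap
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# A 1200-"token" document becomes three windows: 512, 512, and a 304-token tail,
# with each adjacent pair sharing 64 tokens of overlap.
doc = " ".join(f"tok{i}" for i in range(1200))
chunks = chunk(doc)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [512, 512, 304]
```

Real pipelines chunk on model-specific tokenizer tokens rather than whitespace, but the windowing logic is the same.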
Repository for the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.84M CoT rationales extracted across 1,060 tasks.
Paper Link : https://arxiv.org/abs/2305.14045
kaistAI • GitHub - kaistAI/CoT-Collection: [Under Review] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
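CoT fine-tuning pairs each task input with a rationale followed by the answer, so the model learns to reason before answering. A sketch of assembling one such training record (field names and templates are illustrative, not the dataset's exact schema):

```python
def to_training_example(question, rationale, answer):
    """Format a (question, rationale, answer) triple as a CoT training record."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    target = f"{rationale} So the answer is {answer}."
    return {"prompt": prompt, "target": target}

ex = to_training_example(
    "If a pen costs 2 dollars, how much do 3 pens cost?",
    "Each pen costs 2 dollars and there are 3 pens, so 3 * 2 = 6 dollars.",
    "6 dollars",
)
print(ex["target"])
```

The fine-tuned model is then trained to emit the rationale-plus-answer target given the prompt, which is what transfers to zero-shot and few-shot settings.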
The dataset was constructed by pooling 60 existing robot datasets from 34 robotic research labs around the world.
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Darren LI added
Following the open data movement, often embracing a not-for-profit philosophy, many data sets are available online from fields like biodiversity, business, cartography, chemistry, genomics, and medicine. Look at one central index, www.kdnuggets.com/datasets, and you’ll see what amounts to lists of lists of data resources.