What is Pandalyst?
Pandalyst is a general large language model specifically trained to process and analyze data using the pandas library.
How is Pandalyst?
Pandalyst has strong generalization capabilities for data tables in different fields and different data analysis needs.
Why is Pandalyst?
Pandalyst is open source and free to use, and its small parameter…
pipizhao/Pandalyst-7B-V1.2 · Hugging Face
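Pandalyst's job is to turn a natural-language question into pandas code and run it. A hypothetical illustration of the kind of code it generates for a question like "What is the average salary per department?" (the table and question are made up, not from the model card):

```python
import pandas as pd

# Hypothetical table of the kind Pandalyst operates on.
df = pd.DataFrame({
    "department": ["eng", "eng", "sales", "sales"],
    "salary": [100, 120, 80, 90],
})

# For the question above, Pandalyst would emit and execute
# pandas code along these lines:
result = df.groupby("department")["salary"].mean()
print(result.to_dict())  # {'eng': 110.0, 'sales': 85.0}
```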
Nicolay Gerold added
DataTrove
DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory…
huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
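The core idea is composable processing blocks: each block consumes a document stream and emits a transformed one, and blocks chain into a pipeline. A minimal library-free sketch of that pattern (illustrative only, not DataTrove's actual API):

```python
# Each "block" is a callable over a stream of documents; blocks compose
# into a pipeline. This mirrors the filter/dedup pattern conceptually.

def length_filter(docs, min_chars=10):
    """Drop documents shorter than min_chars."""
    return (d for d in docs if len(d) >= min_chars)

def exact_dedup(docs):
    """Drop exact duplicates, keeping first occurrences."""
    seen = set()
    for d in docs:
        if d not in seen:
            seen.add(d)
            yield d

def run_pipeline(docs, blocks):
    for block in blocks:
        docs = block(docs)
    return list(docs)

corpus = [
    "short",
    "a sufficiently long document",
    "a sufficiently long document",
    "another long enough document",
]
cleaned = run_pipeline(corpus, [length_filter, exact_dedup])
print(cleaned)  # two unique long documents survive
```

In DataTrove itself, readers, filters, and dedup stages slot into an executor the same way, which is what makes the pipelines portable between a laptop and a slurm cluster.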
RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals. In addition, we also provide the ids of duplicated documents which…
togethercomputer/RedPajama-Data-V2 · Datasets at Hugging Face
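The point of shipping quality signals alongside the text is that downstream users filter by thresholding precomputed signals instead of recomputing them. A sketch with invented signal names and thresholds (the real dataset ships many more signals under its own schema):

```python
# Each document carries precomputed quality signals; curation becomes
# a cheap thresholding pass. Signal names here are illustrative.
docs = [
    {"text": "good doc ...",  "signals": {"word_count": 500, "ccnet_perplexity": 120.0}},
    {"text": "spammy doc ...", "signals": {"word_count": 12,  "ccnet_perplexity": 950.0}},
]

def keep(doc, min_words=50, max_ppl=400.0):
    s = doc["signals"]
    return s["word_count"] >= min_words and s["ccnet_perplexity"] <= max_ppl

filtered = [d["text"] for d in docs if keep(d)]
print(filtered)  # only the high-quality document remains
```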
NeMo Curator
NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence…
GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation
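The scaling recipe is partition-parallelism: split the corpus into partitions and run each curation step on partitions concurrently (Dask + RAPIDS on GPUs in NeMo Curator's case). A CPU-only stdlib analogy of the same pattern, not NeMo Curator's API:

```python
from concurrent.futures import ThreadPoolExecutor

def curate_partition(docs):
    """One curation step: normalize whitespace and drop empty documents."""
    return [" ".join(d.split()) for d in docs if d.strip()]

# The corpus is pre-split into partitions; each is curated independently.
partitions = [["  hello   world ", ""], ["nemo  curator "]]

with ThreadPoolExecutor() as pool:
    curated = [doc for part in pool.map(curate_partition, partitions)
               for doc in part]
print(curated)  # ['hello world', 'nemo curator']
```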
Text embeddings are a critical piece of many pipelines, from search, to RAG, to vector databases and more. Most embedding models are BERT/Transformer-based and typically have short context lengths (e.g., 512). That's only about two pages of text, but documents can be very long – books, legal cases, TV screenplays, code repositories, etc. can be tens…
Long-Context Retrieval Models with Monarch Mixer
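The usual workaround for a short-context embedder is to split long documents into overlapping chunks and embed them piecewise (long-context models like Monarch Mixer aim to avoid this). A whitespace-token sketch of that chunking step, with illustrative sizes:

```python
def chunk(text, max_tokens=512, overlap=64):
    """Split text into overlapping windows of at most max_tokens words."""
    tokens = text.split()
    step = max_tokens - overlap
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

# A 1200-"token" document becomes three windows: 512, 512, and a 304-token tail,
# with each adjacent pair sharing 64 tokens of overlap.
doc = " ".join(f"tok{i}" for i in range(1200))
chunks = chunk(doc)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 [512, 512, 304]
```

Real pipelines chunk on model-specific tokenizer tokens rather than whitespace, but the windowing logic is the same.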
Repository for the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.84M CoT rationales extracted across 1,060 tasks.
Paper Link : https://arxiv.org/abs/2305.14045
kaistAI • GitHub - kaistAI/CoT-Collection: [Under Review] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
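CoT fine-tuning pairs each task input with a rationale followed by the answer, so the model learns to reason before answering. A sketch of assembling one such training record (field names and templates are illustrative, not the dataset's exact schema):

```python
def to_training_example(question, rationale, answer):
    """Format a (question, rationale, answer) triple as a CoT training record."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    target = f"{rationale} So the answer is {answer}."
    return {"prompt": prompt, "target": target}

ex = to_training_example(
    "If a pen costs 2 dollars, how much do 3 pens cost?",
    "Each pen costs 2 dollars and there are 3 pens, so 3 * 2 = 6 dollars.",
    "6 dollars",
)
print(ex["target"])
```

The fine-tuned model is then trained to emit the rationale-plus-answer target given the prompt, which is what transfers to zero-shot and few-shot settings.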
The dataset was constructed by pooling 60 existing robot datasets from 34 robotic research labs around the world.
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Darren LI added
Following the open data movement, often embracing a not-for-profit philosophy, many data sets are available online from fields like biodiversity, business, cartography, chemistry, genomics, and medicine. Look at one central index, www.kdnuggets.com/datasets, and you’ll see what amounts to lists of lists of data resources.