Datasets
Repository for the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.84M CoT rationales extracted across 1,060 tasks.
Paper Link : https://arxiv.org/abs/2305.14045
GitHub - kaistAI/CoT-Collection: [Under Review] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
ExpertQA - Expert-Curated Questions and Attributed Answers.
- 484 participants across 32 fields of study
- Experts evaluate and revise responses from LLMs
- Verified answers and attributions
RedPajama-V2 is an open dataset for training large language models. It includes over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents in the corpus additionally come with quality signals. The ids of duplicated documents are also provided.
togethercomputer/RedPajama-Data-V2 · Datasets at Hugging Face
This dataset is an attempt to replicate the results of Microsoft's Orca. It consists of:
- ~1 million FLANv2 examples augmented with GPT-4 completions (flan1m-alpaca-uncensored.jsonl)
- ~3.5 million FLANv2 examples augmented with GPT-3.5 completions (flan5m-alpaca-uncensored.jsonl)
The submix and system prompt distribution follow those outlined in the Orca paper.
ehartford/dolphin · Datasets at Hugging Face
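The file names above (flan1m-alpaca-uncensored.jsonl) suggest Alpaca-style JSON-Lines records. A minimal sketch of iterating over such a file, assuming instruction/input/output fields (the field names are an assumption based on the naming convention, not confirmed by the dataset card):

```python
import io
import json

def read_alpaca_jsonl(fp):
    """Yield one record (dict) per non-empty line of a JSON-Lines file object."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

# Synthetic example line; the real files' schema may differ.
sample = io.StringIO(
    '{"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}\n'
)
records = list(read_alpaca_jsonl(sample))
```

Streaming line by line this way keeps memory flat even for the multi-million-record files.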
TabLib
Access on Hugging Face (Sample, Full Dataset) · Read the TabLib paper
Introduction
Huge datasets have been critical to the performance of AI models for text and images. Similar advancements can be made for tabular data, which consists of tables of rows and columns, but the research community needs a bigger and more diverse corpus of tabular data.