Datasets
Repository for the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.84M CoT rationales extracted across 1,060 tasks.
Paper Link : https://arxiv.org/abs/2305.14045
GitHub - kaistAI/CoT-Collection: [Under Review] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning
ExpertQA - Expert-Curated Questions and Attributed Answers.
- 484 participants across 32 fields of study
- Experts evaluate and revise responses from LLMs
- Verified answers and attributions
RedPajama-V2 is an open dataset for training large language models. It includes over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents in the corpus additionally come with quality signals. The ids of duplicated documents are also provided.
togethercomputer/RedPajama-Data-V2 · Datasets at Hugging Face
This dataset is an attempt to replicate the results of Microsoft's Orca. It consists of:
- ~1 million FLANv2 examples augmented with GPT-4 completions (flan1m-alpaca-uncensored.jsonl)
- ~3.5 million FLANv2 examples augmented with GPT-3.5 completions (flan5m-alpaca-uncensored.jsonl)
The submix and system prompt distribution follow those outlined in the Orca paper.
ehartford/dolphin · Datasets at Hugging Face
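The file names above (flan1m-alpaca-uncensored.jsonl) suggest Alpaca-style JSON-Lines records. A minimal sketch of iterating over such a file, assuming instruction/input/output fields (the field names are an assumption based on the naming convention, not confirmed by the dataset card):

```python
import io
import json

def read_alpaca_jsonl(fp):
    """Yield one record (dict) per non-empty line of a JSON-Lines file object."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

# Synthetic example line; the real files' schema may differ.
sample = io.StringIO(
    '{"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}\n'
)
records = list(read_alpaca_jsonl(sample))
```

Streaming line by line this way keeps memory flat even for the multi-million-record files.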
TabLib
Access on Hugging Face (Sample, Full Dataset) · Read the TabLib paper
Introduction
Huge datasets have been critical to the performance of AI models for text and images. Similar advancements can be made for tabular data, which consists of tables of rows and columns, but the research community needs a bigger and more diverse corpus of tabular data.