This dataset is an attempt to replicate the results of Microsoft's Orca
Our dataset consists of:
~1 million of FLANv2 augmented with GPT-4 completions (flan1m-alpaca-uncensored.jsonl)
~3.5 million of FLANv2 augmented with GPT-3.5 completions (flan5m-alpaca-uncensored.jsonl)
We followed the submix and system prompt distribution outlined in the Orca pape... See more