GitHub - alibaba/data-juicer: A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷为大语言模型提供更高质量、更丰富、更易”消化“的数据！

alibaba github.com

Related Highlights

Original creator: Jesse Zhang (GH: emptycrown, Twitter: @thejessezhang), who courteously donated the repo to LlamaIndex!

This is a simple library of all the data loaders / readers / tools / llama-packs that have been created by the community. The goal is to make it extremely easy to connect large language models to a large variety of knowledge sour...

GitHub - run-llama/llama-hub: A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain

Nicolay Gerold added

GitHub - mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks

mit-han-lab github.com

Darren LI added

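datachain is a tool suited to processing large volumes of unstructured AI data. It supports multimodal API calls and parallel local AI inference, and can be combined with large models to carry out complex data analysis tasks such as data processing and cleaning, LLM analysis and validation, and image segmentation. It manages images, video, text, PDF, JSON, CSV, parquet and other formats in a unified way and automatically saves processing records and versions. It can read data directly from cloud storage such as Google Cloud and AWS, or from local disk, with no manual copying. It supports smart search and analysis, very large datasets, and parallel processing. GitHub: https://t.co/bJ9rKKuMBY...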

AIGCLINK

x.com

GitHub - FlowiseAI/Flowise: Drag & drop UI to build your customized LLM flow

github.com

Andrés added

DataTrove

DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.

DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory...

huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Nicolay Gerold added

Clean & curate your data with LLMs

databonsai is a Python library that uses LLMs to perform data cleaning tasks.

Features

Suite of tools for data processing using LLMs including categorization, transformation, and extraction

Validation of LLM outputs

Batch processing for token savings

Retry logic with exponential backoff for handling rate limits an

databonsai • GitHub - databonsai/databonsai: clean & curate your data with LLMs.

Nicolay Gerold added

uniflow provides a unified LLM interface to extract and transform raw documents.

Document types: Uniflow enables data extraction from PDFs, HTMLs and TXTs.

LLM agnostic: Uniflow supports most commonly used LLMs for text transformation, including
- OpenAI models (GPT-3.5 and GPT-4),
- Google Gemini models (Gemini 1.5, MultiModal),
- AWS Bedrock models,
- Huggingf

CambioML • GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...

Nicolay Gerold added

To train LLMs, you need data that is:

Large — Sufficiently large LMs require trillions of tokens.

Clean — Noisy data reduces performance.

Diverse — Data should come from different sources and different knowledge bases.

What does clean data look like?

You can de-duplicate data with simple heuristics. The most basic would be removing any exact duplicates ...
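
The exact-duplicate heuristic mentioned above can be illustrated with a minimal Python sketch. It is not taken from the source: the function names `normalize` and `drop_exact_duplicates` are hypothetical, and treating documents that differ only in whitespace or letter case as "exact" duplicates is an assumption made for illustration.

```python
import hashlib


def normalize(text: str) -> str:
    # Assumption: collapse runs of whitespace and lowercase the text,
    # so trivially different copies hash to the same value.
    return " ".join(text.split()).lower()


def drop_exact_duplicates(docs):
    """Keep the first occurrence of each document, in input order."""
    seen = set()
    unique = []
    for doc in docs:
        # Store a fixed-size digest instead of the full text to limit memory.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["a b", "a  b", "c", "A B", "c"]
print(drop_exact_duplicates(corpus))
```

Hashing keeps the `seen` set small for long documents. Near-duplicates (for example, pages that differ by a timestamp) would pass through this filter and would likely need a fuzzier technique.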

Shortwave — rajhesh.panchanadhan@gmail.com [Gmail alternative]

Nicolay Gerold added