GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...


CambioML github.com

Related Highlights

Indexify - Extraction and Retrieval from Videos, PDF and Audio for Interactive AI Applications

LLM applications backed by Indexify will never answer with outdated information.

Indexify is an open-source engine for building fast data pipelines for unstructured data (video, audio, images and documents) using re-usable extractors for embedding, transformatio...

tensorlakeai • GitHub - tensorlakeai/indexify: A scalable realtime and continuous indexing engine for Unstructured Data to build Generative AI Applications

Nicolay Gerold added

Data-Juicer: A One-Stop Data Processing System for Large Language Models

Data-Juicer is a one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs. This project is being actively updated and maintained, and we will periodically enhance and add more features and data recipes. We welcome you to join us in pro...

alibaba • GitHub - alibaba/data-juicer: A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷 Providing higher-quality, richer, and more "digestible" data for large language models!

Nicolay Gerold added

To train LLMs, you need data that is:

Large — Sufficiently large LMs require trillions of tokens.

Clean — Noisy data reduces performance.

Diverse — Data should come from different sources and different knowledge bases.

What does clean data look like?

You can de-duplicate data with simple heuristics. The most basic would be removing any exact duplicates ...
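As a rough illustration of the exact-duplicate heuristic mentioned above, here is a minimal Python sketch. The function name `dedupe_exact` is hypothetical, and collapsing whitespace before hashing is an assumption added here, not something the excerpt specifies.

```python
import hashlib


def dedupe_exact(docs):
    """Drop exact duplicates, keeping the first occurrence of each document."""
    seen = set()
    unique = []
    for doc in docs:
        # Assumption: collapse runs of whitespace so trivially different
        # copies hash to the same key. Remove this step for strict matching.
        normalized = " ".join(doc.split())
        # Store a fixed-size digest instead of the full text to save memory.
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

This only catches identical documents; near-duplicates would need a fuzzier technique, which is outside what this sketch covers.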

Shortwave — rajhesh.panchanadhan@gmail.com [Gmail alternative]

Nicolay Gerold added

Unites Research and Production — AdalFlow: The Library to Build and Auto-Optimize LLM Task Pipelines

adalflow.sylph.ai

added

DataTrove

DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.

DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a Slurm cluster. Its (relatively) low memory...

huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Nicolay Gerold added

Indexify is a reactive structured extraction engine for unstructured data.

Applications leveraging LLMs for autonomous planning or queries necessitate timely index updates aligned with data changes or new extraction methods. Indexify enables both, by applying feature extractors on data in real-time and updating one or many indexes.

Why use Indexify

tensorlakeai • GitHub - tensorlakeai/indexify: A scalable realtime and continuous indexing engine for Unstructured Data to build Generative AI Applications

Nicolay Gerold added

Connect external data to LLMs, no matter the source.

The universal retrieval engine for LLMs to access unstructured data from any source.

Carbon | Data Connectors for LLMs

Nicolay Gerold added

Clean & curate your data with LLMs

databonsai is a Python library that uses LLMs to perform data cleaning tasks.

Features

Suite of tools for data processing using LLMs including categorization, transformation, and extraction

Validation of LLM outputs

Batch processing for token savings

Retry logic with exponential backoff for handling rate limits an
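To make the retry feature in the list above concrete, here is a generic Python sketch of retrying with exponential backoff. It is not databonsai's API: `RateLimitError`, `call_with_backoff`, and every parameter name are hypothetical, and the jitter and delay cap are assumptions chosen for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Double the wait on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Assumption: add up to 10% random jitter so that many clients
            # do not all retry at the same instant.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the waits can be recorded or skipped in tests instead of actually pausing.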

databonsai • GitHub - databonsai/databonsai: clean & curate your data with LLMs.

Nicolay Gerold added