GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...

tensorlakeai GitHub - tensorlakeai/indexify: A scalable realtime and continuous indexing engine for Unstructured Data to build Generative AI Applications

alibaba GitHub - alibaba/data-juicer: A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Shortwave — rajhesh.panchanadhan@gmail.com [Gmail alternative]

Unites Research and Production — AdalFlow: The Library to Build and Auto-Optimize LLM Task Pipelines

adalflow.sylph.ai

huggingface GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Carbon | Data Connectors for LLMs

databonsai GitHub - databonsai/databonsai: clean & curate your data with LLMs.