Data Loading

Data Engineering The Open Data Stack Distilled into Four Core Tools

Carbon | Data Connectors for LLMs

WebDataset

CambioML GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...

The Warehouse Native Customer Data Platform

huggingface GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

Instill AI

tensorlakeai GitHub - tensorlakeai/indexify: A scalable realtime and continuous indexing engine for Unstructured Data to build Generative AI Applications

Bill Mill notes.billmill.org