Data Loading

GitHub - Stirling-Tools/Stirling-PDF: #1 Locally hosted web application that allows you to perform various operations on PDF files

Stability and scalability for search

Carbon | Data Connectors for LLMs

GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

GitHub - samuelcolvin/watchfiles: Simple, modern and fast file watching and code reload in python.

GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy

GitHub - Unstructured-IO/unstructured: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

The Warehouse Native Customer Data Platform

WebDataset