Data Loading

GitHub - google/magika: Detect file content types with deep learning

Instill AI

Bap Our 5 favourite open-source customer data platforms

Core Concepts | Airbyte Documentation

Bill Mill notes.billmill.org

Stability and scalability for search

GitHub - VikParuchuri/surya: OCR, layout analysis, reading order, line detection in 90+ languages

WebDataset

GitHub - Filimoa/open-parse: Improved file parsing for LLM’s