Data Loading

jina-ai GitHub - jina-ai/reader: Convert any URL to an LLM-friendly input ...

CambioML GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...

Bap Our 5 favourite open-source customer data platforms

Stability and scalability for search

Unstructured-IO GitHub - Unstructured-IO/unstructured: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

VikParuchuri GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy

Stirling-Tools GitHub - Stirling-Tools/Stirling-PDF: #1 Locally hosted web application that allows you to perform various operations on PDF files

The Warehouse Native Customer Data Platform