Data Loading
Your LLMs deserve better input.
Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/ . Get improved output for your agent and RAG systems at no cost.
Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/ . Get improved output for your agent and RAG systems at no cost.
- Live demo: https://jina.ai/reader
- Or just visit these URLs https://r.jina.ai/https://github.com/jina-ai/reader, https://r.jina.ai/https://x.com/elonmusk and see
jina-ai • jina-ai/reader: Convert any URL to an LLM-friendly input ... - GitHub
uniflow provides a unified LLM interface to extract and transform and raw documents.
- Document types: Uniflow enables data extraction from PDFs, HTMLs and TXTs.
- LLM agnostic: Uniflow supports most common-used LLMs for text tranformation, including
- OpenAI models (GPT3.5 and GPT4),
- Google Gemini models (Gemini 1.5, MultiModal),
- AWS BedRock models,
- Huggingf
CambioML • GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...
Snowplow
1️⃣ Snowplow is made with developers in mind. It currently offers over 20 SDKs to get data from the web, mobile, and server-side applications.
2️⃣ The known feature of Snowplow is the use of its unique schema-based approach and validation process. Its architecture ensures reliable data.
3️⃣ Snowplow supports integration with multiple data... See more
1️⃣ Snowplow is made with developers in mind. It currently offers over 20 SDKs to get data from the web, mobile, and server-side applications.
2️⃣ The known feature of Snowplow is the use of its unique schema-based approach and validation process. Its architecture ensures reliable data.
3️⃣ Snowplow supports integration with multiple data... See more
Bap • Our 5 favourite open-source customer data platforms
The solution: The ingestion service
To meet these unique demands, the Search Infrastructure team implemented the Ingestion Service to gracefully handle Twitter’s traffic trends. The Ingestion Service queues requests from the client service into a single Kafka topic per Elasticsearch cluster. Worker clients then read from this topic and send the... See more
To meet these unique demands, the Search Infrastructure team implemented the Ingestion Service to gracefully handle Twitter’s traffic trends. The Ingestion Service queues requests from the client service into a single Kafka topic per Elasticsearch cluster. Worker clients then read from this topic and send the... See more
Stability and scalability for search
Open-Source Pre-Processing Tools for Unstructured Data
The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular... See more
The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular... See more
Unstructured-IO • GitHub - Unstructured-IO/unstructured: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
1️⃣ RudderStack provides data pipelines to collect data from applications, websites and SaaS platforms.
2️⃣ Its API architecture and SDKs ensure developers can gather data from different sources and leverage them into their applications without disruptions.
3️⃣ RudderStack is highly versatile and integrates with over 90+ tools and data warehouse... See more
2️⃣ Its API architecture and SDKs ensure developers can gather data from different sources and leverage them into their applications without disruptions.
3️⃣ RudderStack is highly versatile and integrates with over 90+ tools and data warehouse... See more
Bap • Our 5 favourite open-source customer data platforms
Marker
Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
Marker converts PDF, EPUB, and MOBI to markdown. It's 10x faster than nougat, more accurate on most documents, and has low hallucination risk.
- Support for a range of PDF documents (optimized for books and scientific papers)
- Removes headers/footers/other artifacts
- Converts most equations to latex
- Formats code blocks and tables
- Support for multiple
VikParuchuri • GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy
Data Extraction Stack
This is a robust, locally hosted web-based PDF manipulation tool using Docker. It enables you to carry out various operations on PDF files, including splitting, merging, converting, reorganizing, adding images, rotating, compressing, and more. This locally hosted web application has evolved to encompass a comprehensive set of features, addressing... See more
GitHub - Stirling-Tools/Stirling-PDF: #1 Locally hosted web application that allows you to perform various operations on PDF files
Collect , unify , and activate customer data
RudderStack makes it easy to collect and send customer data to the tools and teams that need it
RudderStack makes it easy to collect and send customer data to the tools and teams that need it