Data Loading
ETL
The part of the system I'm most proud of, and on which I spent the most effort, is the ETL process.
We had a series of shell scripts for each data source we ingested (there were many), which would pull the data and put it in an S3 bucket.
Then, early in the morning, a cron job would spin up an EC2 instance, which would pull in the latest ETL code...
Bill Mill • notes.billmill.org
Nicolay Gerold added 4mo
Surya
Surya is a document OCR toolkit that does:
- OCR in 90+ languages that benchmarks favorably vs cloud services
- Line-level text detection in any language
- Layout analysis (table, image, header, etc. detection)
- Reading order detection
GitHub - VikParuchuri/surya: OCR, layout analysis, reading order, line detection in 90+ languages
Nicolay Gerold added 5mo
The solution: The ingestion service
To meet these unique demands, the Search Infrastructure team implemented the Ingestion Service to gracefully handle Twitter’s traffic trends. The Ingestion Service queues requests from the client service into a single Kafka topic per Elasticsearch cluster. Worker clients then read from this topic and send the req...
Stability and scalability for search
Nicolay Gerold added 6mo
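The queueing idea in the excerpt above (one topic per Elasticsearch cluster, with workers reading from it) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: a standard-library `queue.Queue` stands in for a Kafka topic, and the `IngestionService` class, the `drain` method, and the `batch_size` parameter are hypothetical names chosen for illustration, not the original service's design.

```python
import queue
from collections import defaultdict


class IngestionService:
    """Toy model: one FIFO queue per cluster stands in for one Kafka topic."""

    def __init__(self):
        # Hypothetical layout: cluster name -> queue of pending requests.
        self._topics = defaultdict(queue.Queue)

    def enqueue(self, cluster, request):
        """Client side: queue a request instead of sending it directly."""
        self._topics[cluster].put(request)

    def drain(self, cluster, batch_size):
        """Worker side: read up to batch_size requests in arrival order."""
        topic = self._topics[cluster]
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(topic.get_nowait())
            except queue.Empty:
                break
        return batch


service = IngestionService()
for i in range(5):
    service.enqueue("cluster-a", {"op": "index", "id": i})
service.enqueue("cluster-b", {"op": "index", "id": 99})

first = service.drain("cluster-a", batch_size=3)
second = service.drain("cluster-a", batch_size=3)
print([r["id"] for r in first], [r["id"] for r in second])
```

Because the queue absorbs bursts, a worker can pull requests at whatever pace the downstream cluster tolerates; here the batch size is the only throttle, which is an assumption of this sketch.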
Indexify - Extraction and Retrieval from Videos, PDF and Audio for Interactive AI Applications
Indexify is an open-source engine for building fast data pipelines for unstructured data (video, audio, images and documents) using re-usable extractors for embedding, transformatio...
LLM applications backed by Indexify will never answer with outdated information.
tensorlakeai • GitHub - tensorlakeai/indexify: A scalable realtime and continuous indexing engine for Unstructured Data to build Generative AI Applications
Nicolay Gerold added 6mo
This is a robust, locally hosted web-based PDF manipulation tool using Docker. It enables you to carry out various operations on PDF files, including splitting, merging, converting, reorganizing, adding images, rotating, compressing, and more. This locally hosted web application has evolved to encompass a comprehensive set of features, addressing a...
GitHub - Stirling-Tools/Stirling-PDF: #1 Locally hosted web application that allows you to perform various operations on PDF files
Nicolay Gerold added 7mo
Snowplow
1️⃣ Snowplow is made with developers in mind. It currently offers over 20 SDKs to get data from the web, mobile, and server-side applications.
2️⃣ Snowplow's best-known feature is its schema-based approach and validation process. Its architecture ensures reliable data.
3️⃣ Snowplow supports integration with multiple data st...
Bap • Our 5 favourite open-source customer data platforms
Nicolay Gerold added 7mo
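To make the schema-based validation idea from the Snowplow card concrete, here is a minimal sketch of checking events against a declared schema before they are accepted. It does not use Snowplow's SDKs or schema format; `PAGE_VIEW_SCHEMA`, `validate`, and the field names are hypothetical and exist only for illustration.

```python
# Hypothetical schema: field name -> required Python type.
PAGE_VIEW_SCHEMA = {"user_id": str, "url": str, "duration_ms": int}


def validate(event, schema):
    """Return a list of problems; an empty list means the event conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            actual = type(event[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


events = [
    {"user_id": "u1", "url": "/pricing", "duration_ms": 1200},
    {"user_id": "u2", "url": "/docs"},                      # missing a field
    {"user_id": "u3", "url": "/blog", "duration_ms": "5"},  # wrong type
]

# Split the stream so malformed events never reach downstream storage.
good = [e for e in events if not validate(e, PAGE_VIEW_SCHEMA)]
bad = [(e, validate(e, PAGE_VIEW_SCHEMA)) for e in events if validate(e, PAGE_VIEW_SCHEMA)]
print(len(good), len(bad))
```

Routing failures into a separate collection, rather than dropping them, keeps the rejected events available for inspection; that choice is part of this sketch, not a claim about how Snowplow handles them.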
1️⃣ RudderStack provides data pipelines to collect data from applications, websites and SaaS platforms.
2️⃣ Its API architecture and SDKs ensure developers can gather data from different sources and leverage them into their applications without disruptions.
3️⃣ RudderStack is highly versatile and integrates with 90+ tools and data warehouse dest...
Bap • Our 5 favourite open-source customer data platforms
Nicolay Gerold added 7mo
Your LLMs deserve better input.
Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/ . Get improved output for your agent and RAG systems at no cost.
- Live demo: https://jina.ai/reader
- Or just visit these URLs https://r.jina.ai/https://github.com/jina-ai/reader, https://r.jina.ai/https://x.com/elonmusk and see yourse
jina-ai • jina-ai/reader: Convert any URL to an LLM-friendly input ... - GitHub
Nicolay Gerold added 7mo
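The prefix scheme described in the Reader card can be wrapped in a small helper. This is a minimal sketch: the `https://r.jina.ai/` prefix comes from the excerpt, while the `reader_url` function name and the guard against double-prefixing are assumptions added for illustration.

```python
READER_PREFIX = "https://r.jina.ai/"


def reader_url(url: str) -> str:
    """Prepend the Reader prefix to a URL, leaving already-prefixed URLs alone."""
    if url.startswith(READER_PREFIX):
        return url
    return READER_PREFIX + url


print(reader_url("https://github.com/jina-ai/reader"))
# prints https://r.jina.ai/https://github.com/jina-ai/reader
```

The resulting URL can then be requested with any HTTP client; no network call is made in the sketch itself.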
Easily chunk complex documents the same way a human would.
Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.
Open Parse is designed to fill this gap by providing a flexible, e...
Filimoa • GitHub - Filimoa/open-parse: Improved file parsing for LLM’s
Nicolay Gerold added 7mo
uniflow provides a unified LLM interface to extract and transform raw documents.
- Document types: Uniflow enables data extraction from PDFs, HTMLs and TXTs.
- LLM agnostic: Uniflow supports the most commonly used LLMs for text transformation, including
- OpenAI models (GPT3.5 and GPT4),
- Google Gemini models (Gemini 1.5, MultiModal),
- AWS BedRock models,
- Huggingf
CambioML • GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...
Nicolay Gerold added 8mo