Data Loading
Magika
Magika is a novel AI-powered file type detection tool that relies on recent advances in deep learning to provide accurate detection. Under the hood, Magika employs a custom, highly optimized Keras model that weighs only about 1MB and enables precise file identification within milliseconds, even when running on a single CPU.
In an evaluati...
google • GitHub - google/magika: Detect file content types with deep learning
The backbone for versatile AI
Meet Instill Cloud, a no-code/low-code platform that accelerates AI application development by 10x. Effortlessly connect to diverse data sources, seamlessly integrate AI models, and deploy customized logic for your projects, no matter how complex, with lightning speed.
Instill AI
Snowplow
1️⃣ Snowplow is made with developers in mind. It currently offers over 20 SDKs for collecting data from web, mobile, and server-side applications.
2️⃣ Snowplow is best known for its schema-based approach and validation process. Its architecture is designed to ensure reliable data.
3️⃣ Snowplow supports integration with multiple data st...
Bap • Our 5 favourite open-source customer data platforms
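The schema-based validation mentioned in the Snowplow excerpt can be illustrated with a minimal sketch. This is not Snowplow's actual schema format or API: the `PAGE_VIEW_SCHEMA` dictionary, `validate_event`, and `route_events` are hypothetical names, and the idea shown is only the general one of checking each event against a declared schema and separating valid events from invalid ones.

```python
from typing import Any

# Hypothetical schema: field name -> (expected type, required?).
# Real Snowplow schemas use a different format; this is only an illustration.
PAGE_VIEW_SCHEMA: dict[str, tuple[type, bool]] = {
    "user_id": (str, True),
    "url": (str, True),
    "duration_ms": (int, False),
}


def validate_event(event: dict[str, Any], schema: dict[str, tuple[type, bool]]) -> list[str]:
    """Return a list of validation errors (an empty list means the event is valid)."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in event:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(event[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    # Reject fields the schema does not declare.
    for field in event:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


def route_events(events, schema):
    """Split events into those that pass validation and those that fail."""
    good, bad = [], []
    for event in events:
        errors = validate_event(event, schema)
        if errors:
            bad.append((event, errors))
        else:
            good.append(event)
    return good, bad
```

Keeping the failed events together with their error messages, rather than dropping them, makes it possible to inspect and repair bad data later.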
Airbyte enables you to build data pipelines and replicate data from a source to a destination. You can configure how frequently the data is synced, what data is replicated, and how the data is written to the destination.
This page describes the concepts you need to know to use Airbyte.
Source
A source is an API, file, database, or data warehouse t...
Core Concepts | Airbyte Documentation
ETL
The part of the system I'm most proud of, and on which I spent the most effort, is the ETL process.
We had a series of shell scripts for each data source we ingested (there were many), which would pull the data and put it in an S3 bucket.
Then, early in the morning, a cron job would spin up an EC2 instance, which would pull in the latest ETL code...
Bill Mill • notes.billmill.org
The solution: The ingestion service
To meet these unique demands, the Search Infrastructure team implemented the Ingestion Service to gracefully handle Twitter’s traffic trends. The Ingestion Service queues requests from the client service into a single Kafka topic per Elasticsearch cluster. Worker clients then read from this topic and send the req...
Stability and scalability for search
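The queue-then-worker pattern described in the excerpt can be sketched in a few lines. This is a rough illustration under stated assumptions, not Twitter's implementation: an in-process `queue.Queue` stands in for each Kafka topic, the `send` callback is a hypothetical stand-in for the Elasticsearch client, and the `IngestionService` class and its method names are invented for the example.

```python
import queue
import threading
from typing import Any, Callable


class IngestionService:
    """Buffers requests in one queue per cluster and drains them with workers."""

    def __init__(self, cluster_names: list[str], send: Callable[[str, Any], None]):
        # One queue per cluster, standing in for one Kafka topic per cluster.
        self._queues = {name: queue.Queue() for name in cluster_names}
        self._send = send  # hypothetical callback that delivers a request to a cluster

    def submit(self, cluster: str, request: Any) -> None:
        """Called by the client service: enqueue instead of hitting the cluster directly."""
        self._queues[cluster].put(request)

    def start(self, workers_per_cluster: int = 2) -> None:
        """Start daemon worker threads that read from each queue."""
        for cluster, q in self._queues.items():
            for _ in range(workers_per_cluster):
                worker = threading.Thread(target=self._work, args=(cluster, q), daemon=True)
                worker.start()

    def _work(self, cluster: str, q: queue.Queue) -> None:
        while True:
            request = q.get()  # blocks until a request is available
            try:
                self._send(cluster, request)
            finally:
                q.task_done()

    def drain(self) -> None:
        """Block until every queued request has been handed to `send`."""
        for q in self._queues.values():
            q.join()
```

Because clients only enqueue, a burst of traffic grows the queue rather than overwhelming the cluster; the number of workers bounds how fast requests reach it.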
Surya
Surya is a document OCR toolkit that does:
- OCR in 90+ languages that benchmarks favorably vs cloud services
- Line-level text detection in any language
- Layout analysis (detection of tables, images, headers, etc.)
- Reading order detection
GitHub - VikParuchuri/surya: OCR, layout analysis, reading order, line detection in 90+ languages
WebDataset
WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.
The WebDataset format
A WebDataset file is a TAR archive containing a series of data files. All successive data files with the same prefix are consider...
WebDataset
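The TAR layout described in the WebDataset excerpt can be read with the Python standard library alone. The sketch below does not use the WebDataset library itself; it assumes that consecutive files sharing a prefix (the name up to the first dot of the basename) are meant to be grouped into one sample, which the excerpt is cut off before stating. The helper names `split_key` and `iter_samples` are hypothetical.

```python
import tarfile
from typing import BinaryIO, Iterator


def split_key(name: str) -> tuple[str, str]:
    """Split 'dir/0001.seg.png' into the prefix 'dir/0001' and the extension 'seg.png'."""
    directory, _, base = name.rpartition("/")
    prefix, _, ext = base.partition(".")
    key = f"{directory}/{prefix}" if directory else prefix
    return key, ext


def iter_samples(fileobj: BinaryIO) -> Iterator[tuple[str, dict[str, bytes]]]:
    """Yield (prefix, {extension: bytes}) for each run of files with the same prefix."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        current_key, sample = None, {}
        for member in tar:  # members are read sequentially, in archive order
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if key != current_key:
                # Prefix changed: the previous sample is complete.
                if sample:
                    yield current_key, sample
                current_key, sample = key, {}
            sample[ext] = tar.extractfile(member).read()
        if sample:
            yield current_key, sample
```

Reading members strictly in order is what makes this kind of layout friendly to sequential I/O: a sample is complete as soon as the prefix changes, so no index or random access is needed.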
Easily chunk complex documents the same way a human would.
Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.
Open Parse is designed to fill this gap by providing a flexible, e...