Data Loading
The backbone for versatile AI
Meet Instill Cloud, a no-code/low-code platform that accelerates AI application development by 10x. Effortlessly connect to diverse data sources, seamlessly integrate AI models, and deploy customized logic for your projects, no matter how complex, with lightning speed.
Instill AI
Nicolay Gerold added 9mo
Open-Source Pre-Processing Tools for Unstructured Data
The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular...
Unstructured-IO • GitHub - Unstructured-IO/unstructured: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Nicolay Gerold added 10mo
Connect external data to LLMs, no matter the source.
The universal retrieval engine for LLMs to access unstructured data from any source.
Carbon | Data Connectors for LLMs
Nicolay Gerold added 9mo
ETL
The part of the system I'm most proud of, and on which I spent the most effort, is the ETL process.
We had a series of shell scripts for each data source we ingested (there were many), which would pull the data and put it in an S3 bucket.
Then, early in the morning, a cron job would spin up an EC2 instance, which would pull in the latest ETL code...
Bill Mill • notes.billmill.org
Nicolay Gerold added 4mo
Your LLMs deserve better input.
Reader converts any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/ . Get improved output for your agent and RAG systems at no cost.
- Live demo: https://jina.ai/reader
- Or just visit these URLs https://r.jina.ai/https://github.com/jina-ai/reader, https://r.jina.ai/https://x.com/elonmusk and see yourse
jina-ai • jina-ai/reader: Convert any URL to an LLM-friendly input ... - GitHub
Nicolay Gerold added 7mo
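The Reader prefix described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own client: `to_reader_url` and `fetch_llm_friendly` are hypothetical helper names, and the only thing taken from the source is the `https://r.jina.ai/` prefix. Whether the service needs extra headers or applies rate limits is not checked here.

```python
from urllib.request import urlopen

# The prefix quoted in the Reader description above.
READER_PREFIX = "https://r.jina.ai/"


def to_reader_url(url: str) -> str:
    """Prepend the Reader prefix, leaving already-prefixed URLs unchanged."""
    if url.startswith(READER_PREFIX):
        return url
    return READER_PREFIX + url


def fetch_llm_friendly(url: str, timeout: float = 30.0) -> str:
    """Fetch the Reader rendering of a page as text (requires network access)."""
    # Assumption: a plain GET is enough; headers and rate limits are not handled.
    with urlopen(to_reader_url(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(to_reader_url("https://github.com/jina-ai/reader"))
```

Keeping the URL construction separate from the network call makes the prefixing logic easy to check offline.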
watchfiles
Simple, modern, and high-performance file watching and code reload in Python.
Documentation: watchfiles.helpmanual.io
Source Code: github.com/samuelcolvin/watchfiles
Underlying file system notifications are handled by the Notify Rust library.
This package was previously named "watchgod"; see the migration guide for more information.
samuelcolvin • GitHub - samuelcolvin/watchfiles: Simple, modern and fast file watching and code reload in python.
Nicolay Gerold added 9mo
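For the watchfiles entry above, a minimal usage sketch might look like the following. It relies on the library's `watch` function, which blocks and yields sets of `(Change, path)` tuples; `describe` and `watch_directory` are hypothetical helper names added for illustration, and the directory argument is an assumption.

```python
from pathlib import Path


def describe(changes):
    """Turn a set of (change, path) pairs into sorted, readable lines."""
    lines = []
    for change, path in changes:
        # watchfiles reports changes as an enum (added, modified, deleted);
        # fall back to str() for anything without a .name attribute.
        label = getattr(change, "name", str(change))
        lines.append(f"{label}: {Path(path).name}")
    return sorted(lines)


def watch_directory(directory="."):
    """Print a line for every file system change under `directory`."""
    # Imported lazily so describe() can be used without watchfiles installed.
    from watchfiles import watch

    # watch() blocks and yields one set of (Change, path) tuples per batch.
    for changes in watch(directory):
        for line in describe(changes):
            print(line)


if __name__ == "__main__":
    watch_directory(".")
```

Running the script and editing a file in the current directory should print one line per detected change; stop it with Ctrl+C.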
Indexify is a reactive structured extraction engine for unstructured data.
Applications leveraging LLMs for autonomous planning or queries necessitate timely index updates aligned with data changes or new extraction methods. Indexify enables both, by applying feature extractors on data in real-time and updating one or many indexes.
Why use Indexify
tensorlakeai • GitHub - tensorlakeai/indexify: A scalable realtime and continuous indexing engine for Unstructured Data to build Generative AI Applications
Nicolay Gerold added 10mo
WebDataset
WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.
The WebDataset format
A WebDataset file is a TAR archive containing a series of data files. All successive data files with the same prefix are consider...
WebDataset
Nicolay Gerold added 10mo
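The TAR layout described in the WebDataset entry above can be illustrated with the standard library alone. The sketch below does not use the webdataset package; it assumes that successive files sharing a prefix belong to one sample and that the prefix is everything before the first dot in the file's base name. `build_shard`, `split_name`, and `iter_samples` are hypothetical helper names, and the `__key__` field is borrowed as a naming convention.

```python
import io
import tarfile
from itertools import groupby


def build_shard(files):
    """Write (name, bytes) pairs into an in-memory TAR archive, in order."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def split_name(name):
    """Split 'dir/0001.cls.txt' into prefix 'dir/0001' and extension 'cls.txt'."""
    directory, _, base = name.rpartition("/")
    stem, _, extension = base.partition(".")
    prefix = f"{directory}/{stem}" if directory else stem
    return prefix, extension


def iter_samples(shard):
    """Read the archive sequentially and group successive same-prefix files."""
    with tarfile.open(fileobj=shard, mode="r") as tar:
        files = ((m.name, tar.extractfile(m).read()) for m in tar if m.isfile())
        for prefix, group in groupby(files, key=lambda item: split_name(item[0])[0]):
            sample = {"__key__": prefix}
            for name, payload in group:
                sample[split_name(name)[1]] = payload
            yield sample


if __name__ == "__main__":
    shard = build_shard([
        ("0001.jpg", b"img1"), ("0001.json", b"{}"),
        ("0002.jpg", b"img2"), ("0002.json", b"{}"),
    ])
    for sample in iter_samples(shard):
        print(sample["__key__"], sorted(k for k in sample if k != "__key__"))
```

Because grouping only looks at neighbouring entries, the archive can be read in a single sequential pass, which is the property that makes this layout suitable for streaming.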
Snowplow
1️⃣ Snowplow is made with developers in mind. It currently offers over 20 SDKs for collecting data from web, mobile, and server-side applications.
2️⃣ Snowplow is best known for its schema-based approach and validation process. Its architecture ensures reliable data.
3️⃣ Snowplow supports integration with multiple data st...
Bap • Our 5 favourite open-source customer data platforms
Nicolay Gerold added 7mo
Easily chunk complex documents the same way a human would.
Chunking documents is a challenging task that underpins any RAG system. High-quality results are critical to a successful AI application, yet most open-source libraries are limited in their ability to handle complex documents.
Open Parse is designed to fill this gap by providing a flexible, e...
Filimoa • GitHub - Filimoa/open-parse: Improved file parsing for LLM’s
Nicolay Gerold added 7mo