GitHub - dlt-hub/dlt: data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Original creator : Jesse Zhang (GH: emptycrown, Twitter: @thejessezhang), who courteously donated the repo to LlamaIndex!
This is a simple library of all the data loaders / readers / tools / llama-packs that have been created by the community. The goal is to make it extremely easy to connect large language models to a large variety of knowledge sour...
GitHub - run-llama/llama-hub: A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
![Thumbnail of Discover, Download, and Run Local LLMs](https://s3.amazonaws.com/public-storage-prod.startupy.com/media/images/thumbnails/curation/e748a243ffdd41a8b32e6f546c545c54/thumbnail.png)
LlamaHub is a library of data loaders, readers, and tools created by the LlamaIndex community. It provides utilities to easily connect LLMs to diverse knowledge sources.
Ben Auffarth • Generative AI with LangChain: Build large language model (LLM) apps with Python, ChatGPT, and other LLMs
DataTrove
DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a Slurm cluster. Its (relatively) low memory...
huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
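The "processing blocks" idea in the excerpt above can be illustrated in plain Python. This is only a sketch of the general pattern (composable filter and deduplication steps over a stream of documents); it does not use DataTrove's actual API, and the names `min_length_filter`, `exact_dedup`, and `run_pipeline` are hypothetical.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

# A "block" here is any callable that takes a stream of documents
# and yields a (possibly smaller) stream of documents.
Block = Callable[[Iterable[str]], Iterator[str]]


def min_length_filter(min_chars: int) -> Block:
    """Hypothetical filter block: drop documents shorter than min_chars."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        for doc in docs:
            if len(doc.strip()) >= min_chars:
                yield doc
    return block


def exact_dedup() -> Block:
    """Hypothetical dedup block: keep the first copy of each normalized text."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        seen = set()
        for doc in docs:
            # Normalize case and whitespace before hashing.
            normalized = " ".join(doc.lower().split())
            key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                yield doc
    return block


def run_pipeline(docs: Iterable[str], blocks: List[Block]) -> List[str]:
    """Chain the blocks lazily, then materialize the result."""
    stream: Iterable[str] = docs
    for block in blocks:
        stream = block(stream)
    return list(stream)


docs = ["Hello world", "hello   WORLD", "hi", "A longer document about data."]
cleaned = run_pipeline(docs, [min_length_filter(5), exact_dedup()])
```

Because each block is a generator, documents flow through one at a time, which is one simple way to keep memory use low on large inputs.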
The last core data stack tool is the orchestrator. It is used to model dependencies between tasks end-to-end in complex, heterogeneous cloud environments, and it integrates with the open data stack tools mentioned above. Orchestrators are especially effective if you have some glue code that needs to be run on a certain cadence, trigg...
Data Engineering • The Open Data Stack Distilled into Four Core Tools
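To make "modeling dependencies between tasks" concrete, here is a minimal sketch using only Python's standard-library `graphlib`. It is not how any particular orchestrator is implemented; the task names and the `run_in_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "report": {"transform"},
}


def run_in_order(deps, actions):
    """Run each task only after all of its dependencies have run."""
    order = list(TopologicalSorter(deps).static_order())
    for name in order:
        actions[name]()
    return order


log = []
# Each action just records its own name; real tasks would do actual work.
actions = {name: (lambda n=name: log.append(n)) for name in dependencies}
order = run_in_order(dependencies, actions)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.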
Clean & curate your data with LLMs
databonsai is a Python library that uses LLMs to perform data cleaning tasks.
Features
- Suite of tools for data processing using LLMs including categorization, transformation, and extraction
- Validation of LLM outputs
- Batch processing for token savings
- Retry logic with exponential backoff for handling rate limits an
databonsai • GitHub - databonsai/databonsai: clean & curate your data with LLMs.
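The "retry logic with exponential backoff" feature listed above follows a well-known pattern, sketched below in plain Python. This is not databonsai's implementation or API: `RateLimitError`, `retry_with_backoff`, and all parameter names and defaults are assumptions for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for whatever error an LLM client raises when rate limited."""


def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Double the wait on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter is injectable so the wait schedule can be checked without actually pausing, for example by passing `delays.append` in place of `time.sleep`.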
jupyter-naas/awesome-notebooks: Data & AI Notebook ... - GitHub
Data science teams can use Baseten to efficiently serve, integrate, design, and ship their custom machine learning models with ease. A key benefit of Baseten is that it collapses the innovation cycle for ML apps, resulting in cheaper experimentation and greater success. It unblocks ML efforts currently bottlenecked by infrastructure, frontend, and ...