GitHub - IBM/unitxt: 🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
Open-Source Pre-Processing Tools for Unstructured Data
The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular... See more
The unstructured library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. unstructured modular... See more
Unstructured-IO • GitHub - Unstructured-IO/unstructured: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Nicolay Gerold added
From a software maintenance perspective there is little consensus on how to organize ML projects. It feels like websites before Rails came out: a bunch of random PHP scripts with an unholy mixture of business logic and markup sprinkled throughout.
tinyclouds.org • Google Brain Residency
Jimmy Cerone added
.pocket Wow. Still the same today. Big opportunity. Though folks like Hugging Face are starting to change this.
Original creator : Jesse Zhang (GH: emptycrown, Twitter: @thejessezhang), who courteously donated the repo to LlamaIndex!
This is a simple library of all the data loaders / readers / tools / llama-packs that have been created by the community. The goal is to make it extremely easy to connect large language models to a large variety of knowledge sour... See more
This is a simple library of all the data loaders / readers / tools / llama-packs that have been created by the community. The goal is to make it extremely easy to connect large language models to a large variety of knowledge sour... See more
GitHub - run-llama/llama-hub: A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
Nicolay Gerold added
NeMo Curator
NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model conv... See more
NeMo Curator is a Python library specifically designed for scalable and efficient dataset preparation. It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model conv... See more
GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation
Nicolay Gerold added
uniflow provides a unified LLM interface to extract and transform and raw documents.
- Document types: Uniflow enables data extraction from PDFs, HTMLs and TXTs.
- LLM agnostic: Uniflow supports most common-used LLMs for text tranformation, including
- OpenAI models (GPT3.5 and GPT4),
- Google Gemini models (Gemini 1.5, MultiModal),
- AWS BedRock models,
- Huggingf
CambioML • GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...
Nicolay Gerold added
Data science teams can use Baseten to efficiently serve, integrate, design, and ship their custom machine learning models with ease. A key benefit of Baseten is that it collapses the innovation cycle for ML apps, resulting in cheaper experimentation and greater success. It unblocks ML efforts currently bottlenecked by infrastructure, frontend, and ... See more
Jason Risch • Self-Serve Apps for ML Teams | Greylock
Mo Shafieeha added
DataTrove
DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory... See more
DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory... See more
huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Nicolay Gerold added
Overview
MaxText is a high performance , highly scalable , open-source LLM written in pure Python/Jax and targeting Google Cloud TPUs and GPUs for training and inference . MaxText achieves high MFUs and scales from single host to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler.
MaxText... See more
MaxText is a high performance , highly scalable , open-source LLM written in pure Python/Jax and targeting Google Cloud TPUs and GPUs for training and inference . MaxText achieves high MFUs and scales from single host to very large clusters while staying simple and "optimization-free" thanks to the power of Jax and the XLA compiler.
MaxText... See more
google • GitHub - google/maxtext: A simple, performant and scalable Jax LLM!
Nicolay Gerold added