GitHub - IBM/unitxt: 🦄 Unitxt: a python library for getting data fired up and set for training and evaluation

Unstructured-IO GitHub - Unstructured-IO/unstructured: Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

tinyclouds.org Google Brain Residency

GitHub - run-llama/llama-hub: A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain

GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation

CambioML GitHub - CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster...

Jason Risch Self-Serve Apps for ML Teams | Greylock

huggingface GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

google GitHub - google/maxtext: A simple, performant and scalable Jax LLM!