GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation
We are happy to announce Curator, an open-source library designed to streamline synthetic data generation!
High-quality synthetic data generation is essential in training and evaluating LLMs/agents/RAG pipelines these days, but tooling around this is still entirely lacking!
So we built...
Mahesh Sathiamoorthy • x.com
Clean & curate your data with LLMs
databonsai is a Python library that uses LLMs to perform data cleaning tasks.
Features
- Suite of tools for data processing using LLMs including categorization, transformation, and extraction
- Validation of LLM outputs
- Batch processing for token savings
- Retry logic with exponential backoff for handling rate limits and
databonsai • GitHub - databonsai/databonsai: clean & curate your data with LLMs.
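The "retry logic with exponential backoff" feature listed above can be sketched generically as follows. This is a minimal illustration of the technique, not databonsai's actual implementation or API: the function name `retry_with_backoff` and all of its parameters are hypothetical.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Run call() and retry on failure, doubling the wait after each attempt.

    Hypothetical helper for illustration only; not part of databonsai.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Exponential growth (1x, 2x, 4x, ...) capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

In practice `call` would wrap a single LLM request, and `retry_on` would be narrowed to the client's rate-limit exception rather than every `Exception`.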
Model Explorer is a powerful graph visualization tool that helps you understand, debug, and optimize ML models. It specializes in visualizing large graphs in an intuitive, hierarchical format, but it also works well for smaller models.
Graph visualization plays a pivotal role in the machine learning (ML) development process. Visual representations...
Model Explorer: Graph visualization for large model development
Crawl the web in an LLM-friendly style!
Introducing Crawl4AI 🤖🕷️, a web data crawler that extracts semantically labeled chunks into JSON, along with clean HTML and markdown for RAG, fine-tuning, and AI chatbots.
This open-source tool offers efficient crawling and multi-URL support....
Unclecode (Hossein) • x.com