GitHub - NVIDIA/NeMo-Curator: Scalable toolkit for data curation
We are happy to announce Curator, an open-source library designed to streamline synthetic data generation!
High-quality synthetic data generation is essential in training and evaluating LLMs/agents/RAG pipelines these days, but tooling around this is still entirely lacking!
So we built...
Mahesh Sathiamoorthy, x.com
Crawl the web in an LLM-friendly style!
Introducing Crawl4AI 🤖🕷️, a web data crawler that extracts semantically labeled chunks into JSON, along with clean HTML and markdown for RAG, fine-tuning, and AI chatbots.
This open-source tool offers efficient crawling and multi-URL support....
Unclecode (Hossein), x.com