"My benchmark for large language models"
https://t.co/YZBuwpL0tl
Nice post, but even more than the 100 tests specifically, the GitHub code looks excellent: a full-featured test evaluation framework, easy to extend with further tests and run against many...
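Not the actual code from the linked repo, just a minimal sketch of what such an extensible evaluation harness might look like: each test pairs a prompt with a pass/fail checker, and the runner scores every registered model against every test. All names here (Test, run_suite, the stub model) are illustrative assumptions.

```python
# Minimal sketch of an extensible LLM test-evaluation harness (illustrative only;
# not the code from the linked repo). Each test pairs a prompt with a checker,
# and the runner scores every registered model against every test.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Test:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the model's answer passes


# Hypothetical model interface: a callable mapping a prompt to a completion.
Model = Callable[[str], str]


def run_suite(models: Dict[str, Model], tests: List[Test]) -> Dict[str, float]:
    """Run every test against every model and return pass rates per model."""
    results = {}
    for model_name, model in models.items():
        passed = sum(1 for t in tests if t.check(model(t.prompt)))
        results[model_name] = passed / len(tests)
    return results


# Adding a new test is just appending to the list:
tests = [
    Test("capital-of-france", "What is the capital of France?",
         lambda out: "paris" in out.lower()),
]

if __name__ == "__main__":
    # Stub "model" so the sketch runs without any API keys.
    echo_model: Model = lambda prompt: "Paris"
    print(run_suite({"echo": echo_model}, tests))
```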
Andrew Ng said "Our eval tools aren’t ready for LLMs."
@LangWatchAI Evaluations Wizard solves this by simulating real-world interactions and running 30+ evaluators on your LLM app.
Works even if you have no eval dataset!
100% open-source. https://t.co/osImmD7bmw
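The "no eval dataset" part is the interesting bit: inputs are simulated rather than collected. A rough sketch of that idea in plain Python (not the LangWatch Evaluations Wizard API; the generator and evaluators below are hypothetical stand-ins):

```python
# Rough sketch of dataset-free evaluation: simulate user inputs, run the app,
# then apply a battery of evaluators. Illustrative only; not LangWatch's API.
from typing import Callable, Dict, List

App = Callable[[str], str]               # your LLM app: user message -> reply
Evaluator = Callable[[str, str], float]  # (input, output) -> score in [0, 1]


def simulate_inputs(n: int) -> List[str]:
    """Stand-in for simulated real-world interactions (hypothetical generator)."""
    return [f"Simulated user question #{i}" for i in range(n)]


def evaluate(app: App, evaluators: Dict[str, Evaluator],
             n_samples: int = 10) -> Dict[str, float]:
    """Average each evaluator's score over the simulated conversations."""
    inputs = simulate_inputs(n_samples)
    outputs = [(msg, app(msg)) for msg in inputs]
    return {
        name: sum(ev(msg, out) for msg, out in outputs) / len(outputs)
        for name, ev in evaluators.items()
    }


if __name__ == "__main__":
    demo_app: App = lambda msg: f"Answer to: {msg}"
    evaluators = {
        "non_empty": lambda _msg, out: float(bool(out.strip())),
        "echoes_id": lambda msg, out: float(msg.split("#")[-1] in out),
    }
    print(evaluate(demo_app, evaluators))
```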