"My benchmark for large language models"
https://t.co/YZBuwpL0tl
Nice post, but even more than the 100 tests specifically, the GitHub code looks excellent: a full-featured test evaluation framework, easy to extend with further tests and run against many...
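Not the actual code from the linked repo, just a minimal sketch of what such an extensible evaluation harness might look like: each test pairs a prompt with a pass/fail checker, and the runner scores every registered model against every test. All names here (Test, run_suite, the stub model) are illustrative assumptions.

```python
# Minimal sketch of an extensible LLM test-evaluation harness (illustrative only;
# not the code from the linked repo). Each test pairs a prompt with a checker,
# and the runner scores every registered model against every test.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Test:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the model's answer passes


# Hypothetical model interface: a callable mapping a prompt to a completion.
Model = Callable[[str], str]


def run_suite(models: Dict[str, Model], tests: List[Test]) -> Dict[str, float]:
    """Run every test against every model and return pass rates per model."""
    results = {}
    for model_name, model in models.items():
        passed = sum(1 for t in tests if t.check(model(t.prompt)))
        results[model_name] = passed / len(tests)
    return results


# Adding a new test is just appending to the list:
tests = [
    Test("capital-of-france", "What is the capital of France?",
         lambda out: "paris" in out.lower()),
]

if __name__ == "__main__":
    # Stub "model" so the sketch runs without any API keys.
    echo_model: Model = lambda prompt: "Paris"
    print(run_suite({"echo": echo_model}, tests))
```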
Andrew Ng said "Our eval tools aren’t ready for LLMs."
@LangWatchAI Evaluations Wizard solves this by simulating real-world interactions and running 30+ evaluators on your LLM app.
Works even if you have no eval dataset!
100% open-source. https://t.co/osImmD7bmw
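The "no eval dataset" part is the interesting bit: inputs are simulated rather than collected. A rough sketch of that idea in plain Python (not the LangWatch Evaluations Wizard API; the generator and evaluators below are hypothetical stand-ins):

```python
# Rough sketch of dataset-free evaluation: simulate user inputs, run the app,
# then apply a battery of evaluators. Illustrative only; not LangWatch's API.
from typing import Callable, Dict, List

App = Callable[[str], str]               # your LLM app: user message -> reply
Evaluator = Callable[[str, str], float]  # (input, output) -> score in [0, 1]


def simulate_inputs(n: int) -> List[str]:
    """Stand-in for simulated real-world interactions (hypothetical generator)."""
    return [f"Simulated user question #{i}" for i in range(n)]


def evaluate(app: App, evaluators: Dict[str, Evaluator],
             n_samples: int = 10) -> Dict[str, float]:
    """Average each evaluator's score over the simulated conversations."""
    inputs = simulate_inputs(n_samples)
    outputs = [(msg, app(msg)) for msg in inputs]
    return {
        name: sum(ev(msg, out) for msg, out in outputs) / len(outputs)
        for name, ev in evaluators.items()
    }


if __name__ == "__main__":
    demo_app: App = lambda msg: f"Answer to: {msg}"
    evaluators = {
        "non_empty": lambda _msg, out: float(bool(out.strip())),
        "echoes_id": lambda msg, out: float(msg.split("#")[-1] in out),
    }
    print(evaluate(demo_app, evaluators))
```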