Pytest for LLM Apps!
DeepEval turns LLM evaluation into a two-line test suite, helping you identify the best models, prompts, and architectures for your AI workflow.
Works with any framework, including LlamaIndex, LangChain, CrewAI, and more.
100% open-source.
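For a sense of what "two-line test suite" means in practice, here's a minimal sketch using DeepEval's documented pytest integration. The metric choice, threshold, and test strings are illustrative; exact class names can vary across versions.

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # One test case: the prompt sent to your app and the output it produced
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    # The "two lines": score the output with an LLM-based metric
    # and fail the test if it falls below the threshold
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

You can run this like any other pytest file, or through DeepEval's own CLI wrapper (`deepeval test run ...`), which adds per-metric reporting on top of the standard test output.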
Wrote an intro to evals for long-context Q&A systems:
• How it differs from basic Q&A
• What dimensions & metrics to eval on
• How to build LLM evaluators (a minimal sketch follows this list)
• How to build eval datasets
• Benchmarks: narratives, technical docs, …
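As a taste of the LLM-evaluator part, here is a minimal, hypothetical sketch of an LLM-as-judge for faithfulness in long-context Q&A. The prompt, judge model, helper name, and 1-5 scoring scale are illustrative assumptions, not the article's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a long-context Q&A system.
Question: {question}
Source context: {context}
Model answer: {answer}

Score the answer's faithfulness to the context from 1 (contradicts or invents facts)
to 5 (fully supported by the context). Reply with the number only."""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    """Hypothetical LLM judge: returns a 1-5 faithfulness score."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer
            ),
        }],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())
```

In practice you'd run an evaluator like this over a labeled eval dataset and check its agreement with human judgments before trusting its scores.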