Pytest for LLM Apps!
DeepEval turns LLM evaluation into a two-line test suite, helping you identify the best models, prompts, and architectures for your AI workflow.
Works with any framework, including LlamaIndex, LangChain, CrewAI, and more.
100% open-source.
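For a sense of what "two-line test suite" means in practice, here's a minimal sketch using DeepEval's documented pytest integration. The metric choice, threshold, and test strings are illustrative; exact class names can vary across versions.

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    # One test case: the prompt sent to your app and the output it produced
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    # The "two lines": score the output with an LLM-based metric
    # and fail the test if it falls below the threshold
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

You can run this like any other pytest file, or through DeepEval's own CLI wrapper (`deepeval test run ...`), which adds per-metric reporting on top of the standard test output.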
Wrote an intro to evals for long-context Q&A systems:
• How it differs from basic Q&A
• What dimensions & metrics to eval on
• How to build LLM evaluators (a minimal sketch follows this list)
• How to build eval datasets
• Benchmarks: narratives, technical docs, …
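As a taste of the LLM-evaluator part, here is a minimal, hypothetical sketch of an LLM-as-judge for faithfulness in long-context Q&A. The prompt, judge model, helper name, and 1-5 scoring scale are illustrative assumptions, not the article's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a long-context Q&A system.
Question: {question}
Source context: {context}
Model answer: {answer}

Score the answer's faithfulness to the context from 1 (contradicts or invents facts)
to 5 (fully supported by the context). Reply with the number only."""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    """Hypothetical LLM judge: returns a 1-5 faithfulness score."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer
            ),
        }],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())
```

In practice you'd run an evaluator like this over a labeled eval dataset and check its agreement with human judgments before trusting its scores.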