Beyond Vibe Checks: A PM’s Complete Guide to Evals
The biggest bottleneck in building superintelligence is that AI agents are not yet very good at evaluating how well they are doing at a given goal. If they could better self-assess, they could self-improve, and digital self-improvement loops could lead to superintelligence. Making progress on importing particular human taste/judgement into LLMs could…
What was learned, not only from the project but from the process? Were the tools and practices appropriate? Were questions usefully framed? Did we measure against meaningful models? Did flawed assumptions, biased mindsets, or outdated paradigms remain embedded in the process, or did outcomes best reflect what was learned? Could some newly learned approach…
Scott Smith • How to Future
Evaluation metrics powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine (see the sketch after this list):
- G-Eval
- Summarization
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- RAGAS
- Hallucination
- Toxicity
- Bias
- etc.
GitHub - confident-ai/deepeval: The LLM Evaluation Framework
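
To make the list above concrete, here is a minimal sketch of running one of these metrics with deepeval. It assumes `pip install deepeval` and an `OPENAI_API_KEY` in the environment (the metrics default to an OpenAI judge model); the input and output strings are placeholders.

```python
# Minimal deepeval sketch: score a single LLM answer for relevancy.
# Assumes `pip install deepeval` and OPENAI_API_KEY set, since the
# metric defaults to an OpenAI model as its judge.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A test case pairs the input your app received with the output it produced.
test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="We ship within 2-3 business days to most regions.",
)

# threshold is the pass/fail cutoff for the 0-1 relevancy score.
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)

# Or run a whole suite of test cases and metrics at once:
evaluate([test_case], [metric])
```

The same pattern applies to the other metrics in the list (G-Eval, faithfulness, toxicity, and so on): construct a test case, pick a metric, measure or batch-evaluate.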
Developing Rapidly with Generative AI

Judgemental evaluation is important to establish whether goals have been met.
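
One common way to operationalize judgemental evaluation is LLM-as-judge: a second model grades an output against the stated goal. Below is a minimal sketch, assuming the `openai` Python package and an `OPENAI_API_KEY`; the model name, prompt wording, and PASS/FAIL protocol are illustrative choices, not from the source.

```python
# LLM-as-judge sketch: a second model decides whether an output met its goal.
# Assumes `pip install openai` and OPENAI_API_KEY; model name and the
# PASS/FAIL protocol are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's output.
Goal: {goal}
Output: {output}

Reply with PASS or FAIL on the first line, then a one-sentence reason."""


def judge(goal: str, output: str) -> tuple[bool, str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap in your own
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(goal=goal, output=output)}
        ],
        temperature=0,  # keep the verdict as deterministic as possible
    )
    verdict, _, reason = response.choices[0].message.content.partition("\n")
    return verdict.strip().upper().startswith("PASS"), reason.strip()


passed, reason = judge(
    goal="Summarize the refund policy in plain language.",
    output="You can return items within 30 days for a full refund.",
)
print(passed, reason)
```

A binary PASS/FAIL verdict with a written reason is deliberately simple: it is easy to aggregate across a test set, and the reason string gives you something to audit when the judge and a human disagree.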