Beyond Vibe Checks: A PM’s Complete Guide to Evals
Why your RAG system is failing despite "great" embedding scores
I just watched Kelly Hong from Chroma present their research on generative benchmarking, and it's a wake-up call for anyone building retrieval systems.
The uncomfortable truth: your embedding model might be crushing MTEB benchmarks …
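The general recipe behind generative benchmarking is to stop trusting public leaderboards and build a benchmark from your own corpus: have an LLM generate realistic queries from your documents, then check whether your embedding model can retrieve the source chunk. Here's a minimal sketch of that idea; this is not Chroma's exact method, and the model names, prompt, and recall@k metric are my assumptions.

```python
# Sketch of generative benchmarking: generate a query per chunk from your
# own corpus, then measure recall@k for retrieving the source chunk.
# Model names and the prompt are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_query(chunk: str) -> str:
    """Ask an LLM for a question a real user might type to find this chunk."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user",
                   "content": "Write one realistic search query a user might "
                              f"type whose answer is in this passage:\n\n{chunk}"}],
    )
    return resp.choices[0].message.content.strip()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def recall_at_k(chunks: list[str], k: int = 5) -> float:
    """Fraction of generated queries whose source chunk lands in the top k."""
    queries = [generate_query(c) for c in chunks]
    doc_vecs, query_vecs = embed(chunks), embed(queries)
    # cosine similarity via normalized dot products
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vecs /= np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sims = query_vecs @ doc_vecs.T            # (n_queries, n_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = sum(i in topk[i] for i in range(len(chunks)))
    return hits / len(chunks)
```

The point is that query i should retrieve chunk i; a model that tops MTEB but scores poorly on this self-generated set is failing on *your* data.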
Jason Liu · x.com
@chipro Trends I'm seeing are:
1. multi-stage evals (i.e. final result conditional upon subtasks)
2. retroactive evals (i.e. marking subtasks as failures after a complete run of other evals in a sequence)
3. eval pipeline circuit breakers (i.e. budget limits, la…)
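To make those patterns concrete, here's a minimal sketch of a multi-stage eval whose final verdict is conditional on its subtasks, with a budget circuit breaker. All names (`EvalStage`, `run_pipeline`, the dollar costs) are illustrative, not from Jason's setup.

```python
# Sketch of patterns 1 and 3 above: subtask evals run in sequence, the final
# result is conditional on earlier stages, and a budget cap short-circuits
# expensive stages. Costs and stage logic are placeholder assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalStage:
    name: str
    check: Callable[[dict], bool]   # returns True if the subtask passed
    cost_usd: float                 # estimated cost of running this stage

def run_pipeline(trace: dict, stages: list[EvalStage], budget_usd: float) -> dict:
    results, spent = {}, 0.0
    for stage in stages:
        if spent + stage.cost_usd > budget_usd:   # circuit breaker: skip stages
            results[stage.name] = "skipped: budget exceeded"
            continue
        spent += stage.cost_usd
        passed = stage.check(trace)
        results[stage.name] = "pass" if passed else "fail"
        if not passed:
            # multi-stage: the final result is conditional on every subtask
            results["final"] = f"fail (at {stage.name})"
            return results
    results["final"] = "pass"
    return results

stages = [
    EvalStage("retrieval", lambda t: t["retrieved_doc"] == t["gold_doc"], 0.00),
    EvalStage("grounding", lambda t: t["gold_doc"] in t["answer_sources"], 0.01),
    EvalStage("answer_quality", lambda t: len(t["answer"]) > 0, 0.05),
]
trace = {"retrieved_doc": "doc_7", "gold_doc": "doc_7",
         "answer_sources": ["doc_7"], "answer": "Doc 7 says ..."}
print(run_pipeline(trace, stages, budget_usd=0.10))
```

A retroactive pass (pattern 2) would revisit `results` after the full sequence and re-mark earlier subtasks as failures based on later outcomes; I've left that out to keep the sketch short.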
Alex Reibman 🖇️ · x.com
WTF are evals?
Evals are how you measure the quality and effectiveness of your AI system. They act like regression tests or benchmarks, clearly defining what “good” actually looks like for your AI product beyond the kind of simple latency or pass/fail checks you’d usually use for software.
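That regression-test framing is worth making literal: a fixed set of inputs with expected properties, scored on every change, with the build failing when the pass rate drops. A minimal sketch, where `my_ai_system` is a stand-in for whatever you're shipping and the 90% baseline is an arbitrary example:

```python
# Sketch of an eval as a regression test: a fixed dataset of inputs with
# expected properties, scored on every change to the system.
def my_ai_system(question: str) -> str:
    return "Paris is the capital of France."   # placeholder for your system

EVAL_SET = [
    {"input": "What is the capital of France?", "must_contain": "Paris"},
    {"input": "Capital of France?",             "must_contain": "Paris"},
]

def run_evals(system) -> float:
    """Returns the pass rate; fail the build if it drops below baseline."""
    passed = sum(case["must_contain"].lower() in system(case["input"]).lower()
                 for case in EVAL_SET)
    return passed / len(EVAL_SET)

score = run_evals(my_ai_system)
print(f"pass rate: {score:.0%}")
assert score >= 0.9, "regression: eval pass rate below baseline"
```

Substring checks like `must_contain` are the simplest scorer; in practice you'd swap in semantic matching or an LLM judge, but the regression-test shape stays the same.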