Beyond Vibe Checks: A PM’s Complete Guide to Evals
Why your RAG system is failing despite "great" embedding scores
I just watched Kelly Hong from Chroma present their research on generative benchmarking, and it's a wake-up call for anyone building retrieval systems.
The uncomfortable truth: your embedding model might be crushing MTEB benchmarks …
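The general recipe behind generative benchmarking is to stop trusting public leaderboards and build a benchmark from your own corpus: have an LLM generate realistic queries from your documents, then check whether your embedding model can retrieve the source chunk. Here's a minimal sketch of that idea; this is not Chroma's exact method, and the model names, prompt, and recall@k metric are my assumptions.

```python
# Sketch of generative benchmarking: generate a query per chunk from your
# own corpus, then measure recall@k for retrieving the source chunk.
# Model names and the prompt are illustrative assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_query(chunk: str) -> str:
    """Ask an LLM for a question a real user might type to find this chunk."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user",
                   "content": "Write one realistic search query a user might "
                              f"type whose answer is in this passage:\n\n{chunk}"}],
    )
    return resp.choices[0].message.content.strip()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def recall_at_k(chunks: list[str], k: int = 5) -> float:
    """Fraction of generated queries whose source chunk lands in the top k."""
    queries = [generate_query(c) for c in chunks]
    doc_vecs, query_vecs = embed(chunks), embed(queries)
    # cosine similarity via normalized dot products
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vecs /= np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sims = query_vecs @ doc_vecs.T            # (n_queries, n_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = sum(i in topk[i] for i in range(len(chunks)))
    return hits / len(chunks)
```

The point is that query i should retrieve chunk i; a model that tops MTEB but scores poorly on this self-generated set is failing on *your* data.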
Jason Liu · x.com
@chipro Trends I'm seeing are:
1. multi-stage evals (i.e. final result conditional upon subtasks)
2. retroactive evals (i.e. marking subtasks as failures after a complete run of other evals in a sequence)
3. eval pipeline circuit breakers (i.e. budget limits, la…)
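To make those patterns concrete, here's a minimal sketch of a multi-stage eval whose final verdict is conditional on its subtasks, with a budget circuit breaker. All names (`EvalStage`, `run_pipeline`, the dollar costs) are illustrative, not from Jason's setup.

```python
# Sketch of patterns 1 and 3 above: subtask evals run in sequence, the final
# result is conditional on earlier stages, and a budget cap short-circuits
# expensive stages. Costs and stage logic are placeholder assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalStage:
    name: str
    check: Callable[[dict], bool]   # returns True if the subtask passed
    cost_usd: float                 # estimated cost of running this stage

def run_pipeline(trace: dict, stages: list[EvalStage], budget_usd: float) -> dict:
    results, spent = {}, 0.0
    for stage in stages:
        if spent + stage.cost_usd > budget_usd:   # circuit breaker: skip stages
            results[stage.name] = "skipped: budget exceeded"
            continue
        spent += stage.cost_usd
        passed = stage.check(trace)
        results[stage.name] = "pass" if passed else "fail"
        if not passed:
            # multi-stage: the final result is conditional on every subtask
            results["final"] = f"fail (at {stage.name})"
            return results
    results["final"] = "pass"
    return results

stages = [
    EvalStage("retrieval", lambda t: t["retrieved_doc"] == t["gold_doc"], 0.00),
    EvalStage("grounding", lambda t: t["gold_doc"] in t["answer_sources"], 0.01),
    EvalStage("answer_quality", lambda t: len(t["answer"]) > 0, 0.05),
]
trace = {"retrieved_doc": "doc_7", "gold_doc": "doc_7",
         "answer_sources": ["doc_7"], "answer": "Doc 7 says ..."}
print(run_pipeline(trace, stages, budget_usd=0.10))
```

A retroactive pass (pattern 2) would revisit `results` after the full sequence and re-mark earlier subtasks as failures based on later outcomes; I've left that out to keep the sketch short.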
Alex Reibman 🖇️ · x.com
WTF are evals?
Evals are how you measure the quality and effectiveness of your AI system. They act like regression tests or benchmarks, clearly defining what “good” actually looks like for your AI product beyond the kind of simple latency or pass/fail checks you’d usually use for software.
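That regression-test framing is worth making literal: a fixed set of inputs with expected properties, scored on every change, with the build failing when the pass rate drops. A minimal sketch, where `my_ai_system` is a stand-in for whatever you're shipping and the 90% baseline is an arbitrary example:

```python
# Sketch of an eval as a regression test: a fixed dataset of inputs with
# expected properties, scored on every change to the system.
def my_ai_system(question: str) -> str:
    return "Paris is the capital of France."   # placeholder for your system

EVAL_SET = [
    {"input": "What is the capital of France?", "must_contain": "Paris"},
    {"input": "Capital of France?",             "must_contain": "Paris"},
]

def run_evals(system) -> float:
    """Returns the pass rate; fail the build if it drops below baseline."""
    passed = sum(case["must_contain"].lower() in system(case["input"]).lower()
                 for case in EVAL_SET)
    return passed / len(EVAL_SET)

score = run_evals(my_ai_system)
print(f"pass rate: {score:.0%}")
assert score >= 0.9, "regression: eval pass rate below baseline"
```

Substring checks like `must_contain` are the simplest scorer; in practice you'd swap in semantic matching or an LLM judge, but the regression-test shape stays the same.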