I’ve noticed that many GenAI application projects put in automated evaluations (evals) of the system’s output probably later, and rely on humans to judge outputs longer, than they should. This is because building evals is viewed as a massive investment (say, creating 100 or 1,000 examples, and designing and validating metrics) and there’s never a...
Andrew Ng, x.com