Thoughts
The Problem with Benchmarks
Benchmarks are often designed for papers, not products. The result? We get IFEval – a test that measures whether AI can make the letter "n" appear three times. Riveting.
Many datasets themselves are deeply flawed. That's why 30% of "Humanity's Last Exam" answers are wrong and 36% of HellaSwag contains errors. Garbage in, garbage out.
Benchmarks are broken. Here's why frontier labs treat them as PR.
Superintelligence teams rely on human evaluations. Humans can measure nuance, creativity, and wisdom – things benchmarks can't.
When I hear people say things like "our goal is to be at $5M ARR by 2024" I always get a little sad and uncomfortable. I don't have quantifiable goals like that. And if I do, any number I choose feels arbitrary. My goals are usually more qualitative - to create simple, delightful products that are meaningful for the people they serve.
Better tools are not the bottleneck to creating great work. A person focusing on one thing the antiquated way - pen and paper or whatever - will beat the tool optimizers any day. A new piece of technology isn’t going to magically hone your craft. The work to hone your craft feels hard because it’s supposed to be hard.
I believe “what effect do you want to have on people” is one of the most important questions we should ask when we are making something. Life isn't just a series of problems to be solved but experiences to be had.