Thoughts
Benchmarks are often designed for papers, not products. The result? We get IFEval – a test that measures whether AI can make the letter "n" appear three times. Riveting.
Many datasets themselves are deeply flawed. That's why 30% of "Humanity's Last Exam" answers are wrong and 36% of HellaSwag contains errors. Garbage in, garbage out.
Benchmarks are broken. Here's why frontier labs treat them as PR.
superintelligence teams rely on human evaluations. humans can measure nuance, creativity, and wisdom – things benchmarks can’t.
AI favors experts. The more expertise you have, the better you can wield AI as a tool.
even if you make something truly great, it will still take many years before society catches up and figures it out
When I hear people say things like "our goal is to be at $5M ARR by 2024" I always get a little sad and uncomfortable. I don't have quantifiable goals like that. And if I do, any number I choose feels arbitrary. My goals are usually more qualitative - to create simple, delightful products that are meaningful for the people they serve, to create…
Things I'm thinking about
you can always try to be more cheerful and charming to everyone around you
when everything is hyper-optimized, go raw
from Martina Navratilova
"Labels are for filing. Labels are for clothing. Labels are not for people."