As LLMs get smarter, evals need to get harder.
OpenAI’s o1 has already maxed out most major benchmarks.
Scale is partnering with CAIS to launch Humanity’s Last Exam: the toughest open-source benchmark for LLMs.
We're putting up $500K in prizes for the best...