Evaluating Large Language Models (LLMs) as agents in interactive environments, highlighting the performance gap between API-based and open-source models, and introducing the AgentBench benchmark.
the challenge. What is good? What is interesting? That part of the work is taste. “Taste is what enables designers to navigate the vast sea of possibilities that technology and global connectivity afford, and to then select and combine these elements in ways that, ideally, result in interesting, unique work.”