Autonomous agents

AgentBench: Evaluating LLMs as Agents

Evaluating Large Language Models (LLMs) as agents in interactive environments, highlighting the performance gap between API-based and open-source models, and introducing the AgentBench benchmark.

arxiv.org

Darren LI

With many 🧩 dropping recently, a more complete picture is emerging of LLMs not as a chatbot, but the kernel process of a new Operating System. E.g. today it orchestrates:
- Input & Output across modalities (text, audio, vision)
- Code interpreter, ability to write & run…

Andrej Karpathy

twitter.com

Darren LI

Embra was one of the first AI Agents startups. Today, we are renaming AI Agents to AI Commands, and narrowing our focus away from autonomous agents. While autonomous agents took off in popularity, we found they were often unreliable for work, inefficient, and unsafe. 🧵

Zach Tratar

twitter.com

Darren LI