Evaluation of large language models (LLMs) as agents in interactive environments, highlighting the performance gap between API-based and open-source models and introducing the AgentBench benchmark.
Analysis of the safety preparations and evaluations for GPT-4V, a multimodal language model with image analysis capabilities. Covers early access testing, red teaming, and mitigations for potential risks and limitations.
We still need critical thinking. Ethical thinking. Systematic thinking. We still need to foster relationships. To build bridges. To coordinate. To orchestrate. These are human things. These are the skills that designers and developers need to cultivate and grow in order to continue to be viable in our AI age.