The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.

The Anthropic founder's essay on AI interpretability is a must-read.
“AI will be central to the economy, technology, and national security… It’s unacceptable for humanity to be totally ignorant of how they work.”
This is why we both backed @goodfireAI to create the AI MRI scan. https://t.co/D5XGPRSTDw
ICYMI: With one of the most important technologies of the modern world — neural networks — we’re still effectively building blind, but new work is aiming to change that.
https://t.co/IFDx3Un1jl
“Powerful AI systems can help us interpret the neurons of weaker AI systems. And those interpretability insights often tell us a bit about how models work. And when they tell us how models work, they often suggest ways that those models could be better or more efficient.” —Dario Amodei, Anthropic