The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
— Anthropic (twitter.com)
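
To make the idea of decomposition concrete, here is a minimal sketch of the kind of sparse autoencoder used for this sort of work: activations from a group of neurons are expanded into a larger set of sparse, non-negative feature strengths and then reconstructed. The ReLU-plus-L1 objective, dimensions, and names are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dimensional activations into n_features sparse features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # feature read-out directions
        self.decoder = nn.Linear(n_features, d_model)   # feature write-back directions

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative feature strengths
        reconstruction = self.decoder(features)
        return features, reconstruction

def loss_fn(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruct the activations while keeping the feature vector sparse.
    recon_loss = (reconstruction - activations).pow(2).mean()
    sparsity_loss = features.abs().mean()     # L1 penalty: few features active at once
    return recon_loss + l1_coeff * sparsity_loss

# Example: decompose a batch of activations (sizes here are purely illustrative).
sae = SparseAutoencoder(d_model=512, n_features=4096)
acts = torch.randn(32, 512)                   # stand-in for real model activations
feats, recon = sae(acts)
loss = loss_fn(acts, feats, recon)
loss.backward()
```
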

This is a beautiful paper by Anthropic!
My intuition: neural networks are voting networks.
Imagine millions of entities voting: "In my incoming information, I detect this feature to be present with this strength".
Aggregate a pyramid of such votes and they add up to the output...
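
One way to make the voting picture concrete: each unit's "vote" is how strongly its feature direction fires on the input, and the next layer is a weighted sum of those votes. A toy numpy sketch, with made-up directions and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                                  # dimensionality of the incoming information
x = rng.normal(size=d)                 # incoming activations

# Each "voter" detects one feature direction; its vote strength is how strongly
# that direction is present in the input (a thresholded dot product).
feature_directions = rng.normal(size=(5, d))
votes = np.maximum(feature_directions @ x, 0.0)   # "I detect my feature with this strength"

# The next layer aggregates the votes with its own weights into its output.
aggregation_weights = rng.normal(size=(3, 5))
output = aggregation_weights @ votes

print("votes:", votes)
print("aggregated output:", output)
```
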
We're sharing progress toward understanding the neural activity of language models. We improved methods for training sparse autoencoders at scale, disentangling GPT-4’s internal representations into 16 million features—which often appear to correspond to understandable concepts.
https://t.co/tTRPztmra1
— OpenAI (x.com)
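
The linked paper describes how the autoencoders were scaled; purely as an illustration of one way to enforce sparsity without tuning an L1 coefficient, here is a sketch of a TopK activation that keeps only the k strongest features per example. The function name and sizes are hypothetical, not necessarily OpenAI's exact recipe.

```python
import torch

def topk_features(pre_activations: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest pre-activations per example and zero out the rest,
    so each input is explained by at most k features."""
    values, indices = pre_activations.topk(k, dim=-1)
    sparse = torch.zeros_like(pre_activations)
    sparse.scatter_(-1, indices, torch.relu(values))
    return sparse

# With millions of features, only a handful are allowed to fire per token
# (sizes here are tiny and purely illustrative).
pre_acts = torch.randn(4, 1024)        # encoder outputs before sparsification
features = topk_features(pre_acts, k=32)
print((features != 0).sum(dim=-1))     # at most 32 active features per row
```
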