The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
— Anthropic (twitter.com)
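
To make the idea of decomposition concrete, here is a minimal sketch of the kind of sparse autoencoder used for this sort of work: activations from a group of neurons are expanded into a larger set of sparse, non-negative feature strengths and then reconstructed. The ReLU-plus-L1 objective, dimensions, and names are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dimensional activations into n_features sparse features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # feature read-out directions
        self.decoder = nn.Linear(n_features, d_model)   # feature write-back directions

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative feature strengths
        reconstruction = self.decoder(features)
        return features, reconstruction

def loss_fn(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruct the activations while keeping the feature vector sparse.
    recon_loss = (reconstruction - activations).pow(2).mean()
    sparsity_loss = features.abs().mean()     # L1 penalty: few features active at once
    return recon_loss + l1_coeff * sparsity_loss

# Example: decompose a batch of activations (sizes here are purely illustrative).
sae = SparseAutoencoder(d_model=512, n_features=4096)
acts = torch.randn(32, 512)                   # stand-in for real model activations
feats, recon = sae(acts)
loss = loss_fn(acts, feats, recon)
loss.backward()
```
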

This is a beautiful paper by Anthropic!
My intuition: neural networks are voting networks.
Imagine millions of entities voting: "In my incoming information, I detect this feature to be present with this strength".
Aggregate a pyramid of such votes and they add up to the output...
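
One way to make the voting picture concrete: each unit's "vote" is how strongly its feature direction fires on the input, and the next layer is a weighted sum of those votes. A toy numpy sketch, with made-up directions and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                                  # dimensionality of the incoming information
x = rng.normal(size=d)                 # incoming activations

# Each "voter" detects one feature direction; its vote strength is how strongly
# that direction is present in the input (a thresholded dot product).
feature_directions = rng.normal(size=(5, d))
votes = np.maximum(feature_directions @ x, 0.0)   # "I detect my feature with this strength"

# The next layer aggregates the votes with its own weights into its output.
aggregation_weights = rng.normal(size=(3, 5))
output = aggregation_weights @ votes

print("votes:", votes)
print("aggregated output:", output)
```
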
We're sharing progress toward understanding the neural activity of language models. We improved methods for training sparse autoencoders at scale, disentangling GPT-4’s internal representations into 16 million features—which often appear to correspond to understandable concepts.
https://t.co/tTRPztmra1
— OpenAI (x.com)
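
The linked paper describes how the autoencoders were scaled; purely as an illustration of one way to enforce sparsity without tuning an L1 coefficient, here is a sketch of a TopK activation that keeps only the k strongest features per example. The function name and sizes are hypothetical, not necessarily OpenAI's exact recipe.

```python
import torch

def topk_features(pre_activations: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest pre-activations per example and zero out the rest,
    so each input is explained by at most k features."""
    values, indices = pre_activations.topk(k, dim=-1)
    sparse = torch.zeros_like(pre_activations)
    sparse.scatter_(-1, indices, torch.relu(values))
    return sparse

# With millions of features, only a handful are allowed to fire per token
# (sizes here are tiny and purely illustrative).
pre_acts = torch.randn(4, 1024)        # encoder outputs before sparsification
features = topk_features(pre_acts, k=32)
print((features != 0).sum(dim=-1))     # at most 32 active features per row
```
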