The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
Anthropic (twitter.com)
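The decomposition described here belongs to the family of sparse dictionary learning: a sparse autoencoder is trained to rewrite a group of neuron activations as a sparse combination of many learned feature directions. The sketch below is only illustrative; the layer sizes, penalty coefficient, and names are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dimensional neuron activations into a larger
    number of sparse features (illustrative sketch only)."""

    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative;
        # the L1 term in the loss below keeps them sparse.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on feature activations.
    mse = (reconstruction - activations).pow(2).mean()
    return mse + l1_coeff * features.abs().mean()
```

The hope is that each learned feature direction (a column of the decoder) is more interpretable than any individual neuron.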

This paper finds LLMs' ability to understand that others have different beliefs (Theory of Mind) comes from 0.001% of their parameters. Break those specific weights & the model loses both its ability to track what others know AND language comprehension.
Interesting implications. https://t.co/sBjG7L4eGZ
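The "break those specific weights" step is a targeted ablation. A hedged sketch of what such an intervention could look like follows; how the critical weights are identified is the paper's contribution and is not reproduced here, so the mask dictionary is a placeholder.

```python
import torch

def ablate_weights(model: torch.nn.Module, masks: dict[str, torch.Tensor]) -> None:
    """Zero out a pre-identified, tiny subset of weights in place.

    `masks` maps parameter names to boolean tensors marking the roughly
    0.001% of weights to knock out (placeholder: locating those weights
    is the hard part and is not shown here).
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param[masks[name]] = 0.0

# After ablation, the model would be re-evaluated on Theory-of-Mind tasks and
# on general language comprehension to see which capabilities degrade.
```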
Update on a new interpretable decomposition method for LLMs -- sparse mixtures of linear transforms (MOLT). Preliminary evidence suggests they may be more efficient, mechanistically faithful, and compositional than existing techniques like transcoders https://t.co/EgZhMB2IUU
Jack Lindsey (x.com)
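The tweet names the method but not its details. One plausible reading of "sparse mixture of linear transforms" is a bank of linear maps of which only a few, chosen by a sparse gate, act on each input; the sketch below is a guess at that general shape, and the transform count, top-k gating rule, and all names are assumptions rather than the actual MOLT design.

```python
import torch
import torch.nn as nn

class SparseMixtureOfLinearTransforms(nn.Module):
    """Guess at the general shape: many candidate linear maps, of which
    only the top-k (by a learned gate) are applied to each input."""

    def __init__(self, d_in: int, d_out: int, n_transforms: int = 64, k: int = 4):
        super().__init__()
        self.transforms = nn.Parameter(0.02 * torch.randn(n_transforms, d_out, d_in))
        self.gate = nn.Linear(d_in, n_transforms)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in). Score every transform, keep only the k highest.
        scores = self.gate(x)                              # (batch, n_transforms)
        top_vals, top_idx = scores.topk(self.k, dim=-1)    # (batch, k)
        weights = torch.softmax(top_vals, dim=-1)          # (batch, k)
        selected = self.transforms[top_idx]                # (batch, k, d_out, d_in)
        projected = torch.einsum("bkoi,bi->bko", selected, x)
        return (weights.unsqueeze(-1) * projected).sum(dim=1)  # (batch, d_out)
```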