The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
Anthropic (twitter.com)
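The decomposition described here belongs to the family of sparse dictionary learning: a sparse autoencoder is trained to rewrite a group of neuron activations as a sparse combination of many learned feature directions. The sketch below is only illustrative; the layer sizes, penalty coefficient, and names are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose d_model-dimensional neuron activations into a larger
    number of sparse features (illustrative sketch only)."""

    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative;
        # the L1 term in the loss below keeps them sparse.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on feature activations.
    mse = (reconstruction - activations).pow(2).mean()
    return mse + l1_coeff * features.abs().mean()
```

The hope is that each learned feature direction (a column of the decoder) is more interpretable than any individual neuron.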

This paper finds LLMs' ability to understand that others have different beliefs (Theory of Mind) comes from 0.001% of their parameters. Break those specific weights & the model loses both its ability to track what others know AND language comprehension.
Interesting implications. https://t.co/sBjG7L4eGZ
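The "break those specific weights" step is a targeted ablation. A hedged sketch of what such an intervention could look like follows; how the critical weights are identified is the paper's contribution and is not reproduced here, so the mask dictionary is a placeholder.

```python
import torch

def ablate_weights(model: torch.nn.Module, masks: dict[str, torch.Tensor]) -> None:
    """Zero out a pre-identified, tiny subset of weights in place.

    `masks` maps parameter names to boolean tensors marking the roughly
    0.001% of weights to knock out (placeholder: locating those weights
    is the hard part and is not shown here).
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param[masks[name]] = 0.0

# After ablation, the model would be re-evaluated on Theory-of-Mind tasks and
# on general language comprehension to see which capabilities degrade.
```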
Update on a new interpretable decomposition method for LLMs -- sparse mixtures of linear transforms (MOLT). Preliminary evidence suggests they may be more efficient, mechanistically faithful, and compositional than existing techniques like transcoders https://t.co/EgZhMB2IUU
Jack Lindsey (x.com)
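The tweet names the method but not its details. One plausible reading of "sparse mixture of linear transforms" is a bank of linear maps of which only a few, chosen by a sparse gate, act on each input; the sketch below is a guess at that general shape, and the transform count, top-k gating rule, and all names are assumptions rather than the actual MOLT design.

```python
import torch
import torch.nn as nn

class SparseMixtureOfLinearTransforms(nn.Module):
    """Guess at the general shape: many candidate linear maps, of which
    only the top-k (by a learned gate) are applied to each input."""

    def __init__(self, d_in: int, d_out: int, n_transforms: int = 64, k: int = 4):
        super().__init__()
        self.transforms = nn.Parameter(0.02 * torch.randn(n_transforms, d_out, d_in))
        self.gate = nn.Linear(d_in, n_transforms)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in). Score every transform, keep only the k highest.
        scores = self.gate(x)                              # (batch, n_transforms)
        top_vals, top_idx = scores.topk(self.k, dim=-1)    # (batch, k)
        weights = torch.softmax(top_vals, dim=-1)          # (batch, k)
        selected = self.transforms[top_idx]                # (batch, k, d_out, d_in)
        projected = torch.einsum("bkoi,bi->bko", selected, x)
        return (weights.unsqueeze(-1) * projected).sum(dim=1)  # (batch, d_out)
```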