The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
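
The decomposition referred to in this excerpt is done with dictionary learning, typically a sparse autoencoder trained on recorded neuron activations. The sketch below is a minimal illustration under that assumption; the class names, dimensions, and hyperparameters are arbitrary choices for the example, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: maps d_model neuron activations to an
    overcomplete set of n_features sparse features and reconstructs the input."""

    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Non-negative feature activations; the sparsity penalty below pushes
        # most of them to zero for any given input.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the features faithful to the original neurons;
    # the L1 term encourages each example to activate only a few features.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Illustrative training step on a batch of stand-in activations.
sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
batch = torch.randn(64, 512)  # placeholder for activations recorded from a model
features, reconstruction = sae(batch)
loss = sae_loss(batch, features, reconstruction)
loss.backward()
optimizer.step()
```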

Anthropic can now track the bizarre inner workings of a large language model

Will Douglas Heaven, technologyreview.com

On the Biology of a Large Language Model

transformer-circuits.pub