The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
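The post doesn't spell the method out here, but a common instantiation of this kind of decomposition is a sparse autoencoder trained to reconstruct neuron activations through an overcomplete, sparsity-penalized feature layer. A minimal sketch, not the paper's exact setup; all sizes and the L1 coefficient below are assumed:

```python
# Minimal sparse-autoencoder sketch (PyTorch). Dimensions and the sparsity
# coefficient are illustrative assumptions, not values from the linked work.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)   # neurons -> overcomplete features
        self.decoder = nn.Linear(n_features, n_neurons)   # features -> reconstructed neurons

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

model = SparseAutoencoder()
acts = torch.randn(64, 512)                               # stand-in batch of neuron activations
recon, feats = model(acts)
l1_coeff = 1e-3                                           # sparsity penalty strength (assumed)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
loss.backward()
```

The L1 term pushes most feature activations to zero, so each input is explained by a small number of features, which is what makes the learned features candidates for interpretation.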

I built a tool that can help open the black box of AIs.
It is called ‘Programmed Conceptual Deconstruction’. It systematically breaks down an initial concept (Ci) into its constituent, related concepts, iterating until further decomposition is no longer possible.
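Read literally, the loop described in the comment is a worklist-style expansion. A minimal sketch, assuming a hypothetical `get_subconcepts` helper (the comment doesn't say how sub-concepts are produced; here a toy lookup table stands in, where in practice it might be an LLM call):

```python
# Minimal sketch of the iterative decomposition loop described above.
# `get_subconcepts` is a hypothetical stand-in for the real decomposer.

def decompose(initial_concept: str, get_subconcepts) -> dict[str, list[str]]:
    """Break a concept into constituent concepts, expanding each new concept
    in turn until nothing in the frontier can be decomposed further."""
    tree: dict[str, list[str]] = {}
    frontier = [initial_concept]
    while frontier:
        concept = frontier.pop()
        if concept in tree:            # already expanded; also avoids cycles
            continue
        children = get_subconcepts(concept)
        tree[concept] = children       # an empty list marks an atomic concept
        frontier.extend(children)
    return tree

# Toy example: a hand-written concept graph standing in for the real tool.
toy_graph = {
    "justice": ["fairness", "law"],
    "fairness": ["impartiality"],
}
print(decompose("justice", lambda c: toy_graph.get(c, [])))
# {'justice': ['fairness', 'law'], 'law': [], 'fairness': ['impartiality'], 'impartiality': []}
```

The `concept in tree` check is what makes "until further decomposition is no longer possible" terminate: every concept is expanded at most once, and the loop ends when the frontier holds only concepts that return no children.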