The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
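To make the idea of "decomposing neurons into features" concrete, here is a minimal sketch of one common recipe for it, a sparse autoencoder trained on a group of neuron activations. The dimensions, layer names, and L1 penalty below are illustrative assumptions, not the authors' exact method.

```python
# Minimal sketch: decompose a group of neuron activations into a larger,
# sparse set of candidate features with an overcomplete autoencoder.
# All sizes and coefficients here are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons: int = 512, n_features: int = 4096):
        super().__init__()
        # Encoder maps neuron activations to many candidate features.
        self.encoder = nn.Linear(n_neurons, n_features)
        # Decoder reconstructs the original activations from those features.
        self.decoder = nn.Linear(n_features, n_neurons)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def loss_fn(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # The reconstruction term keeps the features faithful to the neurons;
    # the L1 penalty pushes most features to zero on any given input,
    # nudging each feature toward firing on a narrow, interpretable pattern.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

The intuition is that many interpretable features are spread across (and superimposed on) uninterpretable neurons; an overcomplete, sparsity-penalized basis gives each feature its own direction to occupy.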
This is true not only at the model level but also at the individual level. Soon there will be such a thing as unhealthy and healthy prompting, a kind of psychological score for queries.
Anu Atluru • MEDIA AND MACHINES.
When a large language model ingests a sentence, it constructs what can be thought of as an “attention map.”
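The sketch below shows what such an "attention map" is in the usual transformer formulation: a matrix of scaled dot-product attention weights, where each row says how strongly one token attends to every other token in the sentence. The shapes and random vectors are placeholders for illustration, not any particular model's internals.

```python
# Illustrative sketch of an attention map: each row is one token's
# softmax-normalized weights over all tokens in the input.
import torch
import torch.nn.functional as F

def attention_map(queries: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    # queries, keys: (seq_len, d_model) projections of the token representations
    d_model = queries.shape[-1]
    scores = queries @ keys.T / d_model ** 0.5   # (seq_len, seq_len) similarity scores
    return F.softmax(scores, dim=-1)             # each row sums to 1

seq_len, d_model = 5, 16
tokens = torch.randn(seq_len, d_model)           # stand-in token representations
weights = attention_map(tokens, tokens)          # row i: token i's attention over the sentence
print(weights.shape)                             # torch.Size([5, 5])
```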