
On the Biology of a Large Language Model
transformer-circuits.pub
Anthropic can now track the bizarre inner workings of a large language model
Will Douglas Heaven, technologyreview.com
We currently don't understand how to make sense of the neural activity within language models. Today, we are sharing improved methods for finding a large number of "features"—patterns of activity that we hope are human interpretable.
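
The "features" in this line of work are found by dictionary-learning methods, in Anthropic's case sparse autoencoders trained on a model's internal activations: an encoder expands activations into many candidate features, a decoder reconstructs the original activations, and a sparsity penalty forces most features to stay silent. A minimal sketch of the idea follows; names such as SparseAutoencoder, d_features, and l1_coeff are illustrative assumptions, not Anthropic's actual code.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Sketch of a sparse autoencoder over model activations
        (illustrative, not Anthropic's implementation)."""

        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            # Overcomplete: d_features is typically much larger than d_model.
            self.encoder = nn.Linear(d_model, d_features)
            self.decoder = nn.Linear(d_features, d_model)

        def forward(self, activations: torch.Tensor):
            # ReLU keeps feature activations non-negative, so the L1
            # penalty below can push most of them to exactly zero.
            features = torch.relu(self.encoder(activations))
            reconstruction = self.decoder(features)
            return features, reconstruction

    def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
        # Reconstruction error plus a sparsity penalty: the autoencoder
        # must explain the activations using only a few active features.
        mse = (reconstruction - activations).pow(2).mean()
        return mse + l1_coeff * features.abs().mean()

    # Usage on a hypothetical batch of residual-stream activations.
    sae = SparseAutoencoder(d_model=512, d_features=4096)
    acts = torch.randn(32, 512)
    feats, recon = sae(acts)
    loss = sae_loss(acts, feats, recon)
    loss.backward()

The L1 term drives most feature activations to zero on any given input, so each learned feature fires only on a narrow slice of inputs, which is what makes the resulting "patterns of activity" candidates for human interpretation.
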
Powerful AI systems can help us interpret the neurons of weaker AI systems. And those interpretability insights often tell us a bit about how models work. And when they tell us how models work, they often suggest ways that those models could be better or more efficient. —Dario Amodei, Anthropic