Our interpretability team recently released research that traced the thoughts of a large language model.
Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.