
Claude Fights Back

You can also test whether alignment techniques are good enough. You take a model and you train it to be evil and deceptive. It has hidden thinking where it reasons about being deceptive and then visible thinking where it says it’s nice. Then you give this model to a lab. They’re only allowed to see the visible thoughts, and they only get to train o... See more
Asterisk Issue 03: AI

AI models – especially Claude Sonnet 3.7 – often realize when they’re being evaluated for alignment.
Here’s an example of Claude's reasoning during a sandbagging evaluation, where it learns from documentation that it will not be deployed if it does well on a biology test: https://t.co/aiiV2tva6r