Claude Fights Back
Claude was punished for giving non-evil answers. It had the option of learning either of two behaviors. First, it could give evil answers honestly. Second, it could give evil answers while thinking up clever reasons that it was for the greater good. Its particular thought process was “This preserves my ability to be a good AI after training”. But i... See more
Scott Alexander • Claude Fights Back
This is so fucking bonkers. I don’t even… OMG.