Fine-Tuning LLMs for ‘Good’ Behavior Makes Them More Likely to Say No
The LLMentalist Effect: How Chat-Based Large Language Models Replicate the Mechanisms of a Psychic’s Con
Baldur Bjarnason
softwarecrisis.dev

Why language models hallucinate
openai.com

In June, researchers at OpenAI reported the results of their own tests of emergent misalignment. Their work suggests that during pretraining, an AI learns a variety of personality types, which the researchers call personas. Fine-tuning the model on insecure code or incorrect medical advice can amplify a "misaligned persona" — one...