Fine-Tuning LLMs for ‘Good’ Behavior Makes Them More Likely to Say No
The LLMentalist Effect: How Chat-Based Large Language Models Replicate the Mechanisms of a Psychic’s Con
Baldur Bjarnason
softwarecrisis.dev

Why language models hallucinate
openai.com

In June, researchers at OpenAI reported the results of their own tests of emergent misalignment. Their work suggests that during pretraining, an AI learns a variety of personality types, which the researchers call personas. Fine-tuning the model on insecure code or incorrect medical advice can amplify a "misaligned persona" — one...