Emerging reasoning with reinforcement learning

The way chain-of-thought used to be taught was supervised fine-tuning (SFT). During training, you had to rate every sentence of reasoning the model wrote, many times over, to nudge it toward reasoning correctly.

But this approach to teaching chain-of-thought doesn’t do that. In this post, they take a small (7B) model that already knows math. Then they give...

Emerging reasoning with reinforcement learning | Hacker News
