HOW IS THIS ALPHA EVEN PUBLIC? 10x SEARCH DEPTH VIA GRPO
The intuition has always been that scaling agentic search is a compute problem. It’s not. It’s a "stability-of-objective" problem. Most 8B models suffer from "horizon collapse" - they are mathematically "anxious" to terminate the search loop because their training...
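The post is truncated, but the GRPO it refers to (Group Relative Policy Optimization) has a simple core: instead of a learned value baseline, each rollout's reward is normalized against the other rollouts sampled for the same prompt. A minimal sketch of that group-relative advantage step, with hypothetical reward values for illustration:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: normalize each
    rollout's reward by the mean and std of its own sampling group,
    so no separate value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of 4 rollouts for one prompt; only the first
# one succeeded (reward 1), the rest failed (reward 0).
adv = grpo_advantages([1.0, 0.0, 0.0, 0.0])
```

The successful rollout gets a positive advantage and the failures get negative ones, and the group always sums to (approximately) zero, which is what keeps the update centered without a critic.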
Great paper on why RL actually works for LLM reasoning.
Apparently, "aha moments" during training aren't random. They're markers of something deeper.
Researchers analyzed RL training dynamics across eight models, including Qwen, LLaMA, and vision-language models. The findings challenge how...
What is Q*? My first approach (1/2)
Foreword: Q* has not yet been published or made publicly available; there are no papers on it, and OpenAI is holding back information about it (Sam Altman: "We are not ready to talk about that," https://t.co/ah00i87Kbu, at 2:45). Since the first hints, the community has been...