The Q* hypothesis: Tree-of-thoughts reasoning, process reward models, and supercharging synthetic data
What is Q*? My first approach (1/2)
Foreword: Q* has not yet been published or made publicly available; there are no papers on it, and OpenAI is holding back information about it (Sam Altman: "We are not ready to talk about that", https://t.co/ah00i87Kbu, at minute 2:45). Since the first hints, the community has been trying to figure out what Q* might be. In this article, I compile all the information I could find to paint a possible picture of Q*, including all plausible hypotheses and assumptions. The theses are backed up with various papers, articles and conclusions. Nevertheless, everything written here should be read with a certain "grain of salt".
It is also important to me to make this text as comprehensible and beginner-friendly as possible. The trick is to put complex issues into simple words. Furthermore, the text was written by hand and not by ChatGPT or another AI model. I therefore apologize if the text is sometimes too simple.
The beginning
About half a year ago, The Information and Reuters learned from OpenAI employees that a scientific breakthrough had been achieved at the research facility (although the first rumors had already emerged in December 2023) [1]. For the first time, a model had succeeded in learning by itself with the help of a new algorithm and in acquiring logical (mathematical) skills without external influence. This is something that transformer architectures normally cannot do, since their results are probability outputs (recall 2017 and "Attention Is All You Need" [2]).
Logical thinking and self-learning are widely considered prerequisites for artificial general intelligence (although there is still no standard definition of AGI; Google offers first approaches to one [3]). An AGI therefore requires absolute correctness in its output in order to be applied to all (human) processes, something that OpenAI itself repeatedly emphasizes in blog posts: "In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still produce logical errors, often called _hallucinations_. Mitigating hallucinations is a critical step towards building aligned AGI." Sam Altman put it even more concretely in a video: https://t.co/RAOKK97sY9. In addition, AGI requires general knowledge at expert level in all areas of knowledge (one must not forget the "G" in AGI: generalization). In this respect, the breakthrough at OpenAI reported by Reuters and The Information seemed to be a key step on the road to AGI, which may have frightened many.
Some content creators such as "AI Explained" (https://t.co/Ny21gxwLpk) and "Matthew Berman" (https://t.co/8ihpZdh5ij) have made excellent videos about this, which I can highly recommend.
Reuters wrote at the time:
"Nov 22 (Reuters) - Ahead of OpenAI CEO [Sam Altman’s four days in exile](https://t.co/NTl25hG0eO), several staff researchers wrote a letter to the board of directors warning of a powerful artificial intelligence discovery that they said could threaten humanity, two people familiar with the matter told Reuters. (...) Some at OpenAI believe Q* (pronounced Q-Star) could be a breakthrough in the startup's search for what's known as artificial general intelligence (AGI), one of the people told Reuters. OpenAI defines AGI as autonomous systems that surpass humans in most economically valuable tasks. (...) Given vast computing resources, the new model was able to solve certain mathematical problems, the person said on condition of anonymity because the individual was not authorized to speak on behalf of the company. Though only performing math on the level of grade-school students, acing such tests made researchers very optimistic about Q*’s future success, the source said. (...) Researchers consider math to be a frontier of generative AI development. Currently, generative AI is good at writing and language translation by statistically predicting the next word, and answers to the same question can vary widely. But conquering the ability to do math — where there is only one right answer — implies AI would have greater reasoning capabilities resembling human intelligence. This could be applied to novel scientific research, for instance, AI researchers believe. (...) "Four times now in the history of OpenAI, the most recent time was just in the last couple weeks, I've gotten to be in the room, when we sort of push the veil of ignorance back and the frontier of discovery forward, and getting to do that is the professional honor of a lifetime," he said at the Asia-Pacific Economic Cooperation summit."
The Information wrote at the time [4]:
"OpenAI Made an AI Breakthrough Before Altman Firing, Stoking Excitement and Concern One day before he was fired by OpenAI’s board last week, Sam Altman alluded to a recent technical advance the company had made that allowed it to “push the veil of ignorance back and the frontier of discovery forward.” (...) But some OpenAI employees believe Altman’s comments referred to an innovation by the company’s researchers earlier this year that would allow them to develop far more powerful artificial intelligence models, a person familiar with the matter said. (...) Jakub Pachocki and Szymon Sidor, two top researchers, used Sutskever’s work to build a model called Q* (pronounced “Q-Star”) that was able to solve math problems that it hadn’t seen before, an important technical milestone. A demo of the model circulated within OpenAI in recent weeks, and the pace of development alarmed some researchers focused on AI safety. The work of Sutskever’s team, which has not previously been reported, and the concern inside the organization, suggest that tensions within OpenAI about the pace of its work will continue even after Altman was reinstated as CEO Tuesday night, and highlights a potential divide among executives. (...) . Sutskever’s breakthrough allowed OpenAI to overcome limitations on obtaining enough high-quality data to train new models, according to the person with knowledge, a major obstacle for developing next-generation models. The research involved using computer-generated, rather than real-world, data like text or images pulled from the internet to train new models."
(Small note: the great fear and concern about Q* also stemmed from the thought that if Q* could already teach itself math without prior training (only at elementary school level at first, but certainly more with enough compute), then, given exponential progress, all data encryption could be at risk in the foreseeable future. What would stop an AI that teaches itself math from breaking encryption if you just give the model enough time and compute?)
Basically, Q* can be described as a method (an algorithm) for bringing language models closer to human thinking and human reasoning. It is a way of mapping step-by-step, iterative thinking and thinking in sub-processes onto large language models.
It is based on the systems thinking of Nobel Prize winner Daniel Kahneman. According to Kahneman, there are two modes of thought, two systems in which people think: System 1 and System 2. System 1 thinking is intuitive thinking, thinking that takes place automatically and effortlessly. At the moment, large language models can only think in a System 1 way, outputting a probability-based result derived from their training data. This corresponds to intuitive thinking.
System 2 thinking, by contrast, is complex thinking. It is the kind of thinking that involves intermediate steps and the division of a problem into sub-processes. If we want to solve difficult mathematical problems, we cannot arrive at the result intuitively but have to approach the solution step by step. This is what we need to teach language models: to approach results slowly, process by process, and iteratively.
Iterative solution finding through process subdivision
So how do you teach a language model to think in System 2? To start with, I'll briefly quote a short summary of Q* that I wrote a few weeks ago, as a first approach [5]:
"Especially since Reuters published an article today in which they spoke to OpenAI employees who do not want to be named, Q* also seems to have come true. Q* cannot be overstated in its importance, Q* is the breakthrough par excellence in that it eradicates the biggest problems of LLMs. Instead of laboriously training LLMs using RLHF, a model with Q* is able to learn on its own. One article showed that a GPT with Q* taught itself math (at primary school level) without any external intervention: "Reuters and The Information both report that researchers had come up with a new way to make powerful AI systems and had created a new model, called Q* (pronounced Q star), that was able to perform grade-school-level math." [3]. This presumably led to two processes: Q-learning and A*-search. Q-learning teaches the model to learn by itself, something that Nobel Prize winner Daniel Kahnemann calls "system 2 thinking". In other words: thinking: "System 1 and System 2 thinking describes two distinct modes of cognitive processing introduced by Daniel Kahneman in his book Thinking, Fast and Slow. System 1 is fast, automatic, and intuitive, operating with little to no effort. This mode of thinking allows us to make quick decisions and judgments based on patterns and experiences. In contrast, System 2 is slow, deliberate, and conscious, requiring intentional effort. This type of thinking is used for complex problem-solving and analytical tasks where more thought and consideration are necessary."[4] By dividing thinking into sub-processes, the model is given a sense of security. Solution finding using A* is a search-aglorithm (similar to the Monte Carlo tree search algorithm) that finds the best solution: „A* (pronounced "A-star") is a graph traversal and pathfinding algorithm, which is used in many fields of computer science due to its completeness, optimality, and optimal efficiency. Given a weighted graph, a source node and a goal node, the algorithm finds the shortest path (with respect to the given weights) from source to goal.“ [4] The combination of Q-learning and A*-search teaches the model to think and find solutions independently and to correct itself. This means that the hallucination will stop and correct solutions will be output as results because the solution is not simply taken from the training data and is based on probability. This means that the biggest problem of LLMs, their inaccuracy, can be avoided and start to develop an accuracy to make them accessible for completely different academic sciences. However, Q*- is likely to be very computationally intensive. And this is where gpt-mini comes into play: my hypothesis is that OpenAI will use gpt-mini to reduce energy requirements and save compute, and perhaps also incorporate a Q* variant into a small model. This is just speculation, but the important thing is that gpt-mini creates the conditions for Q* to become a reality."
System 2 thinking is explained in more detail in a research paper by OpenAI ("Let's Verify Step by Step" [6]), published by Ilya Sutskever and Jan Leike (ex-OpenAI), among others. Similarly, the idea is already used today in prompts by telling a model to "think step by step" or to "divide the task into subsections", which is of course only a superficial attempt to apply System 2 thinking, since the model is not designed for this in terms of its architecture ("Prompting techniques like "take a deep breath" and "think step by step" are now expanding into advanced methods for inference with parallel computation and heuristics (some fundamentals of search)." https://t.co/9mnVFmd5tR).
Part of that paper and its conclusions is the so-called "Process Reward Model" (PRM, see below). In principle, this is an evaluation of the individual process steps: instead of scoring the result as a whole, a score is assigned to each individual reasoning step.
"This allows finer-tuned generation with reasoning problems, by sampling over the maximum average reward or other metrics, instead of just relying on one score (standard RMs are called outcome RMs in this literature). Using [Best-of-N sampling](https://t.co/7TGcWC4GvI), essentially generating a bunch of times and using the one that scored the highest by the reward model (the inference time cousin of Rejection Sampling popularized with Llama 2), PRMs outperform standard RMs on reasoning tasks." (ebd.)
This method is also supported by the so-called "Tree of Thoughts":
The paper "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" presents a new framework called Tree of Thoughts (ToT), which is based on large language models and improves their problem-solving capabilities through structured and planned decision-making processes. In contrast to the traditional Chain of Thought (CoT) method, which relies on sequential decisions, ToT enables the simultaneous exploration of multiple thoughts and the evaluation of these paths to achieve more effective problem solving. [7].
The ToT framework consists of four main components:
1. Thought decomposition: breaking the problem down into smaller, manageable steps (thoughts).
2. Thought generation: generating proposals for the next thought step.
3. State evaluation: heuristically evaluating the progress of the different thought paths.
4. Search algorithm: systematically exploring the thought tree using algorithms such as breadth-first search (BFS) or depth-first search (DFS).
In experiments with tasks such as the "Game of 24", creative writing and mini crossword puzzles, ToT showed significant improvements over conventional methods. For example, ToT achieved a success rate of 74% in the "Game of 24", while the CoT method only achieved 4%.
So here, too, we see that planned, structured, step-by-step decision-making is essential for the accuracy of the solution.
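To illustrate how the four ToT components interact, here is a compact breadth-first sketch. It is a generic illustration rather than the authors' reference implementation; `propose_thoughts` and `evaluate_state` are hypothetical stand-ins for LLM calls that generate candidate next steps and heuristically score partial solutions.

```python
from typing import Callable, List, Tuple

def tree_of_thoughts_bfs(
    problem: str,
    propose_thoughts: Callable[[str, List[str]], List[str]],  # thought generation (LLM call)
    evaluate_state: Callable[[str, List[str]], float],        # heuristic state evaluation (LLM call)
    max_depth: int = 3,    # thought decomposition: number of intermediate steps
    beam_width: int = 5,   # how many partial solutions survive each level
) -> List[str]:
    """Breadth-first search over partial chains of thought.

    A state is a list of thoughts. At every level, each surviving state is
    expanded with proposed next thoughts, all successors are scored by the
    evaluator, and only the `beam_width` best are kept.
    """
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_depth):
        successors: List[Tuple[float, List[str]]] = []
        for _, thoughts in frontier:
            for candidate in propose_thoughts(problem, thoughts):
                new_state = thoughts + [candidate]
                successors.append((evaluate_state(problem, new_state), new_state))
        if not successors:
            break
        successors.sort(key=lambda s: s[0], reverse=True)
        frontier = successors[:beam_width]
    return frontier[0][1] if frontier else []
```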
"The innovations that make this click are the chunking of reasoning steps and prompting a model to create new reasoning steps. ToT seems like the first "recursive" prompting technique for improving inference performance, which sounds remarkably close to the AI Safety concern of recursively self-improving models (though I am not an expert)". https://t.co/9mnVFmd5tR)
"Large language models are capable of solving tasks that require complex multistep reasoning by generating solutions in a step-by-step chain-of-thought format (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022). However, even stateof-the-art models are prone to producing falsehoods — they exhibit a tendency to invent facts in moments of uncertainty (Bubeck et al. , 2023). These hallucinations (Maynez et al., 2020) are particularly problematic in domains that require multi-step reasoning, since a single logical error is enough to derail a much larger solution. Detecting and mitigating hallucinations is essential to improve reasoning capabilities."
The paper states that process-supervised models perform better at solving complex mathematical problems. Process supervision evaluates each intermediate step, similar to the evaluation of each node expansion in A*. The "chain of thought" under process supervision resembles Kahneman's System 2 thinking in that it represents deliberate reasoning in which each logical step is evaluated.
We can therefore see that System 2 thinking, i.e. thinking in process steps, not only leads to more precise results but is also an essential component of solving complex tasks. There are various ways to implement it. PRM may well be part of the solution-finding process in Q*, as it originates from OpenAI's own research, and ToT presumably plays a role as well. Unfortunately, a more precise classification is not yet possible and cannot be derived from the available sources.
Q* is probably a combination of Q-learning and A* search.
OpenAI's Q* algorithm is considered a breakthrough in AI research, particularly in the development of AI systems with human-like reasoning capabilities. Q* combines elements of Q-learning and A* search, which improves goal-directed thinking and solution finding. The algorithm reportedly shows impressive capabilities in solving complex mathematical problems (without prior training data) and symbolizes a step towards artificial general intelligence (AGI). It is presumably a fusion of Q-learning and A* search (as others also suggest: https://t.co/9mnVFmd5tR), based on the ideas of self-play and look-ahead planning.
"Self-play is the idea that an agent can improve its gameplay by playing against slightly different versions of itself because it’ll progressively encounter more challenging situations. In the space of LLMs, it is almost certain that the largest portion of self-play will look like AI Feedback rather than competitive processes."
"Look-ahead planning is the idea of using a model of the world to reason into the future and produce better actions or outputs. The two variants are based on [Model Predictive Control](https://t.co/eKCkMecEfB.) (MPC), which is often used on continuous states, and [Monte-Carlo Tree Search](https://t.co/JeNjjIIlF0) (MCTS), which works with discrete actions and states."
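As a toy illustration of look-ahead planning with a world model (a deliberately simplified stand-in for MPC or MCTS, and not a claim about how Q* works), the sketch below rates each candidate action by simulating a few random rollouts through a given transition-and-reward model and picks the action with the best average return. `actions` and `simulate` are hypothetical interfaces to such a model.

```python
import random
from typing import Callable, List, Tuple, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

def lookahead_plan(
    state: State,
    actions: Callable[[State], List[Action]],                  # available actions in a state
    simulate: Callable[[State, Action], Tuple[State, float]],  # world model: (next_state, reward)
    horizon: int = 5,     # how far into the future each rollout looks
    rollouts: int = 20,   # rollouts per candidate action
) -> Action:
    """Choose the action whose random rollouts yield the highest average return."""
    def rollout_value(s: State, a: Action) -> float:
        total = 0.0
        s, r = simulate(s, a)
        total += r
        for _ in range(horizon - 1):
            available = actions(s)
            if not available:
                break
            s, r = simulate(s, random.choice(available))
            total += r
        return total

    return max(
        actions(state),
        key=lambda a: sum(rollout_value(state, a) for _ in range(rollouts)) / rollouts,
    )
```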
What is Q-learning? Different theories
Theory 1:
"[Q-learning](https://t.co/trAIJTiB0u) is a type of reinforcement learning, a method where AI learns to make decisions by trial and error. In Q-learning, an agent learns to make decisions by estimating the “quality” of action-state combinations. **The difference between this approach and OpenAI’s current approach—known as Reinforcement Learning Through Human Feedback or [RLHF](https://t.co/2yQZFOaZpo)—is that it does not rely on human interaction and does everything on its own**. Imagine a robot navigating a maze. With Q-learning, it learns to find the quickest path to the exit by trying different routes, receiving positive rewards set by its own design when it moves closer to the exit and negative rewards when it hits a dead end. Over time, through trial and error, the robot develops a strategy (a "Q-table") that tells it the best action to take from each position in the maze. This process is autonomous, relying on the robot's interactions with its environment. (...) **In Q-learning, Q* represents the desired state in which an agent knows exactly the best action to take in every state to maximize its total expected reward over time**. In math terms, it satisfies the [Bellman Equation](https://t.co/wGR99gUhQl)."
Theory 2: the Q* algorithm from MRPPS:
"One way to explain the process is to consider the fictitious detective Sherlock Holmes trying to solve a complex case. He gathers clues (semantic information) and connects them logically (syntactic information) to reach a conclusion. The Q* algorithm works similarly in AI, combining semantic and syntactic information to navigate complex problem-solving processes.
This would imply that OpenAI is one step closer to having a model capable of understanding its reality beyond mere text prompts and more in line with the fictional J.A.R.V.I.S (for GenZers) or the Bat Computer (for boomers).
So, while Q-learning is about teaching AI to learn from interaction with its environment, the Q* algorithm is more about improving AI's deductive capabilities. Understanding these distinctions is key to appreciating the potential implications of OpenAI's "Q*". Both hold immense potential in advancing AI, but their applications and implications vary significantly."
Of course, we don't know what might be relevant in Q*. However, I clearly lean towards theory 1 because it is more consistent with the papers already published by OpenAI.
What is A* search?
A* search is a method for finding the correct path between a start state and a goal state. It uses a heuristic function to estimate the remaining cost and find the best path, and it guarantees that the solution found is optimal if the heuristic is admissible (i.e. never overestimates the cost). In short, given an admissible heuristic the algorithm finds the shortest or cheapest solution, and it is flexible, adaptable and robust across different problems. A* is similar to Monte Carlo Tree Search (MCTS) in some respects, but it is fundamentally different, and arguably better suited for optimal path finding, because it uses heuristics instead of random simulations: A* systematically searches for the best path, while MCTS uses random simulations for decision making ("A* Search Without Expansions: Learning Heuristic Functions with Deep Q-Networks" [8]).
Q* uses the principle of A* to find the best path by combining path costs and heuristic values. By integrating DQNs, Q* can compute the costs and heuristic values of all child nodes in a single pass, which significantly reduces the computational effort. The stepwise computation and validation in Q* is similar to the process supervision used in STaR to minimize hallucinations (on STaR, see below).
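For reference, a generic A* loop looks like the sketch below. It is a textbook version, not the code from [8]; in that paper's Q* search the heuristic `h` would be produced by a deep Q-network for all child nodes at once rather than by a hand-crafted function.

```python
import heapq
import itertools
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple

def a_star(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    neighbors: Callable[[Hashable], Iterable[Tuple[Hashable, float]]],  # (next_state, edge_cost)
    h: Callable[[Hashable], float],  # admissible heuristic: never overestimates remaining cost
) -> Optional[List[Hashable]]:
    """Return a cheapest path from start to a goal state, or None if unreachable."""
    g: Dict[Hashable, float] = {start: 0.0}   # best known cost from start
    parent: Dict[Hashable, Hashable] = {}
    tie = itertools.count()                   # tie-breaker so states never need to be compared
    frontier: List[Tuple[float, int, Hashable]] = [(h(start), next(tie), start)]
    while frontier:
        _, _, state = heapq.heappop(frontier)  # node with the smallest f = g + h
        if is_goal(state):
            path = [state]
            while state in parent:
                state = parent[state]
                path.append(state)
            return path[::-1]
        for nxt, cost in neighbors(state):
            new_g = g[state] + cost
            if new_g < g.get(nxt, float("inf")):
                g[nxt] = new_g
                parent[nxt] = state
                heapq.heappush(frontier, (new_g + h(nxt), next(tie), nxt))
    return None
```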
A Meta scientist summarized this on Twitter as follows:
"From my past experience on OpenGo (reproduction of AlphaZero), A* can be regarded as a deterministic version of MCTS with value (i.e., heuristic) function Q only. This should be suitable for tasks in which the state is easy to evaluate given the action, but the actions are much harder to predict given the state. Math problems seem to fit this scenario quite well." (https://t.co/ESiiGMkpT7... See more