We use this in our Tulu series of models, and you can elicit the same behaviors, where the model says things like "wait," and so on. But it's so late in the training process that this kind of reasoning expression is much lighter. Yeah. So there's essentially a gradation, and just how much of this RL training you put into it determines how the output looks.
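A minimal sketch of that "wait" trick, sometimes called budget forcing: appending "Wait," to a finished completion nudges the model to keep reasoning instead of stopping. Here `generate` is a hypothetical stand-in for any text-completion call; this illustrates eliciting longer reasoning at inference time, not the actual Tulu training recipe.

```python
# Sketch: eliciting extended reasoning by appending "Wait," to a
# completion and letting the model continue. `generate` is a
# hypothetical stand-in for a real model call, not a library API.

def generate(prompt: str) -> str:
    """Replace with a real model call; returns a completion string."""
    return " ...model output would appear here..."

def extend_reasoning(question: str, rounds: int = 2) -> str:
    # First pass: the model produces its initial chain of thought.
    transcript = question + "\n" + generate(question)
    for _ in range(rounds):
        # The literal token "Wait," pushes the model to re-examine
        # its answer rather than stop, producing more reasoning.
        transcript += "\nWait,"
        transcript += generate(transcript)
    return transcript
```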
It summarized the prompt as "humans, self-domesticated apes."
See how it just looks a little different? It looks like a normal output.
Is running in parallel actually search? Because I don't know if we have the full information on how o1 Pro works. I don't have enough information to confidently say that it is search. It is parallel samples. Yeah.
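Whatever o1 Pro actually does, "parallel samples plus a selection step" has a simple reference form: best-of-N sampling. The sketch below uses answer frequency (self-consistency) as the selection function; that choice is an assumption for illustration, not a claim about OpenAI's system, and `generate` is again a hypothetical model call.

```python
import re
from collections import Counter

def sample_n(prompt, generate, n=8):
    # Draw N independent completions. In production these calls would
    # run in parallel (threads, async, or batched decoding).
    return [generate(prompt) for _ in range(n)]

def extract_answer(completion):
    # Toy extraction: take the last number in the completion.
    nums = re.findall(r"-?\d+\.?\d*", completion)
    return nums[-1] if nums else None

def select_by_consistency(completions):
    # Self-consistency selection: keep a completion whose final answer
    # is the most common across samples. A learned verifier or reward
    # model could replace this rule; the real selection function for
    # o1 Pro is unknown.
    answers = [extract_answer(c) for c in completions]
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return completions[0]  # no parseable answers; fall back
    top, _ = counts.most_common(1)[0]
    return next(c for c, a in zip(completions, answers) if a == top)
```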
And it selects something. And we don't know what the selection function is. The reason why we're debating is because since o1 was announced, there's been a lot of interest in a technique called Monte Carlo tree search, which is where you will break down the chain of thought into intermediate steps. We haven't defined chain of thought.
Chain of thought comes from a paper from years ago, which introduced the idea of prompting a language model, which at the time was much less easy to use. You would say, "Let's think step by step," and it would induce the model to produce this bulleted list of steps. Chain of thought is now almost a default in models, where if you ask a math question, you don't need to tell it to think step by step.
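The original zero-shot version of this is literally just a prompt suffix, as in the sketch below. The phrasing "Let's think step by step" comes from the Kojima et al. (2022) paper; with current models it is usually redundant, since they reason step by step by default.

```python
# Zero-shot chain-of-thought prompting (Kojima et al., 2022):
# appending a single instruction induces step-by-step reasoning in
# models that would otherwise answer directly.

COT_SUFFIX = "\nLet's think step by step."

def cot_prompt(question: str) -> str:
    return question + COT_SUFFIX

print(cot_prompt(
    "A train travels 60 km in 45 minutes. "
    "What is its average speed in km/h?"
))
```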
And the idea with Monte Carlo tree search is that you would take an intermediate point in that chain, do some sort of expansion, spend more compute, and then select the right one. That's a very complex form of search that has been used in things like MuZero and AlphaZero, potentially. I know MuZero does this.
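For concreteness, here is a skeletal version of MCTS over reasoning steps: selection by UCT, expansion by proposing candidate next steps, evaluation, and backpropagation. `propose_steps` and `value` are hypothetical stand-ins (in a real system, a language model proposing next steps and a learned verifier scoring partial chains); none of this is a claim about how o1 or o1 Pro works.

```python
import math, random

def propose_steps(chain, k=3):
    """Stand-in: return k candidate next reasoning steps."""
    return [f"step-{len(chain)}-{i}" for i in range(k)]

def value(chain):
    """Stand-in: score a (partial) chain in [0, 1]."""
    return random.random()

class Node:
    def __init__(self, chain, parent=None):
        self.chain, self.parent = chain, parent
        self.children, self.visits, self.total = [], 0, 0.0

    def uct(self, c=1.4):
        # Standard UCT: exploit high average value, explore rarely
        # visited children; unvisited children go first.
        if self.visits == 0:
            return float("inf")
        return (self.total / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def search(question, iters=100, depth=4):
    root = Node([question])
    for _ in range(iters):
        node = root
        # Selection: descend by UCT while the node has children
        # and the chain is not yet at full depth.
        while node.children and len(node.chain) < depth + 1:
            node = max(node.children, key=Node.uct)
        # Expansion: add candidate next steps to a non-terminal leaf.
        if len(node.chain) < depth + 1:
            node.children = [Node(node.chain + [s], node)
                             for s in propose_steps(node.chain)]
            node = random.choice(node.children)
        # Evaluation + backpropagation of the score to the root.
        v = value(node.chain)
        while node:
            node.visits += 1
            node.total += v
            node = node.parent
    # Return the most-visited first step's chain.
    return max(root.children, key=lambda n: n.visits).chain

print(search("What is 12 * 13?"))
```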