Dylan Patel
For math, it might not be as effective, but if you take a 1 billion parameter model, so something 600 times smaller than DeepSeek, you can boost its grade school math scores very directly with a small amount of this training. So that's not to say that this is coming soon.
Setting up the verification domains is extremely hard and there's a lot of nuance to it, but there are some basic things we've seen before where it's at least plausible that a verifiable domain exists and there's a chance that this works.
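A minimal sketch of what a verifiable reward for one such domain, grade-school math, can look like. This is only an illustration of the idea, not DeepSeek's or OpenAI's actual implementation; the `extract_final_answer` and `math_reward` helpers are hypothetical names.

```python
# Minimal sketch of a verifiable reward for grade-school math (illustrative only).
# The "verifier" is just an answer match against a known reference, which is what
# makes domains like math and code easier to set up than open-ended writing.
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last number out of the model's completion."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None

def math_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    predicted = extract_final_answer(completion)
    return 1.0 if predicted is not None and predicted == gold_answer.strip() else 0.0

# Example with a GSM8K-style reference answer of "42":
print(math_reward("Adding them up gives 40 + 2 = 42. The answer is 42.", "42"))  # 1.0
print(math_reward("I think the answer is 41.", "42"))                             # 0.0
```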
Something I would say about these reasoning models is we've talked a lot about reasoning training on math and code. And what is done is that you take the base model, trained on the internet, that we've talked about a lot, and you do this large-scale reasoning training with reinforcement learning.
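A rough sketch of what one step of that reasoning RL loop can look like, under the assumption of a hypothetical `policy` object with `generate` and `update` methods and a verifiable `reward_fn` like the math checker above. The group-relative baseline mirrors the GRPO-style idea of comparing several samples for the same prompt rather than training a separate value model; this is not the actual DeepSeek-R1 training code.

```python
# Illustrative sketch of one large-scale reasoning RL step (not real training code).
import statistics

def reasoning_rl_step(policy, reward_fn, prompt, gold_answer, group_size=8):
    # 1. Sample a group of chain-of-thought completions for one prompt.
    completions = [policy.generate(prompt) for _ in range(group_size)]

    # 2. Score each completion with the verifiable reward
    #    (e.g. 1.0 if the final answer matches the reference).
    rewards = [reward_fn(c, gold_answer) for c in completions]

    # 3. Group-relative advantage: reward minus the group mean, so correct
    #    answers are reinforced relative to the model's own other attempts.
    baseline = statistics.mean(rewards)
    advantages = [r - baseline for r in rewards]

    # 4. Policy-gradient-style update on the sampled completions.
    policy.update(prompt, completions, advantages)
    return baseline  # fraction of the group that was verified correct
```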
And then what DeepSeek detailed in this R1 paper, which for me is one of the big open questions, how do you do this, is that they did... reasoning-heavy but very standard post-training techniques after the large-scale reasoning RL.
So they did the same things with a form of instruction tuning through rejection sampling, which is essentially heavily filtered instruction tuning with some reward models. And then they did this RLHF, but they made it math-heavy. So some of this transfers. We looked at this philosophical example early on, and one of the big open questions is: how much does this transfer?
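For the rejection-sampling piece, a sketch of what "heavily filtered instruction tuning" can mean in practice: sample several responses per prompt, keep only the ones a reward model scores highly, and fine-tune on the survivors. Names like `policy.generate` and `reward_model.score` are placeholders, not a real API.

```python
# Sketch of rejection sampling as heavily filtered instruction tuning (illustrative).
def build_rejection_sampled_dataset(policy, reward_model, prompts, n_samples=16, keep_top=1):
    dataset = []
    for prompt in prompts:
        # Sample several candidate responses for each prompt.
        candidates = [policy.generate(prompt) for _ in range(n_samples)]
        # Rank them with a reward model (or a verifier, for math/code prompts).
        scored = sorted(candidates, key=lambda c: reward_model.score(prompt, c), reverse=True)
        # Keep only the highest-scoring completions as supervised fine-tuning targets.
        for completion in scored[:keep_top]:
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset  # then run standard instruction tuning on this filtered dataset
```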
If we bring in domains after the reasoning training, are all the models going to become eloquent writers by reasoning? Is this philosophy stuff going to open up? We don't know, in the research, how much this will transfer. There are other things about how we can make soft verifiers and things like this. But there is more training after reasoning, which makes it easier to use these reasoning models.
And that's what we're using right now. So the models we're going to talk about, o3-mini and o1, these have gone through these extra techniques that are designed for human preferences after being trained to elicit reasoning.