Lennart Heim
We train these big pre-trained models, but now we do another thing.
We make these models reason.
I think the best way to think about it is that nobody should be surprised.
If I give you an hour to think about something, you come up with a better solution than if I just get your off-the-cuff answer, right?
Yeah.
And it's basically, well, this was always the case, right?
Like a couple of years ago, somebody said, oh, if I prompt the model and tell it to think step-by-step, it produces a better result.
All I did was tell it to think step-by-step.
Again, not surprising.
Every teacher knows this.
If you have a good prompt, your students produce better results.
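To make the trick concrete, here is a minimal sketch of the step-by-step prompt using the OpenAI Python client. The model name, the question, and the exact phrasing are illustrative assumptions, not something from this conversation; the only idea taken from it is appending a "think step by step" instruction.

```python
# Minimal sketch: the same question asked twice, once plainly and once
# with a step-by-step instruction appended. Model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

# Plain prompt: the model answers off the cuff.
plain = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any chat model would do
    messages=[{"role": "user", "content": question}],
)

# Same question, but the prompt asks the model to reason first.
stepwise = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": question + " Let's think step by step before answering.",
    }],
)

print("Off the cuff:", plain.choices[0].message.content)
print("Step by step:", stepwise.choices[0].message.content)
```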
What we basically did with o1 and this whole new test-time paradigm is we helped these models to reason.
We trained the model to reason.
So we have a big pre-trained model; again, we spent millions on this.
Then we do another step on it.
We can call it post-training.
Some people might even call it mid-training now, where we do some type of reinforcement learning on reasoning.
This particularly works well for coding and math.
Why is that?
There's a bit of ground truth there: you can actually verify whether the answer is correct.
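To illustrate that point about ground truth, here is a minimal sketch of a verifiable reward of the kind used when doing reinforcement learning on math. The answer format and function names are assumptions for illustration, not a description of any lab's actual training setup.

```python
# Sketch of a verifiable reward: for math (or code with tests), a sampled
# reasoning trace can be graded automatically against a known answer.

def extract_final_answer(model_output: str) -> str:
    """Assume the model ends its reasoning with a line 'Answer: <value>'."""
    for line in reversed(model_output.strip().splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return ""

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer matches the ground truth."""
    return 1.0 if extract_final_answer(model_output) == ground_truth else 0.0

# Example: grading one sampled trace against the known correct answer.
trace = "60 km in 45 min is 60 / 0.75 km per hour.\nAnswer: 80"
print(verifiable_reward(trace, "80"))  # -> 1.0
```

The design point is that the reward needs no human judgment: correctness is checkable, which is exactly why math and coding are the domains where this kind of reinforcement learning works well.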