Aman Sanger
Podcast Appearances
There's also a question of what level of intelligence you need to determine whether the thing is too hard for the 4o-level model. Maybe you need the o1-level model. It's really unclear.
Well, it's weird, because with test-time compute there's a whole training strategy needed to get test-time compute to work. And the other really weird thing about this is that no one outside of the big labs, and maybe even just OpenAI, really knows how it works. There have been some really interesting papers that show hints of what they might be doing.
And so perhaps they're doing something with tree search using process reward models. But I think the issue is we don't quite know exactly what it looks like.
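Purely as speculation in the spirit of the above, here is a minimal sketch of what tree search guided by a process reward model could look like. Every function and name here (generate_next_steps, prm_score) is a hypothetical placeholder, not a description of what OpenAI or any other lab actually does.

```python
import heapq
import random

def generate_next_steps(chain, k=3):
    # Hypothetical: in a real system an LLM would propose k candidate next reasoning steps.
    return [f"step-{random.randint(0, 99)}" for _ in range(k)]

def prm_score(chain):
    # Hypothetical: a process reward model would score how promising the partial chain looks.
    return random.random()

def tree_search(problem, max_depth=5, beam_width=3):
    # Best-first search: keep the most promising partial chains (by PRM score) on a heap.
    frontier = [(-prm_score([problem]), [problem])]
    while frontier:
        neg_score, chain = heapq.heappop(frontier)
        if len(chain) > max_depth:
            return chain  # most promising complete chain of thought found
        for step in generate_next_steps(chain, k=beam_width):
            new_chain = chain + [step]
            heapq.heappush(frontier, (-prm_score(new_chain), new_chain))
    return None

print(tree_search("What is 2 + 2?"))
```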
So it would be hard to comment on exactly where it fits in. I would put it in post-training, but maybe the compute spent on getting test-time compute to work for a model is eventually going to dwarf pre-training.
It's fun to speculate.
Yeah. So one thing to do would be, I think, to train a process reward model. Maybe we can get into reward models, and outcome reward models versus process reward models. Outcome reward models are the traditional reward models people train for language modeling, and they just look at the final thing.
So if you're doing some math problem, it looks at that final thing you've done and assigns a grade to it: what's the reward for this outcome? Process reward models instead try to grade the chain of thought.
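To make that distinction concrete, here is a toy sketch with entirely made-up scoring rules: an outcome reward model assigns a single grade to the final answer, while a process reward model assigns a grade to each step of the chain of thought.

```python
def outcome_reward(final_answer: str) -> float:
    # Outcome reward model: one grade for the end result only (here, a trivial exact-match check).
    return 1.0 if final_answer.strip() == "4" else 0.0

def process_reward(chain_of_thought: list[str]) -> list[float]:
    # Process reward model: one grade per intermediate reasoning step (here, a trivial heuristic).
    return [1.0 if "4" in step else 0.5 for step in chain_of_thought]

chain = ["I need to add 2 and 2.", "2 + 2 = 4.", "So the answer is 4."]
print(outcome_reward("4"))     # single score for the outcome: 1.0
print(process_reward(chain))   # one score per step: [0.5, 1.0, 1.0]
```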