Aman Sanger
And so OpenAI had some preliminary paper on this, I think, last summer, where they used human labelers to build a pretty large dataset, several hundred thousand examples, of graded chains of thought. Ultimately, it feels like I haven't seen anything interesting in how people use process reward models beyond using them as a way of choosing between a bunch of samples.
So what people do in all these papers is they sample a bunch of outputs from the language model, use the process reward model to grade all those generations, maybe alongside some other heuristics, and then use that to choose the best answer. The really interesting thing that people think might work, and want to work, is tree search with these process reward models.
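A rough sketch of that best-of-n pattern, assuming hypothetical `generate` and `prm_step_score` callables (neither comes from any specific library; the per-step scores are multiplied, which is one common aggregation, so a single bad step sinks the whole chain):

```python
from typing import Callable, List

def best_of_n(
    generate: Callable[[str], str],               # samples one chain of thought
    prm_step_score: Callable[[str, str], float],  # scores one step in [0, 1]
    prompt: str,
    n: int = 16,
) -> str:
    """Sample n chains of thought and return the one the PRM rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]

    def chain_score(chain: str) -> float:
        # Treat each non-empty line as a reasoning step.
        steps = [s for s in chain.split("\n") if s.strip()]
        score = 1.0
        for step in steps:
            score *= prm_step_score(prompt, step)  # product of per-step scores
        return score

    return max(candidates, key=chain_score)
```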
Because if you really can grade every single step of the chain of thought, then you can branch out and explore multiple paths through the chain of thought, and use the process reward model to evaluate how good each branch you're taking is.
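One simple instance of this is step-level beam search. This is a minimal sketch, not any paper's exact method, and it assumes the same hypothetical `prm_step_score` plus an `expand` callable that proposes candidate next steps:

```python
import heapq
from typing import Callable, List, Tuple

def prm_beam_search(
    expand: Callable[[str, List[str]], List[str]],  # proposes candidate next steps
    prm_step_score: Callable[[str, str], float],    # scores one step in [0, 1]
    is_done: Callable[[List[str]], bool],           # detects a finished chain
    prompt: str,
    beam_width: int = 4,
    branch_factor: int = 4,
    max_depth: int = 12,
) -> List[str]:
    """Grow the chain of thought one step at a time, branching at each step
    and keeping only the branches the PRM rates highest."""
    # Each beam entry is (cumulative_score, steps_so_far).
    beams: List[Tuple[float, List[str]]] = [(1.0, [])]
    for _ in range(max_depth):
        expanded: List[Tuple[float, List[str]]] = []
        for score, steps in beams:
            if is_done(steps):
                expanded.append((score, steps))  # carry finished chains forward
                continue
            for step in expand(prompt, steps)[:branch_factor]:
                expanded.append((score * prm_step_score(prompt, step), steps + [step]))
        # Prune to the top-scoring branches.
        beams = heapq.nlargest(beam_width, expanded, key=lambda b: b[0])
        if all(is_done(steps) for _, steps in beams):
            break
    return max(beams, key=lambda b: b[0])[1]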
And the interesting work that I think has been done, or at least the interesting work that has been open sourced and that people talk about, is figuring out how to properly train these process reward models, maybe in a more automated way.
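One automated-labeling recipe that has been open sourced (e.g., Math-Shepherd, for math problems with checkable answers) labels each step by Monte Carlo rollouts: complete the solution from that step many times and record how often you reach a correct final answer. A hedged sketch, with `rollout` and `is_correct` as assumed callables:

```python
from typing import Callable, List, Tuple

def label_steps_by_rollout(
    rollout: Callable[[str, List[str]], str],  # completes the solution from a prefix
    is_correct: Callable[[str], bool],         # verifies the final answer
    prompt: str,
    steps: List[str],
    k: int = 8,
) -> List[Tuple[str, float]]:
    """A step's label is the fraction of k rollouts from that step that end
    correctly. No human graders needed, only a verifiable final answer."""
    labeled = []
    for i in range(len(steps)):
        prefix = steps[: i + 1]
        wins = sum(is_correct(rollout(prompt, prefix)) for _ in range(k))
        labeled.append((steps[i], wins / k))  # soft label in [0, 1]
    return labeled
```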
I could be wrong here, and I could be missing something, because I haven't seen anything that seems to work really well for using process reward models creatively to do tree search in code.
But it has these significant limitations. Even barring capabilities, it does not stream. And that means it's really, really painful to use for things where you want to supervise the output, and instead you're just waiting for a wall of text to show up. Also, it does feel like the early innings of test-time compute and search, where it's very much a v0.
And there are so many things that just don't feel quite right. And I suspect that, in parallel with people increasing the amount of pre-training data and the size of the models and finding tricks there, you'll now have this other thread of getting search to work better and better.