Aman Sanger
Yeah. Verification feels like it's best when you know for a fact that it's correct. Then it wouldn't be using a language model to verify; it would be using tests or formal systems. Or actually running the thing, too.
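The idea above, using execution against tests rather than a model as the verifier, can be sketched as a simple reward function. This is an illustrative toy, not anything from Cursor's actual stack: the candidate functions and test cases are made up for the example.

```python
# Sketch of "verification you know is correct": instead of asking a
# model to judge, run the candidate against tests. Reward is 1.0 only
# if every (inputs, expected) case passes; any failure or crash is 0.0.

def binary_reward(candidate_fn, test_cases):
    """Return 1.0 if candidate_fn passes all tests, else 0.0."""
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return 0.0
        except Exception:
            return 0.0
    return 1.0

# Two hypothetical model-generated candidates for "add two numbers":
def good(a, b):
    return a + b

def bad(a, b):
    return a - b

tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(binary_reward(good, tests))  # 1.0
print(binary_reward(bad, tests))   # 0.0
```

The appeal is exactly what's said above: the signal is ground truth, with no learned judge in the loop to be wrong or gamed.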
Yeah, yeah.
I think that's the category that is most likely to result in massive gains.
Yeah, so RLHF is when the reward model you use is trained from labels you've collected from humans giving feedback. I think this works if you have the ability to get a ton of human feedback for the kind of task you care about.
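The reward-model idea described here can be sketched in a few lines. This is a minimal illustrative version, assuming a linear reward over hand-picked features and the standard pairwise (Bradley-Terry) objective; real RLHF reward models are fine-tuned language models, and the toy "preference" below is invented for the example.

```python
# Minimal sketch of an RLHF-style reward model: humans label pairs
# (chosen, rejected), and we fit a reward r(x) = w . features(x) by
# maximizing sigmoid(r(chosen) - r(rejected)) over the labeled pairs.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(pairs, featurize, dim, lr=0.1, steps=500):
    """pairs: list of (chosen, rejected) human preference labels."""
    w = [0.0] * dim
    for _ in range(steps):
        for chosen, rejected in pairs:
            fc, fr = featurize(chosen), featurize(rejected)
            margin = sum(wi * (a - b) for wi, a, b in zip(w, fc, fr))
            # gradient of -log sigmoid(margin) with respect to w
            g = sigmoid(margin) - 1.0
            w = [wi - lr * g * (a - b) for wi, a, b in zip(w, fc, fr)]
    return w

# Toy preference: humans like short answers that give a rationale.
def featurize(text):
    return [1.0 if "because" in text else 0.0, -len(text) / 100.0]

pairs = [
    ("short, because reasons", "a very long rambling answer with no rationale at all"),
    ("it works because X", "it just works, trust me, honestly it really does"),
]
w = train_reward_model(pairs, featurize, dim=2)

def reward(text):
    return sum(wi * fi for wi, fi in zip(w, featurize(text)))
```

The fitted `reward` then scores unseen outputs, and that score is what the policy is optimized against, which is why this only works when you can collect enough human comparisons for the task.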
RLAIF is interesting because it depends on the constraint that verification is actually a decent bit easier than generation. Because it feels like, okay, what are you doing? Are you using this language model to look at the language model's outputs and then improve the language model?
But no, it actually may work if the language model has a much easier time verifying some solution than it does generating it. Then you could perhaps get this kind of recursive loop. I don't think it's going to look exactly like that. The other thing we kind of do is a little bit of a mix of RLAIF and RLHF, where usually the model is actually quite correct.
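The verification-easier-than-generation loop can be sketched as follows. Both `generate` and `verify` here are toy stand-ins (a real system would sample and judge with models); the point is only the shape of the loop: grade your own samples with the cheaper verification step, and turn the grades into preference pairs.

```python
# Hedged sketch of the RLAIF loop: generation is hard, verification is
# easy, so the generator's own samples get graded and the graded pairs
# become reward-model training data. Stand-ins, not real model calls.

def generate(prompt, n=4):
    # Stand-in generator: guesses at "what is 7 * 8".
    # A real system would sample n completions from an LLM here.
    return ["54", "56", "57", "56"]

def verify(prompt, answer):
    # Verification is much cheaper than generation in this toy:
    # checking an answer vs. producing it from scratch.
    return answer == "56"

def rlaif_labels(prompt):
    """Turn verified samples into (chosen, rejected) preference pairs."""
    samples = generate(prompt)
    good = [s for s in samples if verify(prompt, s)]
    bad = [s for s in samples if not verify(prompt, s)]
    return [(g, b) for g in good for b in bad]

pairs = rlaif_labels("what is 7 * 8")
print(pairs)  # every chosen element passed verification
```

If verification really is easier than generation, each pass of this loop yields training signal the generator couldn't have produced reliably on its own, which is the recursive-improvement hope mentioned above.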
And this is the case with Cursor Tab: picking between two possible generations, which one is better. Then it just needs a little bit of human nudging, with only on the order of 50 to 100 examples, to align the prior the model has with exactly what you want. That looks different from normal RLHF, where you're usually training these reward models on tons of examples.
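The pick-between-two setup described here can be sketched like this. Everything below is a hypothetical stand-in, not Cursor's implementation: `base_score` plays the role of the model's own prior over completions, and the single bias term plays the role of a small correction fit on a handful of human choices.

```python
# Sketch of "mostly-correct prior + small human nudge": the base model's
# own score already ranks two candidate completions well, and a tiny
# learned bias (fit on ~50-100 human picks) adjusts it. Illustrative only.

def base_score(completion):
    # Stand-in for the model's own log-probability of the completion.
    return -len(completion) * 0.01

def nudged_score(completion, bias_weight=0.5):
    # Hypothetical human-fit correction: say humans preferred completions
    # that keep the user's existing indentation. One feature, one weight.
    keeps_indent = 1.0 if completion.startswith("    ") else 0.0
    return base_score(completion) + bias_weight * keeps_indent

def pick(completion_a, completion_b):
    """Return whichever of two candidate generations scores higher."""
    if nudged_score(completion_a) >= nudged_score(completion_b):
        return completion_a
    return completion_b

a = "    return total"
b = "return total"
print(pick(a, b))  # the indented candidate wins despite being longer
```

Because the base prior does most of the work, a few dozen labeled comparisons are enough to set the correction, in contrast to the thousands of labels a from-scratch reward model would need.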