Aman Sanger
Podcast Appearances
I think that's the category that is most likely to result in massive gains.
Yeah, so RLHF is when the reward model you use is trained from labels you've collected from humans giving feedback. I think this works if you have the ability to get a ton of human feedback for the kind of task you care about.
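A minimal sketch of what that reward-model training step looks like, assuming a standard pairwise (Bradley-Terry) preference loss. This is illustrative only, not a description of any particular production setup; the toy encoder and random feature tensors are stand-ins for a real pretrained model and real encoded (prompt, completion) pairs.

```python
# Sketch of RLHF-style reward-model training: a human picks the better of two
# completions for the same prompt, and the reward model is trained so the
# chosen completion scores higher than the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Stand-in encoder: in practice this would be a pretrained transformer.
        self.encoder = nn.Linear(embed_dim, embed_dim)
        self.score_head = nn.Linear(embed_dim, 1)  # scalar reward

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score_head(torch.tanh(self.encoder(features))).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Maximize log sigmoid(r_chosen - r_rejected): the human-preferred
    # completion should receive the higher scalar reward.
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

# Toy batch standing in for encoded (prompt, completion) pairs.
model = RewardModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
chosen, rejected = torch.randn(32, 768), torch.randn(32, 768)
loss = preference_loss(model, chosen, rejected)
loss.backward()
opt.step()
```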
RLAIF is interesting because it depends on the constraint that verification is actually a decent bit easier than generation. Because it feels like, okay, what are you doing? Are you using this language model to look at the language model's outputs and then improve the language model?
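A rough sketch of that RLAIF loop under the stated assumption that judging two candidates is easier than producing a good one. The `generate` and `judge_prefers_first` callables are hypothetical stand-ins, not a real API; the AI-labeled triples they produce would feed the same kind of reward-model training as human labels would.

```python
# RLAIF sketch: use a model as a judge over its own candidate outputs to
# produce preference labels, then train on those labels.
from typing import Callable

def collect_ai_preferences(prompts: list[str],
                           generate: Callable[[str], str],
                           judge_prefers_first: Callable[[str, str, str], bool]):
    """Return (prompt, chosen, rejected) triples labeled by the model itself."""
    triples = []
    for prompt in prompts:
        a = generate(prompt)  # candidate 1
        b = generate(prompt)  # candidate 2 (e.g. sampled at higher temperature)
        if judge_prefers_first(prompt, a, b):
            triples.append((prompt, a, b))
        else:
            triples.append((prompt, b, a))
    return triples
```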
But no, it actually may work if the language model has a much easier time verifying some solution than it does generating it. Then you could perhaps get this kind of recursive loop. I don't think it's going to look exactly like that. The other thing you could do, and what we kind of do, is a little bit of a mix of RLAIF and RLHF, where usually the model is actually quite correct.
And this is, in the case of Cursor Tab, picking between two possible generations which is the better one. And then it just needs a little bit of human nudging, with only on the order of 50 to 100 examples, to kind of align the prior the model has with exactly what you want. It looks different than normal RLHF, I think, where you're usually training these reward models on tons of examples.
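One way to read that mixed recipe: lean on a large AI-labeled preference set to establish the model's prior, then nudge it with a small human-labeled set. The sketch below is an assumption about how such a two-stage schedule might look, reusing the `RewardModel` / `preference_loss` pieces from the earlier sketch; it is not Cursor's actual training code.

```python
# Two-stage preference training: many AI-labeled pairs, then a small human nudge.
import torch

def train_on_pairs(model, loss_fn, pairs, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for chosen, rejected in pairs:
        opt.zero_grad()
        loss_fn(model, chosen, rejected).backward()
        opt.step()

# Stage 1: a large AI-labeled preference set establishes the prior.
#   train_on_pairs(model, preference_loss, ai_labeled_pairs, lr=1e-4)
# Stage 2: ~50-100 human-labeled pairs align that prior with what you want,
# at a smaller learning rate so the nudge doesn't wash out stage 1.
#   train_on_pairs(model, preference_loss, human_labeled_pairs[:100], lr=1e-5)
```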
My intuition would just say, yeah, it should be. This is kind of going back to: if you believe P does not equal NP, then there's this massive class of problems that are much, much easier to verify given a proof than to actually prove. I wonder if the same thing will prove P not equal to NP, or P equal to NP. That would be really cool.
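A toy illustration of that verification/generation gap, using SAT as the example problem (my choice of example, not one from the conversation): checking a candidate assignment against a CNF formula takes one linear pass, while finding a satisfying assignment by brute force is exponential in the number of variables, and no known algorithm does fundamentally better if P does not equal NP.

```python
# Verifying a SAT "proof" (an assignment) is cheap; searching for one is not.
from itertools import product

# CNF formula as a list of clauses; each literal is (variable_index, is_positive).
# Encodes (x0 or not x1) and (x1 or x2) and (not x0 or not x2).
formula = [[(0, True), (1, False)], [(1, True), (2, True)], [(0, False), (2, False)]]

def verify(assignment: list[bool]) -> bool:
    # Fast: one pass over the clauses.
    return all(any(assignment[v] == pos for v, pos in clause) for clause in formula)

def generate(num_vars: int):
    # Slow: exhaustive search over 2**num_vars assignments.
    for bits in product([False, True], repeat=num_vars):
        if verify(list(bits)):
            return list(bits)
    return None

print(verify([True, True, False]))  # cheap check of a given candidate -> True
print(generate(3))                  # expensive search for a satisfying assignment
```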
I feel like I have much more to do there. It felt like the path to get to IMO was a little more clear, because it already could get a few IMO problems, and there was a bunch of low-hanging fruit, given the literature at the time, in terms of what tactics people could take. I think, for one, I'm much less versed in the space of theorem proving now.