Andrej Karpathy
I'm going to tell you at every single step of the way how well you're doing.
And this is basically the reason we don't have that.
It's tricky how you do that properly because you have partial solutions and you don't know how to assign credit.
So when you get the right answer, it's just an equality match to the answer.
Very simple to implement.
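A minimal sketch of that kind of outcome reward, in Python (the function name and the whitespace normalization are just illustrative assumptions):

    def outcome_reward(model_answer: str, reference_answer: str) -> float:
        # Outcome supervision: full reward only when the final answer matches the
        # reference exactly (after trivial normalization); no partial credit.
        return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0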
If you're doing basically process supervision, how do you assign, in an automatable way, partial credit assignment?
It's not obvious how you do it.
Lots of labs, I think, are trying to do it with these LLM judges.
So basically, you get LLMs to try to do it.
So you prompt an LLM, hey, look at a partial solution of a student.
How well do you think they're doing if the answer is this?
And they try to tune the prompt.
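A rough sketch of what such an LLM-judge reward could look like (the prompt wording and the call_llm helper are assumptions for illustration, not any lab's actual setup):

    JUDGE_PROMPT = (
        "You are grading a student's partial solution.\n"
        "Problem: {problem}\n"
        "Reference answer: {answer}\n"
        "Partial solution so far: {partial}\n"
        "On a scale from 0 to 1, how well is the student doing? Reply with a single number."
    )

    def judge_reward(problem: str, answer: str, partial: str, call_llm) -> float:
        # call_llm is an assumed callable that sends a prompt to the judge LLM and
        # returns its text reply; swap in whatever client you actually use.
        reply = call_llm(JUDGE_PROMPT.format(problem=problem, answer=answer, partial=partial))
        try:
            return min(max(float(reply.strip()), 0.0), 1.0)  # clamp score to [0, 1]
        except ValueError:
            return 0.0  # unparseable judge output earns no credit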
The reason that I think this is kind of tricky is quite subtle.
And it's the fact that anytime you use an LLM to assign a reward, those LLMs are giant things with billions of parameters and they're gameable.
And if you're reinforcement learning with respect to them, you will find adversarial examples for your LLM judges almost guaranteed.
You can't do this for too long.
You do maybe 10 or 20 steps and it might work, but you can't do 100 or 1,000. It's not obvious how, but basically the model will find little cracks; it will find all these spurious things in the nooks and crannies of the giant model and find a way to cheat it.
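To make that failure mode concrete, here is a toy sketch: a trivially simple scorer stands in for the LLM judge (a deliberate oversimplification), and blind random search stands in for the RL optimizer; the search quickly maxes out the reward with meaningless text.

    import random

    def toy_judge(solution: str) -> float:
        # Stand-in for an LLM judge: a gameable scorer that rewards text
        # containing tokens it associates with "good reasoning".
        score = 0.0
        if "therefore" in solution:
            score += 0.5
        if "=" in solution:
            score += 0.5
        return score

    # Blind search plays the role of the optimizer: within a few hundred tries
    # it produces gibberish that earns the maximum score.
    vocab = ["foo", "bar", "therefore", "=", "x", "qed", "lorem"]
    best, best_score = "", -1.0
    for _ in range(2000):
        candidate = " ".join(random.choices(vocab, k=12))
        score = toy_judge(candidate)
        if score > best_score:
            best, best_score = candidate, score
    print(best_score, repr(best))  # full reward for a meaningless string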
So one example that's prominent in my mind is, I think this was probably public, but basically, if you're using an LLM judge for a reward, you just give it a solution from a student and ask it whether the student got it right or not,