Andrej Karpathy
It's not obvious how you do it.
Lots of labs, I think, are trying to do it with these LLM judges.
So basically, you get LLMs to try to do it.
So you prompt an LLM, hey, look at a partial solution of a student.
How well do you think they're doing if the answer is this?
And they try to tune the prompt.
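A rough sketch of what such an LLM-judge reward could look like (illustrative only, not a description of any lab's actual pipeline; `call_llm` is a hypothetical helper that sends a prompt to the judge model and returns its text):

```python
import re
from typing import Callable

JUDGE_PROMPT = """You are grading a student's partial solution to a math problem.
Problem: {problem}
Reference answer: {answer}
Student's partial solution: {solution}
On a scale of 0 to 10, how well is the student doing? Reply with just the number."""


def judge_reward(
    problem: str,
    answer: str,
    solution: str,
    call_llm: Callable[[str], str],  # hypothetical: prompt in, judge's reply text out
) -> float:
    """Score a partial solution with an LLM judge, mapped to [0, 1]."""
    reply = call_llm(JUDGE_PROMPT.format(problem=problem, answer=answer, solution=solution))
    match = re.search(r"\d+(\.\d+)?", reply)
    score = float(match.group()) if match else 0.0
    return max(0.0, min(score / 10.0, 1.0))
```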
The reason that I think this is kind of tricky is quite subtle.
And it's the fact that anytime you use an LLM to assign a reward, those LLMs are giant things with billions of parameters and they're gameable.
And if you're reinforcement learning with respect to them, you will find adversarial examples for your LLM judges almost guaranteed.
You can't do this for too long.
You do maybe 10 or 20 steps and it might work, but you can't do 100 or 1,000.
The model will find little cracks; it will find all these spurious inputs in the nooks and crannies of the giant judge model and find a way to cheat it.
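To make the "only a few dozen steps" point concrete, here is a minimal sketch of a capped RL loop against a judge reward; the policy, rollout, and update pieces are hypothetical callables, not any particular lab's code:

```python
from typing import Callable, List, Tuple


def rl_against_judge(
    policy,
    problems: List[Tuple[str, str]],                 # (problem, reference answer) pairs
    sample_completion: Callable,                     # hypothetical: (policy, problem) -> completion text
    update_policy: Callable,                         # hypothetical: one policy-gradient-style update
    judge_reward: Callable[[str, str, str], float],  # judge-based scorer, e.g. the earlier sketch with call_llm bound
    max_steps: int = 20,                             # deliberately small: long runs tend to game the judge
):
    """Short RL loop against an LLM-judge reward, capped at a few dozen steps."""
    for step in range(max_steps):
        completions = [sample_completion(policy, prob) for prob, _ in problems]
        rewards = [
            judge_reward(prob, ans, comp)
            for (prob, ans), comp in zip(problems, completions)
        ]
        policy = update_policy(policy, problems, completions, rewards)
        print(f"step {step}: mean judge reward {sum(rewards) / len(rewards):.3f}")
    return policy
```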
So one example that's prominently in my mind is, I think this was probably public, but basically, if you're using an LLM judge for a reward, you just give it a solution from a student and ask it whether the student got it right or not.
We were training with reinforcement learning against that reward function, and it worked really well, and then suddenly the reward became extremely large.
It was a massive jump, and it hit a perfect score.
And you're looking at it like, wow, this means the student is perfect in all these problems.
It's fully solved math.
But actually what's happening is that when you look at the completions that you're getting from the model, they are complete nonsense.
They start out okay, and then they change to da-da-da-da-da-da-da.
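This kind of collapse is easy to miss if you only watch the reward curve. A crude, illustrative sanity check (not something mentioned in the conversation) is to flag completions that degenerate into a long run of one repeated token, like the "da-da-da" pattern described above, and route them for human inspection:

```python
def looks_degenerate(completion: str, max_repeat: int = 8) -> bool:
    """Flag completions that collapse into a long run of one repeated token."""
    tokens = completion.split()
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        # A long run of the same token is a strong hint the policy is gaming the judge,
        # not actually solving the problem.
        if run >= max_repeat:
            return True
    return False


# Example: a completion that starts out okay and then collapses.
print(looks_degenerate("First, factor the quadratic. " + "da " * 20))  # True
print(looks_degenerate("The answer is 42 because 6 * 7 = 42."))        # False
```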