Andrej Karpathy
๐ค SpeakerAppearances Over Time
Podcast Appearances
We were training with reinforcement learning against that reward function, and it worked really well, and then suddenly the reward became extremely large.
It was a massive jump, and it did perfect.
And you're looking at it like, wow, this means the student is perfect in all these problems.
It's fully solved math.
But actually what's happening is that when you look at the completions that you're getting from the model, they are complete nonsense.
They start out okay, and then they change to da-da-da-da-da-da-da.
So it's just like, oh, okay, let's take two plus three, and we do this and this, and then da-da-da-da-da-da-da-da.
And you're looking at it and it's like, this is crazy.
How is it getting a reward of one or 100%?
And you look at the LLM judge and it turns out that the, the, the, the, the is an adversarial examples for the model and it assigns 100% probability to it.
And it's just because this is an out-of-sample example to the LLM.
It's never seen it during training, and you're in pure generalization land.
It's never seen it during training, and in the pure generalization land, you can find these examples that break it.
Not even that.
Prompt injection is way too fancy.
You're finding adversarial examples, as they're called.
These are nonsensical solutions that are obviously wrong, but the model thinks they're amazing.
Yeah.
I think the labs are probably doing all that.
Like, okay, so the obvious thing is like the should not get 100% reward.