Andrej Karpathy
๐ค SpeakerAppearances Over Time
Podcast Appearances
So it's just like, oh, okay, let's take two plus three, and we do this and this, and then da-da-da-da-da-da-da-da.
And you're looking at it and it's like, this is crazy.
How is it getting a reward of one or 100%?
And you look at the LLM judge and it turns out that the, the, the, the, the is an adversarial examples for the model and it assigns 100% probability to it.
And it's just because this is an out-of-sample example to the LLM.
It's never seen it during training, and you're in pure generalization land.
It's never seen it during training, and in the pure generalization land, you can find these examples that break it.
Not even that.
Prompt injection is way too fancy.
You're finding adversarial examples, as they're called.
These are nonsensical solutions that are obviously wrong, but the model thinks they're amazing.
Yeah.
I think the labs are probably doing all that.
Like, okay, so the obvious thing is like the should not get 100% reward.
okay, well, take the, the, the, the, put in the training set of the LLM judge and say, this is not 100%, this is 0%.
You can do this.
But every time you do this, you get a new LLM and it still has adversarial examples.
There's infinity adversarial examples.
And I think probably if you iterate this a few times, it'll probably be harder and harder to find adversarial examples.
But I'm not 100% sure because this thing has a trillion parameters or whatnot.