Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Andrej Karpathy

๐Ÿ‘ค Speaker
3419 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

We were training with reinforcement learning against that reward function, and it worked really well, and then suddenly the reward became extremely large.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

It was a massive jump, and it did perfect.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

And you're looking at it like, wow, this means the student is perfect in all these problems.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

It's fully solved math.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

But actually what's happening is that when you look at the completions that you're getting from the model, they are complete nonsense.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

They start out okay, and then they change to da-da-da-da-da-da-da.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

So it's just like, oh, okay, let's take two plus three, and we do this and this, and then da-da-da-da-da-da-da-da.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

And you're looking at it and it's like, this is crazy.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

How is it getting a reward of one or 100%?

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

And you look at the LLM judge and it turns out that the, the, the, the, the is an adversarial examples for the model and it assigns 100% probability to it.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

And it's just because this is an out-of-sample example to the LLM.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

It's never seen it during training, and you're in pure generalization land.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

It's never seen it during training, and in the pure generalization land, you can find these examples that break it.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

Not even that.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

Prompt injection is way too fancy.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

You're finding adversarial examples, as they're called.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

These are nonsensical solutions that are obviously wrong, but the model thinks they're amazing.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

Yeah.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

I think the labs are probably doing all that.

Dwarkesh Podcast
Andrej Karpathy โ€” AGI is still a decade away

Like, okay, so the obvious thing is like the should not get 100% reward.