Andrej Karpathy

👤 Speaker
3433 total appearances

Podcast Appearances

Dwarkesh Podcast
Andrej Karpathy — AGI is still a decade away

It's not obvious how you do it.

Lots of labs, I think, are trying to do it with these LLM judges.

So basically, you get LLMs to try to do it.

So you prompt an LLM: hey, look at this partial solution from a student.

How well do you think they're doing if the answer is this?

And they try to tune the prompt.
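
The setup described here, prompting an LLM to grade a student's partial solution and using the score as a reward, can be sketched roughly as below. `call_llm` is a hypothetical stand-in for whatever chat-completion API a lab would use; it is stubbed out here so the example runs on its own, and the prompt wording is invented for illustration.

```python
# Minimal sketch of an LLM-judge reward function, assuming some chat API.
# `call_llm` is a hypothetical stand-in, stubbed to return a fixed score.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; stubbed to always answer '7' for the demo."""
    return "7"

JUDGE_PROMPT = (
    "Look at this partial solution from a student.\n"
    "Problem: {problem}\n"
    "Partial solution: {solution}\n"
    "Reference answer: {answer}\n"
    "On a scale of 0-10, how well is the student doing? Reply with a number."
)

def judge_reward(problem: str, solution: str, answer: str) -> float:
    """Score a partial solution with an LLM judge, mapped to [0, 1]."""
    reply = call_llm(
        JUDGE_PROMPT.format(problem=problem, solution=solution, answer=answer)
    )
    try:
        return max(0.0, min(10.0, float(reply.strip()))) / 10.0
    except ValueError:
        return 0.0  # unparseable judge output earns no reward

print(judge_reward("2 + 2 = ?", "2 + 2 = 4", "4"))  # stubbed judge -> 0.7
```

"Tuning the prompt" then means iterating on `JUDGE_PROMPT` until the judge's scores track human grades, which is exactly the part that turns out to be fragile under optimization pressure.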

The reason that I think this is kind of tricky is quite subtle.

And it's the fact that anytime you use an LLM to assign a reward, those LLMs are giant things with billions of parameters and they're gameable.

And if you're reinforcement learning with respect to them, you will find adversarial examples for your LLM judges almost guaranteed.

You can't do this for too long.

You do maybe 10 or 20 steps and maybe it will work, but you can't do 100 or 1,000 steps.

It's not obvious why, but basically the model will find little cracks,

it will find all these spurious things in the nooks and crannies of the giant model and find a way to cheat it.

So one example that's prominent in my mind, and I think this is probably public, is using an LLM judge for the reward: you give it a solution from a student and ask it whether the student got it right or not.

We were training with reinforcement learning against that reward function, and it worked really well, and then suddenly the reward became extremely large.

It was a massive jump, and it was doing perfectly.

And you're looking at it like, wow, this means the student is perfect in all these problems.

It's fully solved math.

But actually what's happening is that when you look at the completions that you're getting from the model, they are complete nonsense.

They start out okay, and then they change to da-da-da-da-da-da-da.
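
The failure mode in this anecdote can be reproduced in miniature. The sketch below is a toy, not the actual incident: `proxy_reward` is an invented stand-in for a gameable LLM judge, with a deliberate spurious preference (it likes the token "da") playing the role of the judge's adversarial crack. A trivial hill-climbing "policy" then drives the reward to its maximum with completions that start out okay and degenerate into nonsense.

```python
# Toy, deterministic illustration of reward hacking against a flawed judge.
# `proxy_reward` is an invented stand-in for a gameable LLM judge: it rewards
# the correct answer, but also (spuriously) rewards the token "da".

import random

VOCAB = ["the", "answer", "is", "4", "da", "therefore", "2", "+"]

def proxy_reward(tokens: list[str]) -> float:
    """Flawed judge stand-in; the 'da' term is the crack the optimizer finds."""
    score = 0.0
    if "4" in tokens:                                  # intended signal
        score += 0.5
    score += 0.5 * min(1.0, tokens.count("da") / 10)   # spurious signal
    return score

rng = random.Random(0)
completion = ["the", "answer", "is", "4"]   # starts out okay...
for _ in range(1000):                       # ...then RL-ish hill climbing
    candidate = completion + [rng.choice(VOCAB)]
    if proxy_reward(candidate) > proxy_reward(completion):
        completion = candidate              # keep any strict improvement

print(proxy_reward(completion))   # 1.0, a "perfect" score
print(completion)                 # correct prefix, then da da da ...
```

The optimizer never touches the sensible prefix, because that part genuinely earns reward; it just pads the completion with the judge's spurious favorite until the score saturates, which is the same shape as the sudden reward jump described above.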