
Andrej Karpathy

👤 Speaker
3433 total appearances


Podcast Appearances

Dwarkesh Podcast
Andrej Karpathy — AGI is still a decade away

And then maybe you get an answer.

And now you check the back of the book and you see, okay, the correct answer is this.

And then you can see that, okay, this one, this one, and that one got the correct answer, but these other 97 of them didn't.

So literally what reinforcement learning does is it goes to the ones that worked really well, and every single thing you did along the way, every single token gets up-weighted of, like, do more of this.
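The sampling-and-scoring loop described here can be sketched in a few lines. This is a toy sketch only: `toy_policy`, `rollout`, and `check_answer` are hypothetical stand-ins, not anyone's actual training code.

```python
import random

# Toy sketch of outcome-based RL as described above: sample many
# rollouts, score only the final answer against the back of the book,
# and give every token of a successful rollout the same +1 weight.
# All names here are hypothetical.

def toy_policy(state):
    """Stand-in for a language model: emits a random token."""
    return random.choice(["a", "b", "c"])

def rollout(policy, prompt, length=10):
    """Sample one fixed-length solution attempt from the policy."""
    tokens, state = [], list(prompt)
    for _ in range(length):
        token = policy(state)
        tokens.append(token)
        state.append(token)
    return tokens

def check_answer(tokens, correct="c"):
    """Binary outcome reward: 1.0 if the final token matches, else 0.0."""
    return 1.0 if tokens[-1] == correct else 0.0

# 100 rollouts; only the handful that end correctly get reinforced.
rollouts = [rollout(toy_policy, []) for _ in range(100)]
rewards = [check_answer(r) for r in rollouts]

# Every token in a correct rollout is up-weighted, wrong alleys included.
token_weights = [(tok, rew) for r, rew in zip(rollouts, rewards) for tok in r]
```

Note that the reward is attached per rollout, not per token: every token in a rollout inherits the same weight, which is exactly the point being made in the quotes that follow.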

The problem with that is, I mean, people will say that your estimator has high variance, but, I mean, it's just noisy.

It's noisy.

So basically, it kind of almost assumes that every single little piece of the solution that arrived at the right answer was the correct thing to do, which is not true.

Like, you may have gone down the wrong alleys until you write the right solution.

Every single one of those incorrect things you did, as long as you got to the correct solution, will be up-weighted as do more of this.

It's terrible.

It's noise.

You've done all this work, and at the end you get a single number of like, oh, you did correct.

And based on that, you weigh that entire trajectory as like up-weight or down-weight.

And so the way I like to put it is you're sucking supervision through a straw. You've done all this work, which could be a minute of rollout, and you're sucking the bits of supervision of the final reward signal through a straw, broadcasting that across the entire trajectory and using it to up-weight or down-weight that trajectory.
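The "straw" can be made concrete with a small illustration. This is a hypothetical sketch of a REINFORCE-style surrogate loss; `reinforce_loss` and the numbers are invented for the example, not the actual objective used in any particular system.

```python
# Hypothetical illustration of "supervision through a straw":
# a long rollout produces thousands of per-token log-probs, but the
# only learning signal is one scalar reward, broadcast identically
# across every token. All names here are made up for the sketch.

def reinforce_loss(token_logprobs, final_reward, baseline=0.0):
    """REINFORCE-style surrogate loss: the same (reward - baseline)
    scales every token's log-prob, whether that token helped or hurt."""
    advantage = final_reward - baseline
    return -sum(lp * advantage for lp in token_logprobs)

# A minute-long rollout might span thousands of tokens...
logprobs = [-0.5] * 2000
# ...while the reward is a single bit: correct (1.0) or not (0.0).
loss_correct = reinforce_loss(logprobs, final_reward=1.0)  # pushes up all 2000 tokens
loss_wrong = reinforce_loss(logprobs, final_reward=0.0)    # zero advantage, no gradient
```

One scalar is stretched over two thousand decisions; no token gets credit or blame individually, which is the criticism the surrounding quotes are making.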

It's crazy.

A human would never do this.

Number one, a human would never do hundreds of rollouts.

Number two, when a person sort of finds a solution, they will have a pretty complicated process of review of like, okay, I think these parts that I did well, these parts I did not do that well.

I should probably do this or that.