Andrej Karpathy
I will say that...
There have been some papers that I thought were interesting that actually look at the mechanisms behind in-context learning.
And I do think it's possible that in-context learning actually runs a small gradient descent loop internally in the layers of the neural network.
And so I recall one paper in particular where they were doing linear regression, actually, using in-context learning.
So basically, your inputs into the neural network are (x, y) pairs.
x, y, x, y, x, y, that all happen to lie on a line.
And then you give it an x and you expect the y. And the neural network, when you train it in this way, actually does do linear regression.
And normally, when you run linear regression, you have a small gradient descent optimizer that basically looks at the (x, y) pairs, looks at the error, calculates the gradient with respect to the weights, and does the update a few times.
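To make that concrete, here is a minimal sketch in Python/NumPy with made-up numbers, not any specific paper's recipe: the interleaved x, y prompt a transformer would see, and the small explicit gradient descent loop that classical linear regression runs as the baseline.

```python
# A minimal sketch of the setup described above (illustrative numbers, not a
# particular paper's exact recipe): the interleaved (x, y) "prompt" a transformer
# would see, and the small gradient descent loop of ordinary linear regression.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground-truth line y = w*x + b; the model never sees w, b directly.
w_true, b_true = 2.5, -0.7
xs = rng.uniform(-1.0, 1.0, size=8)
ys = w_true * xs + b_true

# In-context format: x, y, x, y, ..., then a query x the model should complete.
x_query = 0.4
tokens = []
for x, y in zip(xs, ys):
    tokens.extend([x, y])
tokens.append(x_query)  # a trained transformer would be asked to emit y_query next

# The classical baseline: a small gradient descent loop on squared error.
w, b = 0.0, 0.0
lr = 0.2
for _ in range(300):
    pred = w * xs + b
    err = pred - ys                 # look at the error
    grad_w = (err * xs).mean()      # gradient with respect to the weights
    grad_b = err.mean()
    w -= lr * grad_w                # do the update
    b -= lr * grad_b

print(f"recovered w={w:.3f}, b={b:.3f}; prediction at x_query: {w * x_query + b:.3f}")
```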
It just turns out that when they looked at the weights of that in-context learning algorithm,
they actually found some analogies to gradient descent mechanics.
In fact, I think the paper went even further, because they actually hard-coded the weights of a neural network to do gradient descent through attention and all the internals of the neural network.
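As a small numerical check of the identity that makes that kind of hard-coding possible, not the paper's actual construction: one explicit gradient descent step changes the prediction at the query point by an attention-like weighted sum over the context examples, with their residual errors acting as values and dot products with the query acting as scores.

```python
# Numerical check (not the paper's exact construction): one gradient descent step
# on the in-context examples updates the query prediction by an attention-style
# weighted sum -- residual errors as "values", dot products x_i . x_q as "scores".
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 16                         # illustrative dimensions
X = rng.normal(size=(n, d))          # context inputs
w_star = rng.normal(size=d)
y = X @ w_star                       # context targets on a (hyper)plane
x_q = rng.normal(size=d)             # query input

w0 = np.zeros(d)                     # current weights
eta = 0.01                           # step size

# Path A: take one explicit gradient descent step on 0.5 * sum of squared errors.
grad = X.T @ (X @ w0 - y)
w1 = w0 - eta * grad
pred_gd = w1 @ x_q

# Path B: never form w1; update the query prediction directly with an
# attention-like weighted sum over the context examples.
errors = y - X @ w0
scores = X @ x_q
pred_attn = w0 @ x_q + eta * errors @ scores

print(pred_gd, pred_attn)            # identical up to floating point error
assert np.allclose(pred_gd, pred_attn)
```

Because the second path only needs errors and dot products, it is the kind of computation an attention layer can express, which is presumably what lets the weights be set by hand to mimic a gradient descent step.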
So I guess that's just my only pushback: who knows how in-context learning works, but I actually think it's probably doing a little bit of some kind of funky gradient descent internally, and I think that's possible.
So I guess I was only pushing back on your saying it's not doing in-context learning.
Who knows what it's doing, but it's probably doing something similar, but we don't know.
I think I kind of agree.
I mean, the way I usually put this is that for anything that happens during the training of the neural network, the knowledge is only kind of like a hazy recollection of what happened at training time.
And that's because the compression is dramatic.
You're taking 15 trillion tokens and compressing them into your final network of just a few billion parameters.
So obviously there's a massive amount of compression going on.
So I kind of refer to it as like a hazy recollection of the internet documents.
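As rough back-of-envelope arithmetic only, with an assumed bytes-per-token figure and an assumed parameter count of 8B standing in for "a few billion", the ratio comes out to a few thousand to one.

```python
# Back-of-envelope arithmetic only; bytes-per-token, parameter count, and weight
# precision are assumptions, not figures from the conversation (beyond the
# 15 trillion tokens and "a few billion parameters" mentioned above).
tokens          = 15e12     # 15 trillion training tokens
bytes_per_token = 4         # assume ~4 bytes of raw text per token
params          = 8e9       # "a few billion" parameters, say 8B
bytes_per_param = 2         # assume 16-bit weights

data_bytes  = tokens * bytes_per_token    # ~60 TB of raw text
model_bytes = params * bytes_per_param    # ~16 GB of weights

print(f"data: {data_bytes/1e12:.0f} TB, model: {model_bytes/1e9:.0f} GB, "
      f"ratio ~{data_bytes/model_bytes:,.0f}x")   # on the order of a few thousand to one
```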