Dwarkesh Patel

👤 Person
12212 total appearances

Podcast Appearances

Dwarkesh Podcast
Andrej Karpathy — AGI is still a decade away

The situation in which these models seem the most intelligent is when I talk to them and I'm like, wow, there's really something on the other end that's responding to me, thinking about things.

If it makes a mistake, it's like, oh wait, that's actually the wrong way to think about it.

I'm backing up.

All that is happening in context.

That's where I feel like you can visibly see the real intelligence.

And that in-context learning process is developed by gradient descent during pre-training, right?

It spontaneously meta-learns in-context learning.

But the in-context learning itself is not gradient descent, in the same way that our lifetime intelligence as humans, our ability to do things, is conditioned by evolution.

But our actual learning during our lifetime is happening through some other process.

I actually don't fully agree with that, but you should continue with that.

Okay.

Actually, then I'm very curious to understand how that analogy breaks down.

So then it's worth thinking about: if in-context learning and pre-training are both implementing something like gradient descent, why does it feel like with in-context learning we're actually getting this continual-learning, real-intelligence-like thing, whereas you don't get the analogous feeling just from pre-training?

At least you could argue that.
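
The idea that in-context learning "implements something like gradient descent" has a concrete toy version in the research literature (e.g. von Oswald et al., 2023, "Transformers Learn In-Context by Gradient Descent"): a single softmax-free linear-attention layer, with suitably chosen keys, values, and query, reproduces one gradient-descent step on a least-squares loss over the context examples. The snippet below is a minimal numerical sketch of that construction, not anything shown in the episode; the dimensions and learning rate are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 8, 32, 0.1                # feature dim, context length, learning rate

# Toy in-context regression task: context pairs (x_i, y_i) plus a query x_q.
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))           # context inputs x_1..x_N
y = X @ w_true                        # context targets y_i = w_true . x_i
x_q = rng.normal(size=d)              # query input

# (1) One explicit gradient-descent step on L(w) = (1/2N) * sum_i (w.x_i - y_i)^2,
#     starting from w = 0.
grad_at_zero = -(1.0 / N) * (X.T @ y)
w_one_step = -eta * grad_at_zero
pred_gradient_descent = w_one_step @ x_q

# (2) One pass of linear (softmax-free) attention over the context,
#     with keys = x_i, values = y_i, and query = (eta/N) * x_q.
keys, values, query = X, y, (eta / N) * x_q
pred_linear_attention = values @ (keys @ query)   # sum_i y_i * (x_i . query)

# The two predictions coincide: this attention layer performs one GD step in-context.
assert np.allclose(pred_gradient_descent, pred_linear_attention)
print(pred_gradient_descent, pred_linear_attention)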

And so if it's the same algorithm, what could be different?

Well, one way you can think about it is: how much information does the model store per unit of information it receives from training?

And if you look at pre-training, if you look at Llama 3, for example, I think it's trained on 15 trillion tokens.

And if you look at a 70B model, that works out to the equivalent of about 0.07 bits per token that it sees in pre-training, in terms of the information in the weights of the model compared to the tokens it reads.
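
As a back-of-the-envelope check of that figure (a sketch assuming 16-bit weights, which is the assumption that reproduces the quoted 0.07; the true information content of the weights is a subtler question):

# Rough arithmetic behind the ~0.07 bits-per-token figure for Llama 3 70B.
params = 70e9                     # parameter count of the 70B model
pretraining_tokens = 15e12        # ~15 trillion pre-training tokens
bits_in_weights = params * 16     # assuming 16-bit (2-byte) weights
print(bits_in_weights / pretraining_tokens)   # ~0.075 bits of weight capacity per token seen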

Whereas if you look at the KV cache and how it grows per additional token in in-context learning, it's around 320 kilobytes per token.
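
That 320-kilobyte figure matches the per-token KV-cache growth implied by Llama 3 70B's architecture, assuming 80 layers, 8 grouped-query KV heads of dimension 128, and 2-byte (fp16/bf16) cache entries:

# Per-token KV-cache growth for a Llama-3-70B-like configuration.
n_layers, n_kv_heads, head_dim, bytes_per_entry = 80, 8, 128, 2
kv_bytes_per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_entry   # x2 for K and V
print(kv_bytes_per_token / 1024)   # 320.0 KB per additional context token

# For comparison: 320 KB is about 2.6 million bits per token of context,
# versus ~0.075 bits per token absorbed into the weights during pre-training.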