
Andrej Karpathy

Speaker
3419 total appearances

Podcast Appearances

Dwarkesh Podcast
Andrej Karpathy — AGI is still a decade away

because there's lots of examples of it in the training sets of these models.

So there are features of things where the models will do very well.

I would say NanoChat is not an example of this, because it's a fairly unique repository.

There's not that much code, I think, in the way that I've structured it.

And it's not boilerplate code.

It's actually intellectually intense code, almost.

And everything has to be very precisely arranged.

And the models kept trying to... I mean, they have so many cognitive deficits, right?

So one example: they keep misunderstanding the code because they have too much memory of all the typical ways of doing things on the internet that I just wasn't adopting.

So the models, for example, I don't know if I want to get into the full details, but they keep thinking I'm writing normal code, and I'm not.

Maybe one example.

So the way to synchronize: we have eight GPUs that are all doing the forward and backward passes.

The way to synchronize gradients between them is to use the DistributedDataParallel (DDP) container in PyTorch, which, as you're doing the backward pass, automatically starts communicating and synchronizing gradients.
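
For context, the standard approach he's describing looks roughly like the sketch below in PyTorch. This is a minimal illustration with a toy model and a placeholder batch, not code from NanoChat; it assumes one process per GPU, as launched by something like torchrun.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU, typically launched with torchrun.
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Toy model and batch standing in for the real network and data.
model = torch.nn.Linear(1024, 1024).cuda(rank)
model = DDP(model, device_ids=[rank])  # registers the gradient all-reduce hooks
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(32, 1024, device=f"cuda:{rank}")
loss = model(x).square().mean()
loss.backward()        # DDP overlaps gradient communication with the backward pass
optimizer.step()
optimizer.zero_grad()

dist.destroy_process_group()
```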

I didn't use DDP because I didn't want to; it's not necessary.

So I threw it out.

And I basically wrote my own synchronization routine that's inside the step of the optimizer.
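
He doesn't spell out the routine here, but a hand-rolled replacement for DDP generally looks something like the sketch below: all-reduce each parameter's gradient and average it across ranks right before the optimizer update. The function name and structure are illustrative, not his actual NanoChat implementation.

```python
import torch
import torch.distributed as dist

def step_with_grad_sync(model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Manually average gradients across ranks, then take the optimizer step.

    Hypothetical sketch of a DDP-free synchronization routine, not the NanoChat code.
    """
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients over all GPUs
            p.grad.div_(world_size)                        # turn the sum into a mean
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```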

And so the models were trying to get me to use the DDP container, and they were very concerned about, okay, this gets way too technical, but I wasn't using that container because I don't need it, and I have a custom implementation of something like it.

Yeah, they couldn't get past that.

And then they kept trying to mess up the style. They're way too over-defensive: they make all these try-catch statements, they keep trying to make a production codebase. I have a bunch of assumptions in my code, and that's okay; I don't need all this extra stuff in there. So I just kind of feel like they're bloating the codebase, they're bloating the complexity, they keep misunderstanding, and they're using deprecated APIs a bunch of times. So it's a total mess.

And it's just not that useful.
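
To make the style complaint concrete, here is a hedged illustration of the pattern being described, contrasted with the leaner assumption-based style; both functions are made up for illustration and are not from NanoChat.

```python
import os
import torch

# What the models tend to produce: wrap everything and swallow failures.
def load_checkpoint_defensive(path: str):
    try:
        return torch.load(path)
    except FileNotFoundError:
        print(f"warning: checkpoint {path} not found, continuing without it")
        return None

# The leaner style being contrasted: state the assumption and fail loudly if it breaks.
def load_checkpoint(path: str):
    assert os.path.exists(path), f"expected a checkpoint at {path}"
    return torch.load(path)
```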