Andrej Karpathy
bit of a general-purpose computer that is also trainable and very efficient to run on our hardware.
And so this paper came out in 2016, I want to say.
Yeah, I'm not sure if the authors were aware of the impact that that paper would go on to have.
Probably they weren't.
But I think they were aware of some of the motivations and design decisions behind the Transformer, and they chose not to, I think, expand on it in that way in the paper.
And so I think they had an idea that there was more than just the surface of just like, oh, we're just doing translation and here's a better architecture.
You're not just doing translation.
This is like a really cool, differentiable, optimizable, efficient computer that you've proposed.
And maybe they didn't have all of that foresight, but I think it's really interesting.
"Attention Is All You Need."
Yeah, it's like a meme or something, basically.
Honestly, yeah, there is an element of me that honestly agrees with you and prefers it this way.
Yes.
If it was too grand, it would overpromise and then underdeliver potentially.
So you want to just meme your way to greatness.
You want to have a general-purpose computer that you can train on arbitrary problems, like, say, the task of next-word prediction or detecting if there's a cat in an image or something like that.
And you want to train this computer, so you want to set its weights.
And I think there's a number of design criteria that sort of overlap in the Transformer simultaneously that made it very successful.
And I think the authors were kind of deliberately trying to make this a really powerful architecture.