Andrej Karpathy
If it were too grand, it would overpromise and then potentially underdeliver.
So you want to just meme your way to greatness.
You want to have a general purpose computer that you can train on arbitrary problems, like say the task of next word prediction or detecting if there's a cat in an image or something like that.
And you want to train this computer, so you want to set its weights.
And I think there's a number of design criteria that sort of overlap in the transformer simultaneously that made it very successful.
And I think the authors were kind of deliberately trying to make this a really powerful architecture.
And so basically, it's very powerful in the forward pass because it's able to express very general computation as something that looks like message passing.
You have nodes and they all store vectors.
And these nodes get to basically look at each other and each other's vectors.
And they get to communicate.
And basically nodes get to broadcast, hey, I'm looking for certain things.
And then other nodes get to broadcast, hey, these are the things I have.
Those are the keys and the values.
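The message-passing picture above can be sketched in a few lines of numpy. This is a minimal illustration, not any particular library's implementation: each node (row of `X`) emits a query ("what I'm looking for"), a key ("what I have"), and a value (the content it shares); the weight matrices and sizes here are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Per-node query, key, and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Each query is matched against every key: "what I'm looking
    # for" vs. "what I have", scaled by sqrt of the key dimension.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Each node decides how interesting every other node is...
    weights = softmax(scores, axis=-1)
    # ...and updates itself with a weighted mix of their values.
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8                                   # 4 nodes, 8-dim vectors (arbitrary)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated vector per node
```

Note that the attention weights for each node sum to one, so every update is a convex combination of the values the other nodes broadcast.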
So it's not just attention.
Yeah, exactly.
Transformer is much more than just the attention component.
It's got many architectural pieces that went into it: the residual connections, the way they're arranged, the multi-layer perceptron in there, the way it's stacked, and so on.
But basically, there's a message passing scheme where nodes get to look at each other, decide what's interesting, and then update each other.
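Putting the pieces together, here is a minimal numpy sketch of one such block and a small stack of them. It is deliberately stripped down (no layer norm, no multi-head split, made-up sizes and random weights) just to show the arrangement being described: attention and an MLP, each wrapped in a residual connection, with blocks stacked on top of each other.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class Block:
    """One simplified transformer block: attention + MLP, each with a residual."""
    def __init__(self, d, rng):
        self.Wq, self.Wk, self.Wv = (rng.standard_normal((d, d)) * 0.1
                                     for _ in range(3))
        self.W1 = rng.standard_normal((d, 4 * d)) * 0.1   # MLP expands 4x...
        self.W2 = rng.standard_normal((4 * d, d)) * 0.1   # ...then projects back

    def __call__(self, X):
        # Message passing: nodes look at each other and exchange values.
        Q, K, V = X @ self.Wq, X @ self.Wk, X @ self.Wv
        A = softmax(Q @ K.T / np.sqrt(X.shape[-1])) @ V
        X = X + A                                          # residual around attention
        X = X + np.maximum(0, X @ self.W1) @ self.W2       # residual around MLP (ReLU)
        return X

rng = np.random.default_rng(0)
d = 8
blocks = [Block(d, rng) for _ in range(3)]  # blocks are stacked
X = rng.standard_normal((4, d))             # 4 nodes, each storing a vector
for b in blocks:
    X = b(X)
print(X.shape)  # (4, 8): same shape in and out, so blocks stack freely
```

The residual connections mean each block only adds an update to the running representation, which is part of why these blocks stack well.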