Andrej Karpathy
And so basically, it's very powerful in the forward pass because it's able to express very general computation as sort of something that looks like message passing.
You have nodes and they all store vectors.
And these nodes get to basically look at each other and each other's vectors.
And they get to communicate.
And basically nodes get to broadcast, hey, I'm looking for certain things.
And then other nodes get to broadcast, hey, these are the things I have.
Those are the keys and the values.
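To make the message-passing picture concrete, here is a minimal single-head attention sketch in plain Python. It is an illustration under stated assumptions, not the actual Transformer implementation: the function names (`softmax`, `dot`, `attention`) and the toy vectors are hypothetical, and the queries, keys, and values are passed in directly, whereas a real Transformer would produce them from each node's vector with learned linear projections.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Each node's query is scored against every node's key; the
    resulting weights mix the values into that node's updated vector."""
    d = len(keys[0])
    updated = []
    for q in queries:
        # "I'm looking for certain things": compare this query to all keys.
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # "These are the things I have": weighted sum of the values.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        updated.append(mixed)
    return updated

# Three nodes with toy 2-d vectors (illustrative values only).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
updated = attention(queries, keys, values)
```

Each row of `updated` is a weighted average of the value vectors, with more weight on the nodes whose keys match that node's query.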
So it's not just attention.
Yeah, exactly.
Transformer is much more than just the attention component.
It's got many architectural pieces that went into it.
The residual connection, the way it's arranged, there's a multi-layer perceptron in there, the way it's stacked, and so on.
But basically, there's a message passing scheme where nodes get to look at each other, decide what's interesting, and then update each other.
And so I think when you get to the details of it, I think it's a very expressive function.
So it can express lots of different types of algorithms in the forward pass.
Not only that, but the way it's designed with the residual connections, layer normalizations, the softmax attention and everything, it's also optimizable.
This is a really big deal because there's lots of computers that are powerful that you can't optimize or that are not easy to optimize using the techniques that we have, which is backpropagation and gradient descent.
These are first-order methods, very simple optimizers, really.
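As a rough illustration of what "first-order" means here, the sketch below runs plain gradient descent on a one-parameter toy loss. Everything in it is an assumption made for illustration: the quadratic `loss`, the hand-written `grad`, and the learning rate and step count. In real training the gradient would come from backpropagation through the network rather than from a hand-derived formula.

```python
def loss(w):
    # Toy loss with its minimum at w = 3.0 (illustrative only).
    return (w - 3.0) ** 2

def grad(w):
    # First derivative of the toy loss; the only information the optimizer uses.
    return 2.0 * (w - 3.0)

def gradient_descent(w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient; no second-order information."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

w_final = gradient_descent(0.0)
```

The update rule only ever looks at the slope at the current point, which is why an architecture has to be well-behaved under such simple steps for training to work.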
And so...