Andrej Karpathy
And so I think when you get to the details of it, I think it's a very expressive function.
So it can express lots of different types of algorithms in the forward pass.
Not only that, but the way it's designed with the residual connections, layer normalizations, the softmax attention and everything, it's also optimizable.
This is a really big deal because there are lots of powerful computational architectures that you can't optimize, or that are not easy to optimize, using the techniques that we have, which are backpropagation and gradient descent.
These are first-order methods, very simple optimizers, really.
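To make "first-order, very simple optimizer" concrete, here is a minimal sketch of plain gradient descent on a toy quadratic loss. The function and names are illustrative assumptions, not anything from the conversation: the entire optimizer is a single update rule that only uses the first derivative.

```python
# Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2(w - 3).
# Gradient descent is just: repeatedly step against the gradient.

def grad(w):
    # first derivative of the toy loss (the only signal the optimizer sees)
    return 2.0 * (w - 3.0)

w = 0.0   # initial parameter value (arbitrary)
lr = 0.1  # learning rate (step size)

for _ in range(100):
    w -= lr * grad(w)  # the whole "optimizer" is this one line

print(round(w, 4))  # converges toward the minimum at w = 3
```

Optimizing a transformer is the same loop at scale: backpropagation supplies the gradient for millions or billions of parameters, and the update rule stays first-order and simple.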
And so...
You also need it to be optimizable.
And then lastly, you want it to run efficiently in our hardware.
Our hardware is a massive throughput machine like GPUs.
They prefer lots of parallelism.
So you don't want to do lots of sequential operations.
You want lots of operations that can run in parallel.
And the Transformer is designed with that in mind as well.
And so it's designed for our hardware, and it's designed to be both very expressive in the forward pass and very optimizable in the backward pass.
Right.
Think of it as, so basically a transformer is a series of blocks, right?
And these blocks have attention and a little multi-layer perceptron.
And so you go off into a block and you come back to this residual pathway.
And then you go off and you come back.
And then you have a number of layers arranged sequentially.
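The shape described above can be sketched in a few lines of NumPy. This is a simplified, illustrative version under stated assumptions: single-head attention, a two-layer MLP, and pre-norm residual blocks; all function and parameter names are made up for the sketch, not a real library API.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # softmax attention: tokens mix information with each other
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def mlp(x, W1, W2):
    # the "little multi-layer perceptron" applied per token
    return np.maximum(0, x @ W1) @ W2

def block(x, params):
    # go off into attention, come back to the residual pathway,
    # then off into the MLP and back again
    Wq, Wk, Wv, W1, W2 = params
    x = x + attention(layer_norm(x), Wq, Wk, Wv)  # residual connection
    x = x + mlp(layer_norm(x), W1, W2)            # residual connection
    return x

rng = np.random.default_rng(0)
T, d = 4, 8  # sequence length and model width (tiny, for illustration)
params = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)] + [
    rng.standard_normal((d, 4 * d)) * 0.1,
    rng.standard_normal((4 * d, d)) * 0.1,
]
x = rng.standard_normal((T, d))
out = block(block(x, params), params)  # layers arranged sequentially
print(out.shape)  # residual stream keeps its shape through every block
```

Because each block only adds to the residual stream and never changes its shape, blocks can be stacked to any depth, which is exactly the "number of layers arranged sequentially" described above.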