Andrej Karpathy
You also need it to be optimisable.
And then lastly, you want it to run efficiently on our hardware.
Our hardware is a massive throughput machine like GPUs.
They prefer lots of parallelism.
So you don't want to do lots of sequential operations.
You want to do a lot of operations in parallel.
And the Transformer is designed with that in mind as well.
And so it's designed for our hardware and it's designed to both be very expressive in a forward pass, but also very optimisable in the backward pass.
Right.
Think of it as, so basically a transformer is a series of blocks, right?
And these blocks have attention and a little multi-layer perceptron.
And so you go off into a block and you come back to this residual pathway.
And then you go off and you come back.
And then you have a number of layers arranged sequentially.
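The block structure described above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_attention` and `toy_mlp` are hypothetical stand-ins with no learned weights, and layer normalization is left out. The point is the shape of the computation: each block goes off into a branch and adds the result back onto the residual pathway.

```python
import math

def toy_attention(x):
    # Hypothetical stand-in for self-attention: every position takes a
    # softmax-weighted average over all positions (one head, no learned weights).
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) for k in x]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[i] for w, v in zip(weights, x))
                    for i in range(len(q))])
    return out

def toy_mlp(x):
    # Hypothetical stand-in for the little MLP: a position-wise nonlinearity.
    return [[math.tanh(v) for v in row] for row in x]

def add(x, y):
    # Element-wise addition back onto the residual pathway.
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]

def block(x):
    x = add(x, toy_attention(x))  # go off into attention, come back
    x = add(x, toy_mlp(x))        # go off into the MLP, come back
    return x

def transformer(x, n_layers=3):
    # A number of blocks arranged sequentially.
    for _ in range(n_layers):
        x = block(x)
    return x
```

Calling `transformer([[1.0, 0.0], [0.0, 1.0]])` returns a list with the same shape as its input, since every block only adds to the residual pathway.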
And so the way to look at it, I think, is because of the residual pathway in the backward pass, the gradients sort of flow along it uninterrupted because addition distributes the gradient equally to all of its branches.
So the gradient from the supervision at the top just flows directly to the first layer.
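A small numerical sketch of that gradient argument, assuming a hypothetical scalar "layer" whose branch is just `w * x` with a small weight `w`. With the residual addition, each layer's local derivative is `1 + w`; without it, the derivative is only `w`, so the gradient shrinks rapidly with depth.

```python
def forward(x, n_layers, w, residual):
    # Hypothetical scalar stack: each layer's branch computes w * x.
    for _ in range(n_layers):
        x = (x + w * x) if residual else (w * x)
    return x

def numeric_grad(n_layers, w, residual, x=1.0, eps=1e-6):
    # Central finite difference of the output with respect to the input.
    hi = forward(x + eps, n_layers, w, residual)
    lo = forward(x - eps, n_layers, w, residual)
    return (hi - lo) / (2 * eps)

with_residual = numeric_grad(12, 0.01, residual=True)
without_residual = numeric_grad(12, 0.01, residual=False)
```

Here `with_residual` stays close to `1.01 ** 12`, slightly above 1, while `without_residual` is on the order of `0.01 ** 12`, which is effectively zero.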
And all the residual connections are arranged so that in the beginning during initialization, they contribute nothing to the residual pathway.
Mm-hmm.
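One way to get that behavior at initialization, assumed here for illustration rather than taken from the conversation, is to start the last weight of each branch at zero. The branch then outputs nothing and the block reduces to the identity. `w_out` is a hypothetical name for that final weight.

```python
import math

def residual_block(x, w_out):
    # The branch output is scaled by w_out before being added back.
    # With w_out == 0.0 the branch contributes nothing to the residual pathway.
    return [xi + w_out * math.tanh(xi) for xi in x]

x = [0.5, -1.0, 2.0]
at_init = residual_block(x, w_out=0.0)      # identical to x
after_some_training = residual_block(x, w_out=0.3)  # branch now contributes
```

As training moves `w_out` away from zero, the branch gradually starts contributing.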
So what it kind of looks like is, imagine the transformer is kind of like a Python function, like a def.
And you get to do various kinds of lines of code.