Andrej Karpathy
And so the way to look at it, I think, is that because of the residual pathway, in the backward pass the gradients sort of flow along it uninterrupted, because addition distributes the gradient equally to all of its branches.
So the gradient from the supervision at the top just flows directly to the first layer.
And all the residual connections are arranged so that in the beginning during initialization, they contribute nothing to the residual pathway.
Mm-hmm.
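Not anything from the conversation itself, but a minimal numpy sketch of the two claims above: addition passes the upstream gradient through unchanged, and a zero-initialized branch contributes nothing to the residual pathway at init. The zero-initialized projection `W` is an illustrative stand-in for a transformer block's output layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = np.zeros((4, 4))          # zero-init output projection: branch f(x) = W @ x is 0 at init

# Residual block: y = x + f(x)
y = x + W @ x
assert np.allclose(y, x)      # at init the block is the identity on the residual stream

# Addition distributes the upstream gradient to both branches:
# through the skip path dy/dx is the identity, so the gradient from
# the supervision at the top reaches the input unchanged, plus
# whatever the branch contributes (zero at init here).
g_top = np.ones(4)            # gradient arriving from above
g_skip = g_top                # skip connection: passed through as-is
g_branch = W.T @ g_top        # branch contribution: zero at init
g_x = g_skip + g_branch
assert np.allclose(g_x, g_top)
```

So at initialization the whole stack looks like the identity to the optimizer, and each layer only starts "writing" into the stream as its weights move away from zero.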
So what it kind of looks like is, imagine the transformer is kind of like a Python function, like a def.
And you get to do various lines of code in it.
Say you have a transformer 100 layers deep; typically they would be much shallower, say 20.
So you have 20 lines of code and you can do something in them.
And so during the optimization, basically what it looks like is first you optimize the first line of code, and then the second line of code can kick in, and the third line of code can kick in.
And I feel like because of the residual pathway and the dynamics of the optimization, you can learn a very short algorithm that gets the approximate answer, but then the other layers can kick in and start to create a contribution.
And at the end of it, you're optimizing over an algorithm that is 20 lines of code, except these lines of code are very complex, because each one is an entire block of a transformer.
You can do a lot in there.
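The "20 lines of code" picture above can be sketched as a plain loop; the names here are illustrative, not from any particular codebase.

```python
# Toy picture of the transformer body as "N lines of code": N residual
# blocks applied in sequence to one shared stream. Each block is one
# very expressive "line" that adds its contribution into the stream.
def transformer_body(x, blocks):
    for block in blocks:
        x = x + block(x)   # residual addition: the block writes into the stream
    return x

# At initialization every block contributes nothing, so the whole
# 20-"line" program is the identity; learning fills the lines in.
assert transformer_body(1.0, [lambda v: 0.0] * 20) == 1.0
```

The optimization dynamics described above then amount to the early "lines" becoming useful first, with later ones refining the approximate answer.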
What's really interesting is that this transformer architecture has actually been remarkably resilient.
Basically, the transformer that came out in 2017 is the transformer you would use today, except you reshuffle some of the layer norms.
The layer normalizations have been reshuffled to a pre-norm formulation.
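As a simplified illustration of that reshuffle (no learned scale/shift, and the block `f` left abstract): the original "post-norm" arrangement normalizes after the residual addition, while the modern "pre-norm" arrangement normalizes only the branch input, leaving the residual pathway itself a clean sum.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified layer norm: normalize to zero mean, unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def post_norm_block(x, f):
    # Original formulation: the norm sits on the residual pathway.
    return layer_norm(x + f(x))

def pre_norm_block(x, f):
    # Pre-norm reshuffle: the residual pathway stays an uninterrupted sum.
    return x + f(layer_norm(x))
```

With a branch that contributes nothing, the pre-norm block is exactly the identity, while the post-norm block still rescales the stream, which is one way to see why pre-norm keeps the residual pathway's gradient flow clean.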
And so it's been remarkably stable, but there's a lot of bells and whistles that people have attached to it and tried to improve it.
I do think that basically it's a big step in simultaneously optimizing for lots of properties of a desirable neural network architecture.
And I think people have been trying to change it, but it's proven remarkably resilient.
But I do think that there should be even better architectures potentially.