Andrej Karpathy
Podcast Appearances
Say you have a 100-layer-deep transformer; typically they would be much shallower, say 20 layers.
So you have 20 lines of code and you can do something in them.
And so during the optimization, basically what it looks like is first you optimize the first line of code, and then the second line of code can kick in, and the third line of code can kick in.
And I feel like because of the residual pathway and the dynamics of the optimization, you can learn a very short algorithm that gets the approximate answer, but then the other layers can kick in and start to create a contribution.
And at the end of it, you're optimizing over an algorithm that is 20 lines of code, except these lines of code are very complex, because each one is an entire block of a transformer.
You can do a lot in there.
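To make the "lines of code" picture concrete, here is a minimal sketch in plain Python, assuming each block simply adds its output onto a shared residual stream. The names `residual_forward`, `zero_block`, and `scale_block` are hypothetical, and the toy blocks stand in for full transformer blocks (attention plus MLP); this is an illustration of the idea, not an actual transformer.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def residual_forward(x: Vector, blocks: List[Block]) -> Vector:
    """Pass x through a stack of blocks; each block ADDS its output to x."""
    for block in blocks:
        update = block(x)                       # one "line of code"
        x = [xi + ui for xi, ui in zip(x, update)]  # residual pathway
    return x


def zero_block(x: Vector) -> Vector:
    """Hypothetical block that contributes nothing yet (has not 'kicked in')."""
    return [0.0] * len(x)


def scale_block(factor: float) -> Block:
    """Hypothetical toy block whose contribution is factor * x."""
    return lambda x: [factor * xi for xi in x]


# A 20-block stack in which only the first block contributes so far:
# effectively a one-line algorithm, with 19 lines still silent.
blocks: List[Block] = [scale_block(0.5)] + [zero_block] * 19
out = residual_forward([1.0, 2.0], blocks)
print(out)  # [1.5, 3.0]
```

In this toy setup a stack of all-zero blocks is exactly the identity function, so swapping any `zero_block` for a block with a nonzero contribution changes the output incrementally, which is roughly the "other layers can kick in" behavior described above.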
What's really interesting is that this transformer architecture actually has been remarkably resilient.
Basically, the transformer that came out in [2017] is the transformer you would use today, except you reshuffle some of the layer norms.
The layer normalizations have been reshuffled to a pre-norm formulation.
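As a rough sketch of what that reshuffling means: the original formulation normalizes after the residual addition, LayerNorm(x + Sublayer(x)), while the pre-norm formulation normalizes only the input to the sublayer, x + Sublayer(LayerNorm(x)). The plain-Python example below assumes a simplified layer norm with no learnable gain or bias, and the function names and the `sublayer` argument are hypothetical stand-ins for attention or the MLP.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Simplified layer norm: zero mean, unit variance, no learnable gain/bias."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def post_norm_step(x: Vector, sublayer: Sublayer) -> Vector:
    """Original ordering: LayerNorm(x + Sublayer(x))."""
    summed = [xi + si for xi, si in zip(x, sublayer(x))]
    return layer_norm(summed)


def pre_norm_step(x: Vector, sublayer: Sublayer) -> Vector:
    """Pre-norm ordering: x + Sublayer(LayerNorm(x))."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]


def zero_sublayer(x: Vector) -> Vector:
    """Hypothetical sublayer that outputs zeros, used to compare the two orderings."""
    return [0.0] * len(x)


x = [1.0, 2.0, 3.0]
print(pre_norm_step(x, zero_sublayer))   # x passes through unchanged
print(post_norm_step(x, zero_sublayer))  # x is rescaled by the norm
```

With a sublayer that outputs zeros, the pre-norm step leaves `x` untouched, so the residual pathway stays a clean identity path, whereas the post-norm step still rescales `x` on every layer.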
And so it's been remarkably stable, but there are a lot of bells and whistles that people have attached to it to try to improve it.
I do think that basically it's a big step in simultaneously optimizing for lots of properties of a desirable neural network architecture.
And I think people have been trying to change it, but it's proven remarkably resilient.
But I do think that there should be even better architectures potentially.
Currently, it definitely looks like the transformer is taking over AI, and you can feed basically arbitrary problems into it.
And it's a general, differentiable computer, and it's extremely powerful.
And this convergence in AI has been really interesting to watch for me personally.
Basically, right now, the zeitgeist is: do not touch the transformer.
Touch everything else.