Andrej Karpathy
Residual networks just came out.
So remarkably similar, I guess, but quite a bit different still.
I mean, Transformer was not around.
You know, all these sort of like more modern tweaks on the Transformer were not around.
So maybe some of the things that we can bet on, I think, by a sort of translational equivariance over 10 years, is that we're still training giant neural networks with a forward pass, a backward pass, and an update through gradient descent.
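The "forward, backward, update" loop he's betting on can be sketched in a few lines. This is a minimal illustration on a toy linear-regression problem, not anything from the conversation; the data, learning rate, and step count are all assumed example values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = 2x + 1 with a single linear neuron.
X = rng.uniform(-1, 1, size=(64, 1))
y = 2.0 * X + 1.0

w = np.zeros((1, 1))
b = np.zeros(1)
lr = 0.5  # learning rate (assumed example value)

for step in range(200):
    # Forward pass: predictions and mean-squared-error loss.
    pred = X @ w + b
    loss = np.mean((pred - y) ** 2)

    # Backward pass: gradients of the loss w.r.t. the parameters.
    grad_pred = 2.0 * (pred - y) / len(X)
    grad_w = X.T @ grad_pred
    grad_b = grad_pred.sum(axis=0)

    # Update: plain gradient descent.
    w -= lr * grad_w
    b -= lr * grad_b
```

The bet is that future systems still reduce to this loop, just at vastly larger scale and with many more refinements stacked on top.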
But maybe it looks a little bit different.
And it's just everything is much bigger.
Actually, recently, I also went back all the way to 1989, which was kind of a fun exercise for me a few years ago, because I was reproducing Yann LeCun's 1989 convolutional network, which was the first neural network I'm aware of trained via gradient descent, like a modern neural network trained with gradient descent, on digit recognition.
And I was just interested in, okay, how can I modernize this?
How much of this is algorithms?
How much of this is data?
How much of this progress is compute and systems?
And I was able to very quickly halve the error, just by time traveling 33 years of algorithmic progress.
So if I time travel just the algorithms back 33 years, I could adjust what Yann LeCun did in 1989, and I could basically halve the error.
But to get further gains, I had to add a lot more data.
I had to 10x the training set.
And then I had to actually add more computational optimizations.
I had to basically train for much longer with dropout and other regularization techniques.
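Dropout, one of the regularization techniques mentioned for training much longer without overfitting, is simple to sketch. This is an illustrative implementation of inverted dropout, not Karpathy's actual code; the keep probability is an assumed example value.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(x, p_drop=0.5, training=True):
    """Inverted dropout: randomly zero activations during training,
    scaling the survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return x  # at test time, dropout is a no-op
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

acts = np.ones((4, 8))        # stand-in layer activations
dropped = dropout(acts, 0.5)  # roughly half zeroed, the rest scaled to 2.0
```

Because the surviving activations are rescaled during training, no change is needed at inference time, which is why it combines cleanly with longer training runs.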
And so it's almost like all these things have to improve simultaneously.
So we're probably going to have a lot more data.