Gwern Branwen
Then it turns out that everywhere you go, compute and data and trial and error and serendipity just play enormous roles in how things actually happened.
Once you understand that, then you understand why compute comes first.
You can't do trial and error and serendipity without it.
You can write down all these beautiful ideas, but you just can't test them out.
So even a small difference in hyperparameters, or a small architectural choice, can make a huge difference to the results.
But when you can only run a few instances, you would typically end up concluding that it just doesn't work, or maybe you would give up and go away and do something else.
Whereas if you had more compute power, you can just keep trying.
Eventually you hit something that works great.
And once you have a working solution, you can kind of simplify it and improve it and figure out why it worked and get a nice robust solution that would work no matter what you did to it.
But until then you're stuck and you're just kind of like flailing around in this regime where nothing works.
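The trial-and-error point above can be sketched as a toy random search: if only a narrow band of hyperparameters works at all, a small compute budget usually reports total failure, while a large budget reliably lands in the band. The `train` function here is a hypothetical stand-in, not any real model:

```python
import random

def train(lr, width):
    """Toy stand-in for a training run: assume (hypothetically) that
    only a narrow hyperparameter band 'works' and everything else
    fails to converge."""
    if 5e-4 <= lr <= 2e-3 and width >= 128:
        return 0.95  # a run that works great
    return 0.10      # a failed, flailing run

def random_search(budget):
    """Best result found after `budget` independent trials."""
    best = 0.0
    for _ in range(budget):
        lr = 10 ** random.uniform(-5, -1)           # log-uniform learning rate
        width = random.choice([32, 64, 128, 256, 512])
        best = max(best, train(lr, width))
    return best

# Roughly 9% of random configs fall in the working band here, so a
# handful of trials usually all fail, while hundreds of trials almost
# surely find a winner.
print(random_search(5))
print(random_search(500))
```

The compute budget doesn't change the idea being tested; it only changes whether you ever see it succeed.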
You know, you can have this horrible experience now where you go back through the old deep learning literature and see all sorts of thoroughly contemporary ideas that people had back then, ideas which were completely correct, but they didn't have the compute to train what you now know would have worked.
You know, and it's tremendously tragic, right?
You go back and you can look at things like ResNets being published back in 1988 instead of 2015.
And it would have worked.
It did work, but it's such a small scale that it was irrelevant.
You couldn't use it for anything real, and it just got forgotten.
So you have to wait until 2015 for ResNets to actually come along and be a revolution in deep learning.
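For context, the core idea in question is just a skip connection: each block computes y = x + F(x), so the input signal (and its gradient) always has an identity path through the network. A minimal NumPy sketch of that idea, illustrative rather than the published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x): the block learns a residual correction on top
    of an identity shortcut."""
    return x + W2 @ relu(W1 @ x)

# Stack 50 blocks with small random weights: because of the identity
# path, the input survives instead of vanishing, which is what makes
# very deep stacks trainable at all.
x = rng.standard_normal(16)
h = x
for _ in range(50):
    W1 = 0.01 * rng.standard_normal((16, 16))
    W2 = 0.01 * rng.standard_normal((16, 16))
    h = residual_block(h, W1, W2)
print(np.linalg.norm(h - x))  # small: the signal passed straight through
```

The idea is simple enough to state in one line of code; what it needed in 2015, and lacked in 1988, was the compute to demonstrate it at a scale anyone would notice.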
So that's kind of the double bias of why you would believe that scaling was not going to work.
Because you didn't notice the results that were key in retrospect, like BigGAN scaling to 300 million images.
I think there's still people today who would tell you with a straight face that GANs can't scale past millions of images.