Gwern Branwen
And that made me really angry.
I had to write all this stuff up.
Someone was wrong on the Internet.
I think for the most part, they were suffering from two issues.
First, I think they hadn't really been paying attention to all of the relevant scaling results that had come before.
They hadn't really appreciated the fact that, for example, AlphaZero was discovered in part by DeepMind doing Bayesian optimization on hyperparameters and noticing that you could just get rid of more and more of the tree search and get better models.
That was a critical insight, I think, which could only have been gained by having so much compute power that you could afford to train many, many versions and see the difference it made.
Similarly, I think those people simply didn't know about the Baidu paper on scaling laws from 2017, which showed that the scaling laws just keep going, practically forever.
It should have been the most important paper of the year, but I think that a lot of people just didn't prioritize it.
It didn't have any immediate implications, and so it just got forgotten.
People were too busy discussing Transformers or AlphaZero or something at the time to really notice it.
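To make "scaling laws" concrete: the general form reported in that line of work is a power law in dataset size, roughly

$$\varepsilon(D) \approx \alpha \, D^{-\beta} + \gamma,$$

where $\varepsilon$ is test error, $D$ is the amount of training data, $\beta$ is an empirically fitted exponent, and $\gamma$ is an irreducible error floor. The symbols here are generic notation for this family of results, not the paper's own; the point is that the same straight line on a log-log plot kept holding across every order of magnitude of data the experiments could reach.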
So that was one issue.
And I think the other issue is that they shared the basic error I had been making myself: believing that algorithms were more important than compute.
This was in part, I think, due to a systematic falsification of the actual origins of ideas in the research literature.
Papers don't tell you where the ideas come from in a truthful manner, right?
They just tell you a nice-sounding story about how it was discovered.
They don't tell you how it's actually discovered.
And so even if you appreciate the role of trial and error and compute power in your own experiments as a researcher, you probably just think, oh, I got lucky that way.
My experience is unrepresentative.
Over in the next lab, they do things by the power of thought and deep insight.