Demis Hassabis
Podcast Appearances
It's not just about repeating the same recipe.
At each new scale, you have to adjust the recipe.
And that's a bit of an art form in a way.
And you have to sort of almost get new data points.
If you try and extend your predictions, extrapolate them, say several orders of magnitude out, sometimes they don't hold anymore, right?
Because new capabilities can show up as step functions, and some things hold and other things don't.
So often you do need those intermediate data points actually to correct some of your hyperparameter optimization and other things so that the scaling law continues to be true.
So there are various practical limitations on that.
So one order of magnitude is probably about the maximum that you want to do between each era.
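As a rough illustration of the extrapolation described above, here is a minimal sketch that fits a power law to a few small runs and then projects it forward. The power-law form, the function names `fit_power_law` and `predict`, and every number are assumptions made for illustration; none of them come from the speaker or from any real training run.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b), done in log-log space."""
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 10 ** intercept, -slope  # (a, b)

def predict(a, b, compute):
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * compute ** (-b)

# Hypothetical small-scale runs (made-up values, not real measurements).
compute = [1e18, 1e19, 1e20]
loss = [predict(10.0, 0.05, c) for c in compute]

a, b = fit_power_law(compute, loss)

# One order of magnitude beyond the largest run: the modest step described above.
one_step = predict(a, b, 1e21)

# Four orders of magnitude out: the formula still returns a number,
# but nothing in the fit can tell you whether the law still holds there.
far_out = predict(a, b, 1e24)
```

The fit itself cannot flag when the trend breaks, which is why a new measured point at each intermediate scale is used to refit before taking the next step.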
Yeah, the downstream capabilities sometimes don't follow. You can often predict the core metrics, like training loss or something like that.
But then it doesn't actually translate into MMLU or math or some other actual capability that you care about.
They're not necessarily linear all the time.
So there's sort of nonlinear effects.
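One toy way such a nonlinear effect can arise, offered purely as an assumption and not as the speaker's explanation: if a benchmark only counts an answer as correct when every token is right, a smoothly improving per-token accuracy can look like a sudden jump in the benchmark score. The function name `exact_match_rate` and the numbers below are hypothetical.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that all tokens are right, assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical, smoothly improving per-token accuracies across model scales.
per_token = [0.80, 0.85, 0.90, 0.95, 0.99]

# Exact-match rate on 20-token answers: near zero at first, then a sharp rise.
rates = [exact_match_rate(p, 20) for p in per_token]

for p, r in zip(per_token, rates):
    print(f"per-token accuracy {p:.2f} -> exact match {r:.3f}")
```

The core metric moves by small, even steps while the downstream score stays close to zero and then climbs quickly, so a prediction made from the core metric alone would miss the shape of the capability curve.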
Well, I wouldn't say there was one big surprise, but it was very interesting trying to train things at that size and learning about all sorts of things, from organization to how to babysit such a system and track it.
And I think things like getting a better understanding of the metrics you're optimizing versus the final capabilities that you want.
I would say that's still not a perfectly understood mapping, but it's an interesting one that we're getting better and better at.
I don't think that's the case.
I think that actually Gemini 1 used roughly the same amount of compute, maybe slightly more than what was rumored for GPT-4.
I don't know exactly what was used.