John Schulman
I don't think anyone has a good explanation of the scaling law with parameter count.
I mean, there are some. I don't even know what the best sort of mental model is for this. Clearly you have more capacity if you have a bigger model, so you should eventually be able to get lower loss. But why are bigger models more sample efficient? I can give you some very sketchy explanations.
You could say that the model is sort of an ensemble of a bunch of different circuits that do the computation.
You could imagine that it has a bunch of computations that it's doing in parallel, and the output is a weighted combination of them.
If you have more width of the model, or really just more of it in general... I mean, actually, width is somewhat similar to depth, because with residual networks, depth can do something similar to width in terms of updating what's in the residual stream.
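(A minimal numpy sketch of that width-vs-depth point; the dimensions, block count, and block functions are toy choices of mine, not anything from the conversation. When each residual block's contribution is small, applying blocks in sequence adds terms to the residual stream much like summing the same blocks in parallel would.)

```python
# Toy illustration: for residual blocks with small contributions, depth
# (sequential updates to the residual stream) behaves much like width
# (parallel contributions summed into the stream).
import numpy as np

rng = np.random.default_rng(0)
d = 16          # residual stream width (hypothetical)
n_blocks = 8    # hypothetical number of residual blocks
scale = 0.01    # keep each block's contribution small

Ws = [scale * rng.standard_normal((d, d)) for _ in range(n_blocks)]

def block(W, x):
    # A small random "circuit": linear map plus nonlinearity.
    return np.tanh(x @ W)

x0 = rng.standard_normal(d)

# Depth: apply blocks sequentially, each adding into the residual stream.
x_deep = x0.copy()
for W in Ws:
    x_deep = x_deep + block(W, x_deep)

# Width: apply all blocks in parallel to the original input and sum.
x_wide = x0 + sum(block(W, x0) for W in Ws)

# Relative gap between the two is small when contributions are small.
print(np.linalg.norm(x_deep - x_wide) / np.linalg.norm(x_deep - x0))
```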
Anyway, you could argue that you're learning all these different computations in parallel, and you just have more of them with the bigger model.
So you have more chances that one of them is lucky, that it ends up guessing correctly a lot and getting up-weighted.
That's kind of like... there are some algorithms that work this way, like some kind of mixture model or a multiplicative weight update algorithm. There are algorithms that basically work like this, where you have a weighted combination of experts with some learned gating. I don't want to say mixture of experts, because that means something different, and I may be putting it slightly wrong, but you could imagine something like that, and just having a bigger model gives you more chances to get the right function.
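(Here's a toy multiplicative-weights sketch of that "more chances to get the right function" point. The random linear "experts", dimensions, and learning rate are hypothetical choices of mine, not anything Schulman describes; it just shows that a larger pool of random circuits is more likely to contain one close to the target, so the weighted mixture reaches low error sooner.)

```python
# Multiplicative weights over a pool of fixed random "circuits":
# up-weight experts that predict the target well, and compare how
# quickly small vs. large pools reach low error.
import numpy as np

rng = np.random.default_rng(0)
dim, steps, eta = 10, 200, 0.5
w_true = rng.standard_normal(dim)        # hypothetical target function

def run(n_experts):
    experts = rng.standard_normal((n_experts, dim))   # fixed random circuits
    weights = np.ones(n_experts) / n_experts
    losses = []
    for _ in range(steps):
        x = rng.standard_normal(dim)
        y = np.sign(w_true @ x)                       # binary label
        preds = np.sign(experts @ x)
        mixture_pred = np.sign(weights @ preds)       # weighted vote
        losses.append(float(mixture_pred != y))
        # Multiplicative weight update: shrink experts that were wrong.
        weights *= np.exp(-eta * (preds != y))
        weights /= weights.sum()
    return np.mean(losses[-50:])                      # late-training error

# A bigger pool is more likely to contain a lucky circuit that matches
# the target, so it ends up with lower error for the same data budget.
for n in (10, 100, 10_000):
    print(n, run(n))
```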
And then of course, it's not just that you have a bunch of totally disjoint functions that you're taking a linear combination of.
It's more like a library where you might chain the functions together in some way.
So there's some composability.
So I would just say the bigger model has a bigger library of different computations, including lots of stuff that's kind of dormant and only being used some of the time.
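(A toy sketch of that library picture, with hypothetical primitives I chose purely for illustration: because the pieces compose, the number of reachable computations grows quickly with library size, and any single task uses only a few chains while the rest stay dormant.)

```python
# A "library" of small reusable computations that can be chained together.
# Bigger libraries reach many more distinct functions via composition.
from itertools import product

primitives = {                     # hypothetical reusable "circuits"
    "inc": lambda x: x + 1,
    "dbl": lambda x: 2 * x,
    "neg": lambda x: -x,
    "sq":  lambda x: x * x,
}

def compose(names):
    """Chain the named primitives left to right into one function."""
    def f(x):
        for name in names:
            x = primitives[name](x)
        return x
    return f

def distinct_behaviours(library_names, depth):
    """Count distinct input->output maps reachable with chains of `depth`."""
    probe = range(-3, 4)           # small probe set to fingerprint functions
    return len({tuple(compose(chain)(x) for x in probe)
                for chain in product(library_names, repeat=depth)})

for k in (2, 3, 4):                # growing library sizes
    names = list(primitives)[:k]
    print(k, "primitives ->", distinct_behaviours(names, depth=3), "functions")
```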