Reiner Pope
is equal to scale-up size times memory bandwidth per GPU. The per-GPU bandwidth term doesn't increase a lot, maybe 1.5 or 2x per generation, but the scale-up size increased by something like a factor of eight. So the reason a bigger scale-up matters is not the memory capacity of the whole scale-up domain, but really the memory bandwidth. Pipelining totally solves the capacity problem, but scale-up size is what helps solve the bandwidth problem.
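The arithmetic here can be sketched in a few lines. All the numbers below are illustrative assumptions, not vendor specs; the point is only that the scale-up-size factor dominates the per-GPU improvement.

```python
# Relation from above: total scale-up bandwidth = scale-up size x per-GPU bandwidth.
# Numbers are hypothetical, for illustration only.

def aggregate_bandwidth_tbps(scale_up_size: int, per_gpu_bw_tbps: float) -> float:
    """Total HBM bandwidth reachable within one scale-up domain, in TB/s."""
    return scale_up_size * per_gpu_bw_tbps

# Per-GPU bandwidth grows slowly (roughly 1.5-2x per generation),
# while scale-up size can jump by ~8x between systems.
old_gen = aggregate_bandwidth_tbps(8, 3.0)    # e.g. an 8-GPU box, hypothetical 3 TB/s per GPU
new_gen = aggregate_bandwidth_tbps(64, 5.0)   # e.g. a 64-GPU domain, hypothetical 5 TB/s per GPU
print(new_gen / old_gen)  # ~13x aggregate, most of it from the 8x in scale-up size
```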
Yeah.
The first thing it does is let you run the model at lower latency.
If I just do a very sparse model and it's on a little H100 box, the latency will be really high.
Yeah.
This is a place where we have to do a bit of guesswork, because the updated scaling laws and the model traffic numbers are not reported.
And so we have to guess there.
But one way to look at it, let me first just make a sort of a general heuristic claim.
Suppose I've got a total cost, which is a sum of cost A and cost B; maybe A is the training cost and B is the inference cost, and I want to minimize this sum.
For many curves that tend to come up, the minimum tends to be where the two costs are equalized.
That's something of a heuristic claim, but there are many examples where it's true, like where one cost is 1/x and the other is x.
They tend to be minimized at the point where they equal each other.
It's also true for e to the x and e to the minus x and all kinds of other things.
So basically, I've got some curve that's going down, some other curve that's going up, and they tend to be minimized at this equal point.
Heuristically, I'll conjecture that it's true for the setup you described as well. Actually showing it would require looking at the scaling laws and fitting these weird exponents.
But things that do follow power laws tend to have this property.
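The equal-cost heuristic can be checked numerically with the 1/x plus x example mentioned above. The constants a and b here are arbitrary choices for illustration.

```python
import math

def total_cost(x: float, a: float = 4.0, b: float = 1.0) -> float:
    """Sum of a decreasing cost a/x (think training) and an increasing cost b*x (think inference)."""
    return a / x + b * x

# Calculus: the minimizer is x* = sqrt(a/b), and there both terms equal sqrt(a*b).
a, b = 4.0, 1.0
x_star = math.sqrt(a / b)                      # 2.0
assert math.isclose(a / x_star, b * x_star)    # the two costs equalize at the minimum

# Confirm numerically on a coarse grid.
grid = [0.1 * i for i in range(1, 100)]
best = min(grid, key=total_cost)
print(best)  # close to x* = 2.0
```

The same equalization happens for exp(x) + exp(-x), whose minimum sits at x = 0, where both terms equal 1.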