Yeah, right, right.
More usefully, you can do the weight gradient step, but you can also mine Bitcoin.
In inference, the effect of pipelining on anything you care about, like batch size or latency, is actually neutral.
It doesn't improve it.
It doesn't make it worse.
So if you look at the latency of this inference when it's pipelined versus when it's all on one rack: if it were all on one rack, we would just slide all of the boxes down and still put them in a row, and the latency would be the same.
So pipelining is neither better nor worse for latency. But it does mean that you use less memory per rack, in terms of memory capacity, because now instead of needing the whole model, you only need a quarter of the model. So it's basically a no-brainer to use pipelining during inference.
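To make the latency-neutrality argument concrete, here's a minimal back-of-the-envelope sketch in Python. It assumes decode is memory-bandwidth-bound, so the time per stage is just the weight bytes streamed divided by memory bandwidth; the weight size, per-rack bandwidth, and pipeline depth below are illustrative placeholders, not real hardware specs.

```python
# Per-token decode latency with and without pipelining, assuming decode is
# memory-bandwidth-bound: time = weight bytes streamed / memory bandwidth.
# All numbers below are illustrative assumptions, not real hardware specs.

weight_bytes = 1e12        # ~1 TB of weights (e.g. ~1T params at 1 byte each)
rack_bandwidth = 500e12    # assumed aggregate HBM bandwidth of one rack, bytes/s
stages = 4                 # pipeline depth (number of racks)

# One rack: it streams all of the weights for every token.
latency_one_rack = weight_bytes / rack_bandwidth

# Pipelined: each rack streams only 1/stages of the weights, but a token
# passes through all stages in sequence, so the per-stage times add back up.
latency_pipelined = stages * (weight_bytes / stages) / rack_bandwidth

print(latency_one_rack, latency_pipelined)       # identical: latency is unchanged
print(weight_bytes / stages / 1e12, "TB/rack")   # but each rack holds only 1/stages of the weights
```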
That said, even in inference, it's actually not used a ton. It does reduce your memory capacity requirements, but there's actually already a huge surplus of memory capacity.
I think you're saying that a rack of Blackwell has many, many terabytes, maybe tens of terabytes.
That's much bigger than a trillion-parameter model. A trillion-parameter model only needs about one terabyte at one byte per parameter.
And so it already fits, in fact.
And so there's not a huge benefit from pipelining, because you're reducing a number that's already pretty small.
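As a sanity check on the "it already fits" claim, here's the arithmetic with assumed rack specs; the 72-GPU count and per-GPU HBM figure are placeholders in the spirit of a Blackwell NVL72-class rack, not quoted specifications.

```python
# Does a trillion-parameter model fit in one rack? Illustrative numbers only.

params = 1e12                 # 1T parameters
bytes_per_param = 1           # ~1 byte per param at 8-bit inference precision (assumed)
weight_bytes = params * bytes_per_param     # = 1 TB of weights

gpus_per_rack = 72            # assumed, NVL72-style rack
hbm_per_gpu = 192e9           # assumed ~192 GB of HBM per GPU
rack_hbm = gpus_per_rack * hbm_per_gpu      # ~13.8 TB per rack

print(weight_bytes / 1e12, "TB of weights")     # 1.0
print(rack_hbm / 1e12, "TB of HBM per rack")    # ~13.8 -> weights fit with a large surplus
```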
But it does suggest that theoretically you had too much memory, and maybe you could have built different hardware that has less memory.
If you were designing your hardware and you said, I actually didn't need that much memory, because I don't need the weights to fit in one rack, I can fit the weights across eight racks, then maybe I could have built hardware that didn't have so much HBM per GPU.
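Continuing the same assumed numbers, here's what pipelining across eight racks does to the per-GPU HBM that the weights alone demand; again a sketch, ignoring KV cache and activations.

```python
# Per-GPU HBM demanded by the weights alone, for 1 rack vs 8 racks.
# Continues the assumed numbers above; ignores KV cache and activations.

weight_bytes = 1e12     # ~1 TB of weights (assumed)
gpus_per_rack = 72      # assumed

for racks in (1, 8):
    per_gpu = weight_bytes / (racks * gpus_per_rack)
    print(f"{racks} rack(s): {per_gpu / 1e9:.1f} GB of weights per GPU")

# 1 rack(s): 13.9 GB of weights per GPU
# 8 rack(s): 1.7 GB of weights per GPU -> far below the ~192 GB assumed per GPU
```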
Yeah, so in the equations we had here before we erased them, we were looking at memory time and compute time, so memory bandwidth and compute bandwidth.
Let's now start looking at memory capacity.