Reiner Pope
So if we had fewer micro-batches, we would have this idle time when we wrap around.
And you can sort of just visually see that the number of micro-batches is equal to the number of pipeline stages.
It's sort of proof by picture: there are four stages, and four micro-batches this way as well, and you can look and see that a micro-batch goes along here and then wraps around after the number of pipeline stages.
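The wrap-around claim can be checked with a toy simulation. This is a hedged sketch, not anything shown in the episode: it assumes a looped pipeline in which micro-batch j enters stage 0 at tick j, advances one stage per tick, and wraps from the last stage back to the first; `steady_state_utilization` is a made-up helper name.

```python
def steady_state_utilization(p: int, m: int, ticks: int = 1000) -> float:
    """Fraction of stage-ticks doing useful work in a looped pipeline.

    p: number of pipeline stages; m: number of micro-batches in flight (m <= p).
    Micro-batch j sits on stage (t - j) % p at tick t, so stage s is busy
    exactly when the slot index (t - s) % p belongs to one of the m live
    micro-batches.
    """
    busy = 0
    for t in range(ticks):
        for s in range(p):
            j = (t - s) % p  # which micro-batch slot this stage would hold now
            if j < m:        # slot actually occupied by a live micro-batch
                busy += 1
    return busy / (ticks * p)

# With as many micro-batches as stages, the loop stays full; with fewer,
# stages sit idle waiting for their micro-batch to come back around.
print(steady_state_utilization(4, 4))  # 1.0: no bubbles
print(steady_state_utilization(4, 2))  # 0.5: half the stage-ticks are idle
```

With m equal to p every stage is busy every tick, which is the "four and four" picture being described.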
For sure, this is done during massive-scale training.
It can be done for inference too, but I'm actually going to make the case for why it's less attractive there.
It is useful for the weights, but not so useful for the KVs.
The big challenge is, well, let's fill this in.
The number of micro-batches ends up being equal to the number of pipeline stages.
When we go back and substitute all of that into here,
we get the number of pipeline stages times this little b showing up in here.
Then, when we factor this out, I'm going to split this plus into two terms.
The weights term keeps the full division by e times p over here.
The activations term still has the division by e times p, but the p's cancel: this p and this p.
They canceled.
And so what we find is that as you increase the number of pipeline stages, the memory footprint for the weights keeps going down and down and down.
Of course.
But the memory footprint for the activations stays constant.
So it doesn't actually work.
Once you do enough pipelining, and it's really not much, even two stages is often enough, the weights term becomes very small, and most of your memory ends up being activations.
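The cancellation can be made concrete with a small numeric sketch. All symbols here are assumptions filled in from the discussion, not the actual formula on screen: W for total weight bytes, A for activation bytes per sample on a stage, b for micro-batch size, e for the other parallel axis being divided by, and p for pipeline stages, with p micro-batches in flight.

```python
def per_device_memory(W: float, A: float, b: int, e: int, p: int):
    """Per-device memory split into a weights term and an activations term.

    Assumed model: weights are sharded over e * p devices, while each device
    holds activations for all p in-flight micro-batches of size b, so the p
    in the numerator cancels the p in the denominator.
    """
    weights = W / (e * p)                # shrinks as p grows
    activations = (p * b) * A / (e * p)  # = b * A / e, independent of p
    return weights, activations

# Hypothetical numbers, just to show the trend.
for p in (1, 2, 4, 8):
    w, a = per_device_memory(W=100.0, A=1.0, b=8, e=1, p=p)
    print(p, w, a)  # weights halve each time p doubles; activations stay at 8.0
```

This is the whole argument in miniature: the weights column keeps falling, the activations column never moves, so past a small p the activations dominate.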