Reiner Pope
Okay.
So what is this micro-batching that shows up in pipeline parallelism?
So, I'll focus on inference first.
It's a slightly simpler problem.
And I'm going to draw this out: this axis is time, and this axis is which rack we're on.
And so the idea is that maybe I'll have four racks.
So I've got an inference that is going to step through these four racks over time, like this.
So great, this is inference number zero.
It runs at a certain batch size, and it steps through all the pipeline stages like this.
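(To make the picture being drawn concrete, here's a minimal sketch of that timeline in Python; the four-stage count and unit-length time steps are illustrative assumptions, not details from the recording.)

```python
# A minimal sketch of the diagram described above: one inference, batch 0,
# stepping through four pipeline stages -- one rack per stage, one stage per
# time step. The four-stage count and unit time steps are assumptions.

NUM_STAGES = 4  # four racks, one pipeline stage each

def single_inference_schedule(num_stages):
    """grid[stage][t] holds which inference a stage runs at time t ('.' = idle)."""
    grid = [["." for _ in range(num_stages)] for _ in range(num_stages)]
    for t in range(num_stages):
        grid[t][t] = "0"  # stage t is busy with inference 0 only at time t
    return grid

for stage, row in enumerate(single_inference_schedule(NUM_STAGES)):
    print(f"rack {stage}: " + " ".join(row))
# rack 0: 0 . . .
# rack 1: . 0 . .
# rack 2: . . 0 .
# rack 3: . . . 0
```

Each rack is busy for just one of the four steps, which is the three-quarters idle time mentioned next.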
Now, if we were to say, well, we're going to run inference number one here, this is clearly a massive waste, right?
Like three quarters of the time, each of the racks is doing nothing.
So we don't actually run inference one here.
We run it as soon as we can, which is immediately after inference zero finishes like this.
And then we keep going.
So if we hadn't filled this in, we would call this the pipeline bubble.
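(Again as a sketch, not the speaker's own code: micro-batch m reaches stage s at step m + s, and with S stages and M micro-batches the run takes M + S - 1 steps, so the leftover bubble fraction is (S - 1)/(M + S - 1). The specific stage and micro-batch counts below are illustrative assumptions.)

```python
# A sketch of the filled-in schedule: micro-batch m enters stage s at time
# m + s, so every stage stays busy once the pipeline is full. The remaining
# idle (bubble) fraction is (S - 1) / (M + S - 1), shrinking as M grows.

NUM_STAGES = 4        # S: racks / pipeline stages
NUM_MICROBATCHES = 6  # M: micro-batches fed back-to-back

def pipelined_schedule(num_stages, num_microbatches):
    num_steps = num_microbatches + num_stages - 1
    grid = [["." for _ in range(num_steps)] for _ in range(num_stages)]
    for m in range(num_microbatches):
        for s in range(num_stages):
            grid[s][m + s] = str(m)  # micro-batch m reaches stage s at time m + s
    return grid

for stage, row in enumerate(pipelined_schedule(NUM_STAGES, NUM_MICROBATCHES)):
    print(f"rack {stage}: " + " ".join(row))

busy = NUM_MICROBATCHES * NUM_STAGES
total = (NUM_MICROBATCHES + NUM_STAGES - 1) * NUM_STAGES
print(f"bubble fraction: {1 - busy / total:.1%}")  # (S-1)/(M+S-1) = 33.3%
```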
When I've drawn it in this inference context, where we're only doing a forward pass, it's obvious: why would you do this stupid thing?
But in a training context, it's maybe less obvious.
But in the inference context, it's sort of really natural to make this change.
Yeah, let's do that.