Let's do that.
Okay.
So this is the inference diagram, and I'll call this forward, so we don't have the wrong thing showing up there.
So let's do the same thing for training now.
We've got a forwards pass, but at some stage, we're going to have to transition to a backwards pass.
So we'll do some number of batches in the forwards pass.
And then we're going to transition to the backwards pass for everyone all in one go.
So the inference part is the same here, but then we do a hard stop at this point and then transition everyone to backwards pass.
Similar numbering like this.
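To make that concrete, here's a minimal sketch (mine, not from the conversation; the stage and microbatch counts are made-up, and forward and backward are assumed to each take one time step) of the all-forwards-then-all-backwards schedule being described:

```python
# Minimal sketch of the schedule described above: every pipeline stage
# runs forward on all microbatches, hits the hard stop, then everyone
# flips to backward. Assumes forward and backward each take one time
# step; P and M are illustrative values.

P = 4  # pipeline stages (devices)
M = 6  # microbatches per batch

def gpipe_style_schedule(p, m):
    """Return {stage: {time_step: work_item}} for the naive schedule."""
    schedule = {s: {} for s in range(p)}
    # Forward phase: microbatch b reaches stage s at time s + b.
    for b in range(m):
        for s in range(p):
            schedule[s][s + b] = f"F{b}"
    # Hard stop: backward begins only after the last forward drains.
    start = m + p - 1
    # Backward phase flows in reverse, from the last stage to the first.
    for b in range(m):
        for s in range(p):
            schedule[s][start + (p - 1 - s) + b] = f"B{b}"
    return schedule

total_steps = 2 * (M + P - 1)
for s, row in gpipe_style_schedule(P, M).items():
    print(f"stage {s}: " + " ".join(row.get(t, "..") for t in range(total_steps)))
# The ".." slots are idle time -- the pipeline bubble discussed below.
```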
Yeah, I mean, smaller is always better, actually, is one way to put it.
From an ML convergence rate perspective, smaller is always better, because basically you're getting the freshest information from the gradient descent.
But from a total training time perspective?
From a total training time perspective, smaller is worse from a systems perspective, and so the optimum is the trade-off between those two.
So you pick a batch size, and then for that batch size, you do some amount forwards and then some amount backwards.
Yep.
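To put rough numbers on the systems side of that trade-off (my back-of-the-envelope arithmetic, not figures from the conversation): with p pipeline stages and m microbatches per batch, the naive schedule leaves each stage idle for about p - 1 of every m + p - 1 time steps, so bigger batches amortize the bubble.

```python
# Rough bubble overhead of the all-forwards-then-all-backwards schedule,
# assuming p equal-cost pipeline stages and m equal-cost microbatches.
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

for m in (4, 16, 64):
    print(f"p=8, m={m:>2}: bubble = {bubble_fraction(8, m):.0%}")
# p=8, m= 4: bubble = 64%   <- small batches waste most of the pipeline
# p=8, m=16: bubble = 30%
# p=8, m=64: bubble = 10%   <- large batches amortize the bubble
```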
You asked why there's even a hard stop.
With pipeline parallelism, because you've got this idle time here, which is the bubble, there are so many techniques in the literature for how to lay this out differently and avoid that.
There are more complicated schemes called like zero bubble or one forward, one backward, which sort of interleave the forwards and the backwards in complicated ways.
You can mine Bitcoin in that bubble.
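For flavor, here's a toy rendering of the one-forward-one-backward idea (my own simplification, not any particular framework's scheduler): after a short warm-up, each stage alternates one forward microbatch with one backward microbatch instead of draining the whole pipeline first.

```python
# Toy 1F1B ("one forward, one backward") work order for each pipeline
# stage -- a simplified sketch of the interleaving, not a real scheduler.
def one_f_one_b(stage, num_stages, num_microbatches):
    warmup = min(num_stages - stage - 1, num_microbatches)
    order = [f"F{b}" for b in range(warmup)]  # warm-up forwards
    f, b = warmup, 0
    while b < num_microbatches:  # steady state: alternate 1F / 1B
        if f < num_microbatches:
            order.append(f"F{f}")
            f += 1
        order.append(f"B{b}")
        b += 1
    return order

for s in range(4):
    print(f"stage {s}: " + " ".join(one_f_one_b(s, 4, 6)))
# stage 0: F0 F1 F2 F3 B0 F4 B1 F5 B2 B3 B4 B5
# stage 3: F0 B0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5
```

One consequence of the interleaving is that a stage only ever holds activations for a handful of in-flight microbatches at a time, rather than for every microbatch in the batch.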