Okay, so why is it correct to divide it this way?
Well, we're saying that the parameters are perfectly divided among all the GPUs in a rack.
The layers are perfectly divided amongst the different racks.
So that works here.
And somehow, I'll hand-wave exactly how, we can arrange the same perfect sharding of the contexts: across GPUs within a rack, and by layer across racks.
And sorry, four is the number of racks.
Yeah, for example.
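To make the arithmetic concrete, here's a minimal sketch of the two-axis sharding being described, matching the four-racks example. All names and sizes here are hypothetical, chosen only for illustration:

```python
# Hypothetical sketch of the two-axis sharding described above.
# All names and sizes are illustrative, not from the actual system.

NUM_RACKS = 4            # layers are divided across racks (pipeline axis)
GPUS_PER_RACK = 8        # parameters are divided within a rack (tensor axis)

NUM_LAYERS = 32                        # assumed divisible by NUM_RACKS
PARAMS_PER_LAYER = 1_000_000           # illustrative per-layer parameter count
KV_BYTES_PER_TOKEN_PER_LAYER = 4096    # illustrative KV-cache footprint

# Layers are perfectly divided among the racks...
layers_per_rack = NUM_LAYERS // NUM_RACKS

# ...and each rack's parameters are perfectly divided among its GPUs.
params_per_gpu = layers_per_rack * PARAMS_PER_LAYER // GPUS_PER_RACK

# The claim is that the contexts (the KV cache) shard along the same
# two axes: a GPU holds only the cache for its rack's layers, and only
# its 1/GPUS_PER_RACK shard of those layers.
def kv_bytes_per_gpu(context_tokens: int) -> int:
    per_layer = context_tokens * KV_BYTES_PER_TOKEN_PER_LAYER
    return layers_per_rack * per_layer // GPUS_PER_RACK

print(params_per_gpu)                         # per-GPU parameter shard
print(kv_bytes_per_gpu(context_tokens=2048))  # per-GPU KV-cache shard
```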
So...
This is the place where we actually need to go back and analyze this batch size B. And you were making this comment that there's micro-batching versus global batching.
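For the micro-batching side of that comment, here is a minimal sketch, again with illustrative sizes that are not from the transcript: the global batch B is split into one micro-batch per pipeline stage, so that every stage can be working on a different micro-batch at once.

```python
# Hypothetical sketch of global batching vs. micro-batching for the
# pipeline. All sizes are illustrative, not from the transcript.

GLOBAL_BATCH_B = 64      # the batch size B being analyzed
NUM_STAGES = 4           # one pipeline stage per rack, as above

# One micro-batch per stage keeps every stage busy instead of letting
# it sit idle while a single big batch moves through the pipeline.
micro_batch_size = GLOBAL_BATCH_B // NUM_STAGES

def microbatches(batch):
    """Yield consecutive micro-batches of a global batch (a list)."""
    for i in range(0, len(batch), micro_batch_size):
        yield batch[i:i + micro_batch_size]

batch = list(range(GLOBAL_BATCH_B))   # stand-in for 64 sequences
for m, mb in enumerate(microbatches(batch)):
    print(f"micro-batch {m}: {len(mb)} sequences")
```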
So let's come back to this pipelining diagram here.
We've got one batch going forward here.
And then as I drew it, it kind of just like disappeared.
That's not really correct.
If you think about how decode is working, I have a bunch of tokens that I have generated already.
I do one forward pass where I generate a new token.
And then I write that to my KV cache, and then I do another forward pass that generates the next token.
So I'm actually going to be running this batch zero in a loop.
So in fact, I go forward.
Once I finish, I can start the next iteration of the loop up here.
Yeah.
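That decode loop is simple enough to write down directly. In this sketch, model_forward and the cache layout are hypothetical stand-ins, not a real API:

```python
# Minimal sketch of the decode loop described above. model_forward and
# the cache layout are hypothetical stand-ins, not a real API.

def decode(model_forward, prompt_tokens, kv_cache, num_new_tokens):
    """Autoregressive decode: one forward pass per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(num_new_tokens):
        # One forward pass over the latest token, attending to all
        # previously cached keys/values...
        next_token, new_kv = model_forward(tokens[-1], kv_cache)
        # ...then write the new key/value entries to the KV cache so
        # the next iteration of the loop can use them.
        kv_cache.append(new_kv)
        tokens.append(next_token)
    return tokens
```

In the pipelined picture, this loop is the dependency the diagram shows: batch zero's next iteration can only re-enter the first rack once its previous forward pass has come out the other end of the pipeline.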