Reiner Pope
Yeah.
So we'll start off with just memory capacity, without even thinking about the parallelism scheme.
And so the capacity of memory, or the demand on memory, starts with the total number of parameters.
So this is what we need to fit the weights in some system that we are using.
And then we need to fit the KVs as well.
So KVs go as batch size times the length of the context times the bytes of KV per token.
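To make that concrete, here is a minimal sketch of that total memory demand in Python. The function and variable names, and all the numbers in the usage line, are illustrative assumptions, not figures from the conversation.

```python
def total_memory_bytes(n_params, bytes_per_param,
                       batch_size, context_len, kv_bytes_per_token):
    """Total memory demand = weights + KV cache, per the formula above."""
    weights = n_params * bytes_per_param                       # fit all the parameters
    kv_cache = batch_size * context_len * kv_bytes_per_token   # KVs: B * L * bytes/token
    return weights + kv_cache

# Hypothetical numbers, purely for scale: a 500B-parameter model at 2 bytes
# per parameter, batch 32, 8192-token context, 1 MiB of KV per token.
print(total_memory_bytes(500e9, 2, 32, 8192, 2**20) / 2**40, "TiB")
```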
Okay, so...
What I was arguing about in this context, and the case I was making for pipelining, is that there are actually some techniques that allow us to solve this.
So let's consider, we're going to run this on some number of GPUs, and we're going to have one extent, E, which is going to be the expert parallelism.
So when we have this sharding of the expert layers across many GPUs, to what extent do we do that?
How many GPUs?
So we're going to say that this is, for example, 64.
And then P is going to be the extent of pipelining.
And so this is the number of racks, which, who knows, maybe we'll pick four or something like that.
What we want to calculate, so this is like the total memory requirement across the system.
But now I'm going to calculate a memory requirement per GPU.
So for the per-GPU memory requirement, I guess I'll use a lowercase mem.
And well, obviously we just take all of these numbers and divide them by E and P, really easy.
So it's this N_total plus the batch size times the length of the context times the bytes per token, all of this divided by E and P.
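Putting the pieces together, a sketch of that per-GPU calculation might look like the following. E = 64 and P = 4 are the example values from above; every other number is an assumed placeholder carried over from the earlier sketch.

```python
def per_gpu_memory_bytes(n_params, bytes_per_param,
                         batch_size, context_len, kv_bytes_per_token,
                         E, P):
    """Per-GPU memory: total demand divided by the product of
    expert parallelism E and pipeline parallelism P."""
    total = (n_params * bytes_per_param
             + batch_size * context_len * kv_bytes_per_token)
    return total / (E * P)

# Same hypothetical model as above, sharded over E = 64 expert-parallel
# GPUs and P = 4 pipeline stages, i.e. 256 GPUs in total.
print(per_gpu_memory_bytes(500e9, 2, 32, 8192, 2**20, E=64, P=4) / 2**30, "GiB")
```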