Reiner Pope
Great.
I mean, to jump a little bit to the conclusion, the big effect is batch size, but what we're going to do now is quantify exactly what that looks like and what its implications are for latency and cost.
There's going to be another effect, which you can call speculative decoding or multi-token prediction.
We can maybe come back to that later, but I think the first thing that we'll talk through is batch size.
So what I'd like to introduce is sort of the two principles of analysis.
Firstly, we're going to look at a roofline analysis of how I run a transformer model on a cluster of chips.
We'll take, let's say, a Blackwell NVL72 cluster, so a rack of 72 GPUs.
And so the roofline analysis means we look at memory bandwidth and compute performance.
And then the other side of that is that we're going to look at just two simple factors of the model: the time to operate on the weights, and then the time to operate on the context, the KV cache.
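To make those two factors concrete, here is a minimal Python sketch of the two memory terms. The model shape (a hypothetical dense 70B-parameter model) and the FP8 byte counts are illustrative assumptions, not figures from this discussion.

```python
# A minimal sketch of the two memory terms: streaming the weights and
# streaming one sequence's KV cache. The model shape and FP8 byte counts
# are illustrative assumptions, not figures from this discussion.

def weight_bytes(n_params: float, bytes_per_param: float = 1.0) -> float:
    """Bytes read to stream the model weights once (FP8 -> 1 byte/param)."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: float = 1.0) -> float:
    """Bytes read to stream one sequence's KV cache (a K and a V per layer)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Hypothetical dense 70B model: 80 layers, 8 KV heads of dim 128, 32k context.
w = weight_bytes(70e9)
kv = kv_cache_bytes(80, 8, 128, 32_000)
print(f"weights: {w / 1e9:.0f} GB, KV cache per sequence: {kv / 1e9:.1f} GB")
```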
So let's jump in.
What we're going to try and do is we're going to try and estimate the time that it takes to run an inference of a certain shape.
Now, we're not perfect here.
We can't exactly predict the time.
And so instead, we're going to approximate.
And so we're going to say that the time must be greater than or equal to a certain quantity.
And so we're going to consider two different aspects.
We're going to look at the time it takes to do the memory fetches and then the time it takes to do the compute.
And it'll turn out that this actually gives us very strong predictive power, even with such a simple model.
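As a sketch of that lower bound, the snippet below takes the max of the memory time and the compute time for one decode step. The per-GPU bandwidth and FLOPs figures are rough assumptions for an NVL72-class rack, not quoted specs, and the model sizes reuse the hypothetical 70B example above.

```python
# A hedged roofline lower bound for one decode step:
#     time >= max(memory time, compute time).
# Hardware figures are rough assumptions for an NVL72-class rack, not a
# spec sheet, with the model sharded across all 72 GPUs so bandwidth and
# FLOPs aggregate over the rack.

N_GPUS = 72
HBM_BW = N_GPUS * 8e12       # bytes/s: ~8 TB/s HBM bandwidth per GPU (assumed)
PEAK_FLOPS = N_GPUS * 5e15   # FLOP/s: ~5 PFLOP/s dense FP8 per GPU (assumed)

WEIGHT_BYTES = 70e9          # hypothetical 70B model, FP8, 1 byte per param
KV_BYTES_PER_SEQ = 5.2e9     # ~32k-token KV cache for one sequence (from above)
FLOPS_PER_TOKEN = 2 * 70e9   # ~2 FLOPs per parameter per generated token

def decode_step_lower_bound(batch: int) -> float:
    """Lower bound on seconds per decode step at a given batch size.

    The weights are read once per step and amortized across the batch,
    while every sequence streams its own KV cache.
    """
    memory_time = (WEIGHT_BYTES + batch * KV_BYTES_PER_SEQ) / HBM_BW
    compute_time = batch * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(memory_time, compute_time)

for batch in (1, 32, 1024):
    print(f"batch {batch:5d}: >= {decode_step_lower_bound(batch) * 1e3:.2f} ms/step")
```

On these assumed numbers, every batch size shown is memory-bound rather than compute-bound, and at large batch the KV-cache reads dominate the weight reads, which is exactly why batch size ends up being the big lever.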
So one by one, what is the time that it takes to do the compute?