Reiner Pope
Thanks.
You can think of this as a schedule for the train.
A new train departs every 20 milliseconds.
Any passengers who are ready board the train.
If the train is full, then they wait for the next train.
If the train is not full, the train's going to go anyway.
And so in terms of what that means for queuing latency, it means that the worst case is that a request arrives just after the train departed.
It has to wait for the next train, so that's up to 20 milliseconds, and then it has to wait for that train to complete.
And so the worst case latency is 40 milliseconds.
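As a rough sketch of that worst case, assuming the fixed 20-millisecond departure window and a 20-millisecond batch duration described above (the function name and parameters here are just illustrative):

```python
# Worst-case queuing latency under a fixed "train schedule" batching model.
# Assumes a batch (train) departs every `window` seconds and each batch takes
# `batch_time` seconds to complete; names are illustrative, not from the talk.

def worst_case_latency(window: float, batch_time: float) -> float:
    # A request that arrives just after a departure waits almost a full
    # window for the next train, then waits for that train to finish.
    return window + batch_time

if __name__ == "__main__":
    # 0.020 s wait for the next train + 0.020 s for it to complete = 0.040 s
    print(worst_case_latency(window=0.020, batch_time=0.020))  # 0.04 (40 ms)
```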
So how is the 20 milliseconds derived?
I mean, it's a rule of thumb, and I haven't fully explained where it comes from yet, but so far we've focused on memory bandwidth and compute time.
When we look at memory, the other consideration is that we want to use all of the memory capacity we have.
And so generally we're going to use all of that memory capacity to store the weights or the KVs.
And so, in the time of doing a forward pass, we want to read all of that memory capacity into the chip.
And so that is capacity divided by bandwidth.
That tends to be 20 milliseconds on many different generations of HBM.
Yeah.
So for example, I mean, on I think the Rubin generation, it is something like 288 gigabytes divided by 20 terabytes per second.
And this looks like it comes out to about 15 milliseconds.
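Working through that arithmetic with the approximate figures quoted above (288 GB of HBM at 20 TB/s; these are the speaker's rough numbers, not official specs):

```python
# Time to stream all of HBM through the chip once: capacity / bandwidth.
# Figures are the approximate ones quoted above and are illustrative only.

capacity_bytes = 288e9         # ~288 GB of HBM capacity
bandwidth_bytes_per_s = 20e12  # ~20 TB/s of HBM bandwidth

read_time_s = capacity_bytes / bandwidth_bytes_per_s
print(f"{read_time_s * 1e3:.1f} ms")  # ~14.4 ms, i.e. roughly 15 ms
```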