Reiner Pope
Yeah, there's sort of two scenarios.
Why don't we pick a latency that is bigger than 15 milliseconds?
And if I think about what that means, it means I actually have time to read the HBM like twice.
Yep.
By the way, most HBM accesses are reads, not writes.
It's like, almost all reads, because the weight matrices are read-only, and then almost all of the KV cache accesses are reads.
So, let's say I run at 30 milliseconds, I can read all of HBM twice.
But what's the point of that?
Like, I don't want to read the weight matrices twice.
I don't want to read the KVs twice.
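To make that concrete, here is a minimal back-of-the-envelope sketch in Python. The HBM capacity and bandwidth values are illustrative assumptions chosen so that one full read of HBM takes roughly 15 milliseconds, matching the figure in the conversation; they are not measured hardware specs.

```python
# Back-of-the-envelope: how many full HBM sweeps fit in one decode step?
# Capacity and bandwidth are assumptions picked so one full read ~= 15 ms.

hbm_capacity_bytes = 96e9            # assumed HBM capacity (~96 GB)
hbm_bandwidth_bytes_per_s = 6.4e12   # assumed HBM bandwidth (~6.4 TB/s)

def hbm_sweeps_per_step(step_latency_s: float) -> float:
    """How many times the full HBM contents can be read in one step."""
    bytes_readable = hbm_bandwidth_bytes_per_s * step_latency_s
    return bytes_readable / hbm_capacity_bytes

for latency_ms in (15, 30):
    sweeps = hbm_sweeps_per_step(latency_ms / 1000)
    print(f"{latency_ms} ms step -> ~{sweeps:.1f} full HBM reads")
```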
I mean, sparsity shows up in model size, but beyond that, it only depends on sparsity, not on scale.
Yeah.
We can do a bit of analysis on this. You can think of it in terms of number of users, but maybe a more productive way to think of it is in terms of number of tokens per second.
So what does this batch size mean in terms of tokens per second of the system?
So...
Tokens per second is going to be equal to the batch size divided by the step time: we run a batch-many tokens, and then we do that every step, where the step time is this 15 to 20 millisecond number.
So this ends up being the batch size times about 60, like 64 times B. And so this ends up being around 2,000 times 64.
So like 128K tokens per second.
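As a quick sanity check on that arithmetic, here is a minimal sketch; the step time and batch size are the round numbers from the discussion, not measured values.

```python
# Throughput arithmetic from above: tokens/s = batch size / step time.
# Step time and batch size are the speaker's round numbers, not measurements.

step_time_s = 1 / 64   # ~15-20 ms per decode step, i.e. ~64 steps per second
batch_size = 2000      # tokens decoded per step (one per concurrent sequence)

tokens_per_second = batch_size / step_time_s        # = 2000 * 64
print(f"~{tokens_per_second:,.0f} tokens per second")  # ~128,000
```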
So this is sort of in more digestible units.
Like, it's hard to reason about concurrent users, but what is the global traffic for a system?