Reiner Pope
The KV cache becomes the dominant term.
Yeah, you only need to store like one layer rather than two layers of KVs, right?
Yeah.
So it helps from that perspective.
Yeah.
What's competing with that, though, is that you need to be keeping all of the racks usefully busy at a time.
And so the number of sequences that are in flight simultaneously has gone up.
Yeah, yeah, yeah.
Makes sense, makes sense, makes sense.
So those exactly cancel, and you end up not getting a saving per GPU.
Well, so first we said you can't amortize KV caches across batch size.
And now we're saying you also can't shard it across pipeline stages.
It sucks from both of those points of view.
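That cancellation can be sketched with a rough memory model (the sizes below are hypothetical, just to make the arithmetic concrete): each pipeline stage holds fewer layers, but keeping every stage busy means more sequences in flight, and the two effects multiply out to the same per-GPU KV footprint.

```python
def kv_bytes_per_gpu(pipeline_stages, layers=64, seq_len=4096,
                     kv_heads=8, head_dim=128, bytes_per_elem=2,
                     seqs_per_stage=32):
    """Rough per-GPU KV-cache footprint under pipeline parallelism.

    All model sizes here are illustrative placeholders, not any
    particular model's configuration.
    """
    # Each pipeline stage only holds a slice of the layers...
    layers_per_gpu = layers // pipeline_stages
    # ...but to keep all stages usefully busy, the number of
    # sequences in flight grows with the number of stages.
    in_flight = seqs_per_stage * pipeline_stages
    # K and V, per token, per layer.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem
    return in_flight * seq_len * layers_per_gpu * per_token_per_layer

# Doubling pipeline depth halves the layers per GPU but doubles the
# sequences in flight: the per-GPU KV-cache bytes come out identical.
assert kv_bytes_per_gpu(4) == kv_bytes_per_gpu(8) == kv_bytes_per_gpu(16)
```

The `pipeline_stages` terms cancel exactly in the product, which is the point being made: pipelining doesn't shard the KV cache in any useful sense.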
Yeah, yeah, yeah.
Interesting.
Okay, so then what does that look like during inference?
So, I mean, the DeepSeek paper reports what they do, which is they just do a lot of expert parallelism.
In effect, you should increase your expert parallelism up to your scale-up domain size, and then do very little pipelining.
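A minimal sketch of that layout rule (hypothetical GPU counts and expert counts, not the DeepSeek configuration itself): fill the scale-up domain with expert parallelism, so the all-to-all expert dispatch stays on the fast interconnect, and use whatever factor remains for pipelining across domains.

```python
def parallelism_plan(total_gpus, scale_up_domain, num_experts):
    """Illustrative layout: expert parallelism first, pipelining last.

    Assumes expert parallelism is capped by the scale-up domain size,
    the expert count, and the total GPU count; the leftover factor
    becomes pipeline stages across domains.
    """
    expert_parallel = min(scale_up_domain, num_experts, total_gpus)
    pipeline_stages = max(1, total_gpus // expert_parallel)
    return {"expert_parallel": expert_parallel,
            "pipeline_stages": pipeline_stages}

# e.g. 128 GPUs in scale-up domains of 64, with 256 routed experts:
plan = parallelism_plan(128, 64, 256)
# expert parallelism fills the domain (64-way); only 2 pipeline stages remain
```

The design choice this encodes is the one from the conversation: since pipelining buys you no per-GPU KV-cache saving anyway, push the parallelism that does help (expert parallelism) as far as the fast interconnect reaches, and keep the pipeline shallow.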