Reiner Pope
Maybe none at all, maybe two, just enough to make the weight storage not too big of an issue.
Those are the only two parallelisms that really make sense.
In the past, there was tensor parallelism, which was cutting up within an expert, but the experts are so small now that that is not a profitable optimization.
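(A back-of-envelope sketch of why slicing within an expert buys little once experts are small. All sizes here are hypothetical, not figures from the conversation.)

```python
# Why tensor parallelism within an expert stops paying off when experts
# are small. Every number below is a hypothetical illustration.

BYTES_PER_PARAM = 2                      # e.g. bf16 weights
d_model = 8192                           # hypothetical model width
d_ff = 4 * d_model                       # hypothetical expert FFN width
params_per_expert = 2 * d_model * d_ff   # up- and down-projection matrices

expert_bytes = params_per_expert * BYTES_PER_PARAM
chip_hbm_bytes = 96e9                    # hypothetical per-chip HBM

print(f"one expert: {expert_bytes / 1e9:.2f} GB")
print(f"fraction of one chip's HBM: {expert_bytes / chip_hbm_bytes:.2%}")

# If a whole expert occupies only ~1% of a chip's memory, slicing it
# further adds communication without relieving any real constraint,
# so placing whole experts per chip (expert parallelism) is the
# natural split.
```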
Yes.
Yeah, I mean, you can look at how it depends on model size.
Like, you could have a very large model, like one that exceeds the memory of a rack, and there you should be doing a bit of pipelining.
Maybe it's extremely sparse, for example, and that would be a reason to do it.
Actually, so pipelining doesn't help with context length.
It totally helps with model size.
And so because of the ability to do pipelining, at least a rack should not be a constraint on your ability to fit the model parameters.
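(A rough sketch of that claim: pipeline stages as the escape hatch when weights exceed one rack. All capacities below are assumptions for illustration.)

```python
import math

# Pipeline parallelism lets weight storage span multiple racks, so a
# single rack's memory need not bound model size. Hypothetical numbers.

model_params = 10e12             # hypothetical 10T-parameter model
bytes_per_param = 2              # bf16
model_bytes = model_params * bytes_per_param

chips_per_rack = 64              # hypothetical rack configuration
hbm_per_chip = 96e9
rack_bytes = chips_per_rack * hbm_per_chip

# Number of pipeline stages (racks) needed just to hold the weights.
stages = math.ceil(model_bytes / rack_bytes)
print(f"model: {model_bytes / 1e12:.1f} TB, rack: {rack_bytes / 1e12:.2f} TB")
print(f"pipeline stages needed: {stages}")

# Note: this only covers weight storage. KV-cache memory grows with
# context length and batch size, which pipelining does not address.
```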
I guess the other consideration you're asking about is: why hasn't it scaled up more, and why would bigger scale-up domains help?
So we talked through one aspect of that: we said it's not because of memory capacity.
Pipelining gives us a solution to memory capacity, at least with respect to model size, though not with respect to KV cache size.
The other issue that shows up is latency.
This is very much dependent on the hardware.
It's... I can't say with a lot of authority.
I think it's probably on the order of a few milliseconds, but I could be off by an order of magnitude.
Yeah.
Okay, so that's not that much.
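(A quick sanity check on "not that much": comparing a few milliseconds of latency against the per-token budget implied by an interactive generation speed. Both numbers are assumptions, not figures from the conversation.)

```python
# How a few milliseconds compares to the time budget per generated token.

tokens_per_second = 50           # hypothetical interactive decode speed
per_token_budget_ms = 1000 / tokens_per_second   # 20 ms per token

latency_ms = 3                   # "a few milliseconds", per the discussion
share = latency_ms / per_token_budget_ms

print(f"per-token budget: {per_token_budget_ms:.0f} ms")
print(f"latency share of budget: {share:.0%}")   # ~15% under these numbers
```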