1.3B dense, yeah. Across these you end up seeing that, for example, in this case the 64-expert, 370-million-activated-parameter model is as good as a dense 1.3-billion model. So in some sense it's actually not amazing returns, where you need to increase total parameters a hundredfold to get the equivalent of...
Yeah, I mean, actually, even more so, yeah.
It's a huge increase in parameter count for a modest increase in... Yeah, so in this case, actually, what is it, 4x?
64x for 4x.
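(To make the "64x for 4x" exchange concrete, here is a rough back-of-the-envelope calculation using the numbers mentioned above. Treating all parameters as expert parameters is a simplifying assumption for illustration; in practice attention and embedding weights are shared across experts.)

```python
# Rough arithmetic for the MoE-vs-dense trade-off discussed above.
# Assumption: (almost) all parameters live in the experts, so total params
# scale linearly with the expert count.

activated_params = 0.37e9   # parameters touched per token (one expert's worth)
num_experts = 64
dense_equivalent = 1.3e9    # dense model of roughly the same quality

total_params = activated_params * num_experts            # everything resident
total_blowup = total_params / activated_params           # ~64x total vs. activated
quality_gain = dense_equivalent / activated_params       # ~3.5x ("4x") effective size

print(f"total params  ~= {total_params / 1e9:.1f}B")
print(f"~{total_blowup:.0f}x total parameters per activated parameter")
print(f"~{quality_gain:.1f}x dense-equivalent quality per activated parameter")
```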
So is that good or bad, actually?
Even from a memory point of view, keep in mind you are doubling this portion of the memory fetches, which is amortized by batch.
And so you just keep running at a larger batch size.
From the point of view of the analysis we've done here, this is pure win.
Keep doing it.
Keep doing it until you run out of available users, basically.
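(A minimal sketch of the amortization argument: the expert weights have to be streamed from memory once per forward pass regardless of how many tokens are in flight, so the per-token cost of those fetches shrinks as the batch grows. The bf16 weights and the batch sizes below are illustrative assumptions, not numbers from the conversation.)

```python
# Per-token weight traffic for the expert weights, as a function of batch size.
# Assumption: at large batch every expert receives at least one token, so all
# expert weights are fetched once per forward pass.

bytes_per_param = 2          # bf16 weights (assumption)
expert_params = 0.37e9       # one expert's worth of parameters
num_experts = 64

weight_bytes = num_experts * expert_params * bytes_per_param  # fetched once per pass

for batch_tokens in (1, 64, 1024, 16384):
    per_token_mb = weight_bytes / batch_tokens / 1e6
    print(f"batch = {batch_tokens:6d} tokens -> "
          f"{per_token_mb:10.1f} MB of expert-weight traffic per token")
```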
So there's actually this equivalence between...
If I want to go sparser, I need a lot of users; or, if I have a lot of users, I can go to a much sparser model.
So from that point of view, it's a reasonable trade-off.
The other trade-off that shows up here is memory capacity: so far we've only reasoned about being memory-bandwidth-bound, but it also consumes memory capacity.
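(A similarly rough sketch of the capacity point: only one expert's parameters are read per token, but all of them have to be resident in memory. The bf16 weights and the 80 GB of HBM per GPU are illustrative assumptions.)

```python
# Memory capacity needed just to hold the weights, MoE vs. the equivalent dense model.

bytes_per_param = 2          # bf16 weights (assumption)
hbm_per_gpu_gb = 80          # e.g. an 80 GB accelerator (assumption)

moe_total_params = 64 * 0.37e9   # all experts must be resident
dense_equiv_params = 1.3e9

moe_gb = moe_total_params * bytes_per_param / 1e9
dense_gb = dense_equiv_params * bytes_per_param / 1e9

print(f"MoE weights:   {moe_gb:5.1f} GB (~{moe_gb / hbm_per_gpu_gb:.2f} GPUs' worth of HBM)")
print(f"Dense weights: {dense_gb:5.1f} GB")
```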
Yeah, so, I mean, maybe this would be a good point to actually talk about how a mixture-of-experts layer is typically laid out on a rack of GPUs or something like that.
Yeah, yeah, makes sense.
Yeah, where were we?
Sparse mixture of experts.
Yes.
Maybe how we lay that out on a GPU.