Dwarkesh Patel
Okay, so... The more sparsity you have, the less compute you need.
And it does seem that as batch sizes get bigger, compute ends up being the bottleneck, according to this analysis.
So then the question is, how far can you take sparsity?
That is to say, as the sparsity ratio increases, as you have fewer and fewer active parameters relative to total parameters, how much is performance of the model degrading?
And is it degrading faster than you're saving compute by increasing the sparsity factor?
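To make the compute side of that question concrete, here is a minimal sketch. It assumes "sparsity ratio" means total parameters divided by active parameters, and it uses the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. The function names and example numbers are hypothetical and not taken from the conversation or the paper under discussion. It says nothing about how quality degrades, which is the other half of the question.

```python
def active_params(total_params: float, sparsity_ratio: float) -> float:
    """Active parameters per token, assuming sparsity_ratio = total / active."""
    if sparsity_ratio < 1:
        raise ValueError("sparsity_ratio must be >= 1 (1 means a dense model)")
    return total_params / sparsity_ratio


def forward_flops_per_token(total_params: float, sparsity_ratio: float) -> float:
    """Approximate forward-pass FLOPs per token (assumed ~2 FLOPs per active parameter)."""
    return 2.0 * active_params(total_params, sparsity_ratio)


def compute_saving(old_ratio: float, new_ratio: float) -> float:
    """Factor by which per-token compute shrinks when sparsity goes from old_ratio to new_ratio
    at a fixed total parameter count."""
    return new_ratio / old_ratio


if __name__ == "__main__":
    total = 1e12  # hypothetical total parameter count
    for ratio in (1, 2, 4, 8, 16):
        flops = forward_flops_per_token(total, ratio)
        print(f"sparsity {ratio:>2}x: {flops:.3e} FLOPs/token")
```

Under these assumptions, doubling the sparsity ratio at a fixed total parameter count halves per-token compute, so the trade is only worthwhile if model quality falls off more slowly than that.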
Should we pull up the paper now?
10x as many active parameters.
Yeah, so while it is true, I guess, that you get this benefit of being able to economize on your compute time if you increase sparsity...
Naively, it would seem like, oh, that's a trade-off worth making.
But if you're decreasing this by 2x and then having this go up by 8x, every time you double...
So let me just make sure I understood.
You're saying we want bigger... We want... Does it mean less time computing?
Therefore, we do more sparsity.
To make that work, we need bigger batch sizes, which means we need more memory capacity.
Yeah, so... To have more sparsity.
So when you say any GPU in the pretense, the router is more than one GPU?
Yeah.
Before we... It may be worth you explaining what exactly a rack is, the differences in bandwidth between racks versus within a rack, and the all-to-all versus not-all-to-all nature of communication within a rack versus outside it.
Sorry, is the question you just asked basically: why isn't it a bigger scale-up?