Reiner Pope
This is bytes per second.
So that's not quite dimensionless.
But what you do is you say multiplies per second times, let's say I'm doing FP4.
So I take how many FP4 multiplies per second
times the fact that each FP4 value is half a byte.
And so I can actually make this end up being dimensionless.
And this ends up being on most GPUs around 300.
Somewhere around 300.
So this is a hardware parameter.
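To make the spoken derivation concrete, here is one way to write the ratio being described; the symbol $R$ and this exact formalization are my own notation, not from the conversation:
$$
R \;=\; \frac{(\text{multiplies per second}) \times (\text{bytes per element})}{\text{memory bandwidth in bytes per second}} \;\approx\; 300
$$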
To what extent has the hardware changed?
So from like A100 to H100 to B100, the flops have increased substantially.
The memory bandwidth has also increased substantially, and the ratio has remained reasonably stable.
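A minimal sketch of that calculation in Python, using approximate public spec numbers that I'm supplying for illustration (the specific GPUs and figures are my assumption, not quoted in the conversation):

```python
def ops_to_byte_ratio(multiplies_per_s: float,
                      bytes_per_element: float,
                      mem_bw_bytes_per_s: float) -> float:
    """Dimensionless hardware ratio: (multiplies/s * bytes per element) / (bytes/s)."""
    return multiplies_per_s * bytes_per_element / mem_bw_bytes_per_s

# Approximate, illustrative spec numbers (dense, no structured sparsity):
# H100 SXM: ~989 TFLOPS BF16 -> ~494e12 multiplies/s, 2 bytes per BF16, ~3.35 TB/s HBM
print(ops_to_byte_ratio(494e12, 2.0, 3.35e12))   # ~295
# B200: ~9 PFLOPS FP4 -> ~4.5e15 multiplies/s, 0.5 bytes per FP4, ~8 TB/s HBM
print(ops_to_byte_ratio(4.5e15, 0.5, 8e12))      # ~281
```

Both land in the neighborhood of 300, which is the stability being described.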
And we can express this one as well.
This is a sparsity parameter.
And I might even phrase it slightly differently.
Let's solve for batch size in total.
So we're just moving this back over to the other side,
and we end up with: batch size needs to be bigger than approximately 300 times sparsity.
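Written out as an inequality (my formalization of the spoken step), with $B$ the batch size, $s$ the sparsity factor, and $R \approx 300$ the hardware ratio from above:
$$
\frac{B}{s} \;\gtrsim\; R
\quad\Longrightarrow\quad
B \;\gtrsim\; R \cdot s \;\approx\; 300 \, s
$$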
So for example, in DeepSeek, I activate 32 out of 256 experts.
So this would be like eight for DeepSeek.
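As a quick plug-in of those numbers as stated here (256 total experts with 32 active per token; the ~300 threshold is the hardware ratio from above, and the exact figures are illustrative):

```python
total_experts = 256
active_experts = 32                          # as stated in the conversation
sparsity = total_experts / active_experts    # 8.0

hardware_ratio = 300                         # the ~300 ops-to-byte ratio from above
min_batch = hardware_ratio * sparsity
print(sparsity, min_batch)                   # 8.0, roughly 2400 tokens to stay compute-bound
```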