Reiner Pope
And so all of those are competing; modern racks are pushing all of them to very extreme physical limits.
Yeah.
Deploying larger scale-up domains is a huge unlock.
I mean, I've drawn here the sort of NVIDIA Blackwell deployment.
The Google deployment has actually had very large scale-up domains.
Not having been there at the time, I'm not sure how much is coming from successfully deploying higher sparsity ratios, which it could be.
It could also be... I mean, there's a whole bunch of actual modeling things, like...
Specifically, how do you do the mixture of experts?
We've seen with the DeepSeek mixture of experts that activating more but finer-grained experts was a big innovation.
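To make that concrete, here is a minimal sketch (not DeepSeek's actual code) of what "more but finer-grained experts" means for routing: top-k gating over many small experts keeps the per-token active compute roughly constant while growing total capacity. All sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def route(tokens, num_experts, k, expert_params):
    """Top-k gating: pick k experts per token from a softmax over expert logits."""
    w_gate = rng.normal(size=(tokens.shape[1], num_experts))
    logits = tokens @ w_gate
    gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    topk = np.argsort(gates, axis=-1)[:, -k:]        # expert ids chosen per token
    return topk, k * expert_params, num_experts * expert_params

tokens = rng.normal(size=(4, 16))                    # 4 tokens, d_model = 16

# Coarse: top-2 of 8 large experts. Fine-grained: top-8 of 64 experts,
# each a quarter the size. Active params per token are identical; total
# capacity (and routing flexibility) is larger in the fine-grained case.
_, active_c, total_c = route(tokens, num_experts=8,  k=2, expert_params=1024)
_, active_f, total_f = route(tokens, num_experts=64, k=8, expert_params=256)
print(active_c, total_c)   # 2048 8192
print(active_f, total_f)   # 2048 16384
```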
I'm sure that there are many other innovations on the model architecture as well as on the training data.
It's kind of hard to disentangle all of them, but here's what shows up in terms of the limits of what you can do.
The active parameters, as we saw, are limited by the compute cost.
And then the total parameters are limited by the scale-up size.
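As a back-of-envelope illustration of those two limits, here is a sketch with assumed hardware numbers (chip FLOP/s, HBM capacity, domain size); none of these are real deployment figures.

```python
# Limit 1: active parameters vs. compute cost.
# Decode FLOPs per token ~= 2 * active_params (standard approximation).
chip_flops   = 1e15          # ~1 PFLOP/s per chip (assumed)
num_chips    = 72            # one rack-scale domain (assumed)
tokens_per_s = 2e5           # target aggregate decode throughput (assumed)
max_active = chip_flops * num_chips / (2 * tokens_per_s)
print(f"active params bound: {max_active:.2e}")   # ~1.8e11

# Limit 2: total parameters vs. scale-up size.
# All weights must fit in the HBM of one scale-up domain.
hbm_per_chip = 192e9         # bytes of HBM per chip (assumed)
bytes_per_p  = 2             # bf16 weights
max_total = num_chips * hbm_per_chip / bytes_per_p
print(f"total params bound:  {max_total:.2e}")    # ~6.9e12
```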
Yeah, really interesting.
Okay, so to answer that question, we're going to need to talk about the communication patterns.
So we've talked about the mixture-of-experts communication pattern.
That is this all-to-all.
All-to-all very strongly favors full connectivity, which is what we've just shown here, and it favors being within one rack.
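Here is a toy simulation of why the pattern wants full connectivity: with experts sharded across devices, each device ends up sending tokens to essentially every other device on every MoE layer. The setup (one expert shard per device, uniform routing) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices = 8
tokens_per_device = 16

# Expert id per token; experts are sharded one-per-device here (assumed).
routing = rng.integers(0, num_devices, size=(num_devices, tokens_per_device))

# traffic[src, dst] = number of tokens device `src` must send to device `dst`
traffic = np.zeros((num_devices, num_devices), dtype=int)
for src in range(num_devices):
    for dst in range(num_devices):
        traffic[src, dst] = np.sum(routing[src] == dst)

print(traffic)
# Most entries are nonzero: each device talks to nearly every other device
# on every layer, so the pattern rewards a fully connected scale-up domain.
print("pairs with traffic:", np.count_nonzero(traffic), "of", num_devices**2)
```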