Reiner Pope
More generally, it's called the scale-up network.
This is the scale-up network.
You will typically also have a scale-out network, which allows you to connect to some data center switch.
So data center switch.
Then all of the GPUs will have some connectivity up to some data center switch somewhere.
This is the scale-out network, and it tends to be about eight times slower in bandwidth.
So the challenge, if you want to, for example, lay out a mixture-of-experts layer across two racks, is that half of the GPUs here are going to be wanting to talk to the GPUs here.
And so just on average, when I look at where the tokens on these GPUs want to go, half of the tokens want to go inside the rack.
That's great.
They can use the fast scale up network.
But half the tokens are going to want to leave the rack and go to the other rack.
And that's not as good.
They're going to need to use a much slower network.
And so that becomes the bottleneck on the all-to-all pattern.
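To put rough numbers on that bottleneck, here is a minimal back-of-the-envelope sketch. It assumes tokens are routed uniformly across racks, that the scale-up and scale-out networks transfer in parallel, and that scale-out bandwidth is 8 times lower, as stated above. The function name `all_to_all_time` and the normalized units are illustrative assumptions, not anything from a real library.

```python
def all_to_all_time(bytes_per_gpu, num_racks, scale_up_bw, slowdown=8.0):
    """Estimate per-GPU all-to-all send time under uniform routing.

    Assumptions (hypothetical model, not a measured result):
      - tokens are spread evenly over all racks, so the fraction leaving
        the rack is (num_racks - 1) / num_racks;
      - scale-out bandwidth is scale_up_bw / slowdown;
      - both networks send concurrently, so the slower transfer sets the time.
    Returns (total_time, in_rack_time, cross_rack_time).
    """
    frac_remote = (num_racks - 1) / num_racks
    scale_out_bw = scale_up_bw / slowdown
    t_local = bytes_per_gpu * (1 - frac_remote) / scale_up_bw
    t_remote = bytes_per_gpu * frac_remote / scale_out_bw
    return max(t_local, t_remote), t_local, t_remote


# Normalized units: 1 unit of data per GPU, scale-up bandwidth of 1.
one_rack = all_to_all_time(1.0, 1, 1.0)
two_racks = all_to_all_time(1.0, 2, 1.0)
print("one rack :", one_rack)
print("two racks:", two_racks)
```

Under these assumptions, with two racks the cross-rack half of the traffic takes 8 times longer than the in-rack half, so the scale-out network sets the overall time even though it carries only half of the tokens.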
A different choice would be, well, why don't I have a big switch here and connect everything to it, a much bigger switch that actually combines the two racks together?
There are many ideas in this direction, but in general, the reason you have this hierarchy of switches rather than one big switch is to manage the cabling congestion.
You just need to run a large number of cables.
Yeah, exactly.
Why not just have a million chips in scale-up?
Rubin will be, I don't know, is it 500-something?
500-something, yeah.
What has allowed that to happen from Hopper to Blackwell is mostly just the decision to switch from trays as the form factor (one of these is a tray) to racks as the form factor.
That's a product decision.
There wasn't a substantial technical barrier there.
Switching from, like, 64 to 500 or so, there's a bit of Jensen math there, but there is at least a genuine 4x increase, which is coming from a much more complicated and difficult rack design.
And so that is actually new physical design to run more cables, and the cable complication is just the