
Reiner Pope

👤 Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

More generally, it's called the scale-up network.

This is the scale-up network.

You will typically also have a scale-out network, which allows you to connect to some data center switch.

So data center switch.

Then all of the GPUs will have some connectivity up to some data center switch somewhere. This is the scale-out network, and it tends to be about eight times slower in bandwidth. So the challenge, if you want to, for example, lay out a mixture-of-experts layer across two racks, is that half of the GPUs here are going to want to talk to the GPUs over here.

And so just on average, when I look at where the tokens on these GPUs want to go, half of the tokens want to go inside the rack.

That's great.

They can use the fast scale-up network.

But half the tokens are going to want to leave the rack and go to the other rack.

And that's not as good.

They're going to need to use a much slower network.

And so that becomes the bottleneck on the all-to-all pattern.
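
A quick back-of-envelope sketch of that bottleneck, in Python. The absolute bandwidth is an assumed placeholder; only the roughly 8x scale-up versus scale-out ratio and the half-in, half-out token split come from the discussion above.

```python
# Back-of-envelope model of the all-to-all bottleneck when a
# mixture-of-experts layer is laid out across two racks. The bandwidth
# figure is an illustrative placeholder; only the ~8x scale-up vs.
# scale-out ratio comes from the discussion above.

SCALE_UP_GBPS = 400.0               # per-GPU in-rack bandwidth (GB/s), assumed
SCALE_OUT_GBPS = SCALE_UP_GBPS / 8  # cross-rack bandwidth, ~8x slower

def all_to_all_time(gb_per_gpu: float) -> float:
    """Seconds for one GPU to route its tokens when destinations are
    uniform across both racks: half stay in-rack, half cross over."""
    t_in_rack = (gb_per_gpu / 2) / SCALE_UP_GBPS
    t_cross_rack = (gb_per_gpu / 2) / SCALE_OUT_GBPS
    # The two transfers can proceed in parallel, so the slower
    # (cross-rack) path sets the pace for the whole all-to-all.
    return max(t_in_rack, t_cross_rack)

# Versus keeping everything on the fast in-rack network:
single_rack = 1.0 / SCALE_UP_GBPS    # 1 GB of tokens, all in-rack
two_racks = all_to_all_time(1.0)     # same 1 GB, split across racks
print(f"slowdown from crossing racks: {two_racks / single_rack:.1f}x")  # 4.0x
```

Under this uniform-routing assumption, splitting across racks costs about 4x even though only half the traffic leaves the rack, because that half moves at an eighth of the speed.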

A different choice would be: well, why don't I have a big switch here and connect everything to some much bigger switch that actually combines the two racks together?

There are many ideas in this direction, but in general, the reason you have this hierarchy of switches rather than one big switch is to manage the cabling congestion.

You just need to run a large number of cables.

Yeah, exactly.
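
A rough counting sketch of that cabling pressure. The GPU count, links per GPU, and oversubscription factor are hypothetical placeholders, not figures from the episode:

```python
# Rough counting behind "you just need to run a large number of cables".
# Every number here is an assumed placeholder, not a figure from the episode.

GPUS = 144             # e.g. two racks of 72 GPUs (assumption)
LINKS_PER_GPU = 18     # scale-up links per GPU (assumption)
OVERSUBSCRIPTION = 8   # scale-out tier is ~8x thinner, per the discussion

# One big flat switch combining both racks: every GPU link becomes a
# long cable run converging on a single switching layer.
flat_cables = GPUS * LINKS_PER_GPU                        # 2592 long runs

# Hierarchy: full-bandwidth cabling stays short, inside each rack, and
# only a thinner oversubscribed set of uplinks leaves the rack.
uplink_cables = GPUS * LINKS_PER_GPU // OVERSUBSCRIPTION  # 324 long runs

print(f"flat switch: {flat_cables} long cables; "
      f"hierarchy: {uplink_cables} long cables + local in-rack wiring")
```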

Why not just have a million chips in scale-up?

Rubin will be, I don't know, is it 500-something? 500-something, yeah. What has allowed that to happen from Hopper to Blackwell is mostly just the decision to switch from trays as the form factor (one of these is a tray) to racks as the form factor. That's a product decision; there wasn't a substantial technical barrier there.

Switching from the 64 or so to 500 or so, there's a bit of chintz in the math there, but there is at least a genuine 4x increase, which is coming from a much more complicated and difficult rack design. So that is actually new physical design to run more cables, and the cable complication is just the…
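
Taking the figures in this last exchange at face value (the 8-GPU tray size is an added assumption; "64 or so" and "500-something" are the speaker's own rough numbers):

```python
# Scale-up domain sizes as described in this exchange. The counts are
# the speaker's rough figures plus one assumption, not authoritative specs.
scale_up_chips = {
    "Hopper (tray form factor)": 8,       # assumption: one 8-GPU tray
    "Blackwell (rack form factor)": 64,   # "the 64 or so"
    "Rubin": 500,                         # "500-something"
}

headline = scale_up_chips["Rubin"] / scale_up_chips["Blackwell (rack form factor)"]
print(f"headline growth: ~{headline:.1f}x")  # ~7.8x on paper
# After discounting the "chintz in the math" (changed counting
# conventions), the speaker credits only a genuine ~4x, coming from
# the more complicated rack design and the extra cables it can run.
```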