
Reiner Pope

👤 Speaker
1157 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

We have a divisibility problem. This is not a power of two. So we'll just simplify and say we're only going to use 64 of them. Just ignore the other eight. It's not a big deal. And so we have four experts per GPU. Very simple. For the sake of the diagram, I'll actually just say, let's say we have two experts per GPU. So we end up just placing them like this; these are the GPU boundaries. Every pair of experts is on its own GPU.
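The placement described above, dropping the leftover experts so the count divides evenly and then assigning experts to GPUs contiguously, can be sketched as follows. This is an illustrative reconstruction, not code from the talk; the function name and contiguous assignment are assumptions.

```python
def place_experts(num_experts: int, num_gpus: int) -> dict[int, list[int]]:
    """Map each GPU to the experts it hosts, ignoring leftover experts.

    Hypothetical helper: drops num_experts down to the nearest multiple
    of num_gpus (e.g. 72 experts on 16 GPUs -> keep 64, ignore 8), then
    assigns equal contiguous slices of experts to each GPU.
    """
    usable = num_experts - (num_experts % num_gpus)  # 72 -> 64
    per_gpu = usable // num_gpus                     # 64 / 16 = 4
    return {g: list(range(g * per_gpu, (g + 1) * per_gpu))
            for g in range(num_gpus)}

placement = place_experts(72, 16)
# Each of the 16 GPUs hosts 4 experts; experts 64..71 are simply unused.
```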

And then we can look at the communication cost. We had some tokens stored centrally here. They get routed to all of these experts. And so there is some communication cost paid here. There's the same communication cost paid on the output. And then the hope is that this does not become communication limited.

Now, what is the traffic pattern here? The traffic pattern here is that any GPU, in fact, will be talking to any other GPU, depending on the decisions made by the model. So this is an all-to-all traffic pattern. Yeah.
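The all-to-all pattern can be made concrete with a small simulation. The sketch below uses random routing as a stand-in for the model's learned router (an assumption for illustration, as are the sizes): tokens resident on each source GPU are assigned to experts, each expert lives on some GPU, and the resulting GPU-to-GPU traffic matrix ends up dense, i.e. every GPU sends tokens to every other GPU.

```python
import random

# Illustrative parameters (not from the talk): 16 GPUs, 4 experts each.
NUM_GPUS = 16
EXPERTS_PER_GPU = 4
NUM_EXPERTS = NUM_GPUS * EXPERTS_PER_GPU  # 64 usable experts
TOKENS_PER_GPU = 1024

random.seed(0)
# traffic[s][d] = number of tokens GPU s sends to GPU d during dispatch.
traffic = [[0] * NUM_GPUS for _ in range(NUM_GPUS)]
for src_gpu in range(NUM_GPUS):
    for _ in range(TOKENS_PER_GPU):
        expert = random.randrange(NUM_EXPERTS)  # router's choice (random here)
        dst_gpu = expert // EXPERTS_PER_GPU     # GPU hosting that expert
        traffic[src_gpu][dst_gpu] += 1

# Every (src, dst) pair carries nonzero traffic: an all-to-all pattern.
# The same matrix transposed describes the return (combine) step, which is
# why the output pays the same communication cost as the input.
dense = all(traffic[s][d] > 0
            for s in range(NUM_GPUS) for d in range(NUM_GPUS))
```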