
Reiner Pope

Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

And then we're also maybe doing it many times.

So that's going to be what makes the difference.

Yeah, so number of activated GPUs, right?

So, like, I don't send to this GPU at all, right?

So there's an explosion from one to, like, three times larger here in this diagram.

Yeah.

The key thing is that I didn't even need to send to this GPU at all, and so that's a big saving.
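To make the saving concrete, here's a minimal Python sketch of the idea, assuming a toy setup where each GPU hosts a contiguous slice of the experts; all names and numbers here are illustrative, not from the episode. The token only has to be dispatched to GPUs that host at least one of its activated experts, so the rest receive nothing at all.

```python
# Illustrative sketch (hypothetical names and numbers): which GPUs actually
# receive a token in MoE expert-parallel dispatch. GPUs hosting none of the
# token's activated experts are skipped entirely -- the saving discussed above.

NUM_EXPERTS = 64          # total routed experts in the layer (assumed)
NUM_GPUS = 8              # expert-parallel group size (assumed)
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS

def gpus_for_token(activated_experts: list[int]) -> set[int]:
    """Return the set of GPUs that must receive this token's activations."""
    return {e // EXPERTS_PER_GPU for e in activated_experts}

# Example: 8 activated experts that happen to cluster on 3 of the 8 GPUs.
activated = [0, 3, 5, 17, 18, 40, 42, 45]
dest = gpus_for_token(activated)
print(f"send to GPUs {sorted(dest)}; skip the other {NUM_GPUS - len(dest)} GPUs")
```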

I see, yeah.

Okay, so we're going to talk through to what extent scale-up is a bottleneck over scale-out, and how much of a slowdown that is.

So we'll jump directly to the ratio of the time spent on scale-up over the time spent on scale-out.

So this is the quantity we're talking about.
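As a worked equation, using notation of my own (not from the episode), the quantity is:

```latex
% Ratio under discussion (notation mine, not from the episode):
\[
  R = \frac{T_{\text{scale-up}}}{T_{\text{scale-out}}}
\]
```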

And the first consideration is that scale-up is generally eight times faster than scale-out.

And so at a baseline, if the amounts of data sent were the same, we would have this one over eight, which comes from the bandwidths.
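A quick sketch of where the one over eight comes from: transfer time is data divided by bandwidth, so if both networks carried the same data D, the ratio would reduce to the inverse of the bandwidth ratio.

```latex
% Baseline: same data D over both networks; scale-up bandwidth is 8x scale-out.
\[
  T = \frac{D}{B}
  \quad\Longrightarrow\quad
  \frac{T_{\text{scale-up}}}{T_{\text{scale-out}}}
    = \frac{D / B_{\text{scale-up}}}{D / B_{\text{scale-out}}}
    = \frac{B_{\text{scale-out}}}{B_{\text{scale-up}}}
    = \frac{1}{8}
\]
```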

But then we have some amount of expansion in how much data we're sending.

So if one token comes in here, then this one token gets routed to some number of experts; in the DeepSeek case, it'll get routed to maybe 32 experts, or 16 experts.

So this is the number of activated experts.
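Folding that in as a sketch, with E standing for the number of activated experts per token (my notation), the data crossing scale-up grows by a factor of E:

```latex
% E = activated experts per token; each expert copy of the token crosses scale-up.
\[
  \frac{T_{\text{scale-up}}}{T_{\text{scale-out}}} = \frac{1}{8} \times E
\]
```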

And then the same thing applies across multiple layers.

So maybe I'm going to run two layers.

So there's also a multiplier of the number of layers per stage.

Yes, yes.
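Putting the pieces together, here is the combined ratio as a worked example. The numbers are illustrative, taken from the figures mentioned above (16 activated experts, 2 layers per stage), not a definitive measurement:

```latex
% Combined ratio with L = layers per stage; illustrative numbers E = 16, L = 2.
\[
  \frac{T_{\text{scale-up}}}{T_{\text{scale-out}}}
    = \frac{1}{8} \times E \times L
    = \frac{1}{8} \times 16 \times 2
    = 4
\]
```

Under these assumptions the scale-up network is busy about four times as long as the scale-out network, which is the sense in which scale-up can become the bottleneck.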