
Reiner Pope

Speaker
1157 total appearances


Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

With a small number of pipelining stages, this is not a huge latency impact.

Yeah, if it goes from 20 to 30, right?

Or something like that, yeah.

So just to chart the path that it goes through, here you're going from your GPU or TPU or whatever to a network card, which then goes to a top-of-rack switch.

And then it hops over to the other rack and does the same thing in reverse.

So you sort of have to sum up the latencies of these different things.

So this is the same thing as the DC switch?

It may, in fact, go up to a DC switch and back.

Depends on deployment configuration.
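
To make the "sum up the latencies" point concrete, here is a rough sketch of the per-hop accounting for a cross-rack transfer. Every number is an illustrative placeholder I've assumed for the example, not a figure from the conversation, and whether the path goes up to a DC switch depends on the deployment, as noted above.

```python
# Rough sketch of summing per-hop latencies for a cross-rack transfer:
# GPU/TPU -> network card -> top-of-rack switch -> (optionally up to a
# DC switch and back) -> remote top-of-rack switch -> network card -> GPU/TPU.
# Every number below is an assumed placeholder, not a figure from the episode.

def cross_rack_latency_us(via_dc_switch: bool) -> float:
    nic_us = 1.0         # assumed network-card traversal time, each direction
    tor_us = 0.5         # assumed top-of-rack switch latency
    dc_switch_us = 1.0   # assumed DC switch latency, if the path goes up a layer
    link_us = 0.2        # assumed per-link propagation/serialization time

    hops = [nic_us, link_us, tor_us]              # sending side
    if via_dc_switch:
        hops += [link_us, dc_switch_us, link_us]  # up to the DC switch and back down
    hops += [tor_us, link_us, nic_us]             # receiving side, same thing in reverse

    return sum(hops)

print(cross_rack_latency_us(via_dc_switch=False))  # rack-to-rack via ToR switches only
print(cross_rack_latency_us(via_dc_switch=True))   # path that also traverses a DC switch
```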

Yeah, so, I mean, we talked about the latency of this hop.

There is also the T_mem latency, the memory time latency, which is actually substantially, massively improved by larger scale-up domains.

So...

I'll recall T_mem down here.

T_mem for the weights, T_mem of the weights.

This was equal to the total number of parameters divided by the memory bandwidth.
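
Written out, the quantity being recalled here is roughly the following (the bytes-per-parameter factor is my addition so the units work out; it depends on the serving precision, e.g. 2 bytes per parameter for bf16):

$$ T_{\text{mem}} = \frac{P_{\text{total}} \cdot b}{BW_{\text{mem}}} $$

where $P_{\text{total}}$ is the total parameter count, $b$ the bytes per parameter, and $BW_{\text{mem}}$ the memory bandwidth available to load them.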

Which memory bandwidth are we talking about here?

Is it just one GPU? It is, in fact, the number of GPUs that I can use in parallel to load these weights.

So I can't use different pipeline stages in parallel because they're not running at the same time, but I can use all the GPUs in my scale-up domain in parallel to load the weights.
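
As a minimal sketch of that point, assuming the weights are sharded across every chip in the scale-up domain (e.g. via tensor parallelism) so each decode step only loads its own shard, the bandwidth in the T_mem term is the aggregate over the domain rather than a single chip's. The model size, precision, and per-chip bandwidth below are assumptions for illustration, not numbers from the episode.

```python
# Minimal sketch: the memory bandwidth in T_mem is the aggregate over all
# chips in the scale-up domain, because they load their weight shards in
# parallel. Pipeline stages don't count toward this: they run at different
# times, not at the same time. All concrete values below are assumptions.

def t_mem_seconds(total_params: float, bytes_per_param: float,
                  hbm_bw_per_chip: float, chips_in_domain: int) -> float:
    total_bytes = total_params * bytes_per_param
    aggregate_bw = hbm_bw_per_chip * chips_in_domain
    return total_bytes / aggregate_bw

# Hypothetical example: a 500B-parameter model served in bf16 (2 bytes/param)
# on chips with 8 TB/s of HBM bandwidth each.
one_chip = t_mem_seconds(500e9, 2, 8e12, chips_in_domain=1)
domain_64 = t_mem_seconds(500e9, 2, 8e12, chips_in_domain=64)
print(f"T_mem, 1 chip:   {one_chip * 1e3:.1f} ms per decoded token")
print(f"T_mem, 64 chips: {domain_64 * 1e3:.2f} ms per decoded token")
```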

And so this is actually extremely effective.

So basically I end up with a term here, this memory bandwidth term itself,