Reiner Pope

Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yeah.

So we'll start off with just memory capacity without even thinking about parallelism scheme.

And so the capacity of memory, or the demand on memory, is the total number of parameters.

So this is what we need to fit the weights in some system that we are using.

And then we need to fit the KVs as well.

So KVs go as batch size times the length of the context times the bytes per token.
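
As a back-of-the-envelope sketch of the two memory demands just described, weights plus KV cache, here is a minimal calculation. All numeric values are illustrative assumptions, not figures from the episode, and the KV bytes-per-token figure in particular depends heavily on the model architecture.

```python
# Minimal sketch of the total memory demand: weight bytes plus KV-cache bytes.
# All numbers below are illustrative assumptions, not values from the episode.

n_total_params = 500e9       # assumed total parameter count
bytes_per_param = 2          # assumed 16-bit weights

batch_size = 256             # assumed number of concurrent sequences
context_len = 8192           # assumed context length in tokens
kv_bytes_per_token = 500e3   # assumed KV-cache bytes per token (model dependent)

weight_bytes = n_total_params * bytes_per_param
kv_bytes = batch_size * context_len * kv_bytes_per_token

print(f"weights:  {weight_bytes / 1e9:.0f} GB")
print(f"KV cache: {kv_bytes / 1e9:.0f} GB")
print(f"total:    {(weight_bytes + kv_bytes) / 1e9:.0f} GB")
```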

Okay, so...

What I was arguing about in this context, and the case I was making for pipelining, is that there are actually some techniques that allow us to solve this.

So let's consider. We're going to run this on some number of GPUs, and we're going to have one extent, E, which is going to be the expert parallelism.

So when we have this sharding of the expert layer across many GPUs, to what extent do we do that?

How many GPUs?

So we're going to say that this is, for example, 64.

And then P is going to be the extent of pipelining.

And so this is the number of racks, which, who knows, maybe we'll pick four or something like that.
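
For concreteness, here is a tiny sketch of the parallelism setup just described, with E = 64 and P = 4 as mentioned; treating the total GPU count as E times P is an assumption implied by the later step of dividing the memory demand by both extents.

```python
# Parallelism extents from the discussion (values as mentioned: 64 and 4).
E = 64   # expert parallelism: how many ways the expert layers are sharded
P = 4    # pipeline parallelism: number of racks / pipeline stages

# Assuming the two dimensions compose, the serving system spans E * P GPUs.
n_gpus = E * P
print(f"GPUs in the serving system: {n_gpus}")   # 256 under this assumption
```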

So what we have so far is the total memory requirement across the whole system.

But now I'm going to calculate a memory requirement per GPU.

So for the per-GPU memory requirement, we're going to have, I guess I'll use a lowercase c, mem.

And well, obviously, we just take all of these numbers and divide them by E and P. Really easy.

So it's this N total plus the batch size times the length of context times bytes per token.

All of this is divided by E and P.
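
Putting the pieces together, here is a self-contained sketch of that per-GPU memory requirement: the total demand, parameters plus KV cache, divided across the E times P GPUs. Again, every numeric value other than E = 64 and P = 4 is an illustrative assumption rather than a figure from the episode.

```python
# Per-GPU memory: (N_total * bytes/param + B * L * KV bytes/token) / (E * P).
# Model and batch numbers below are illustrative assumptions.

n_total_params = 500e9       # assumed total parameter count
bytes_per_param = 2          # assumed 16-bit weights
batch_size = 256             # assumed concurrent sequences
context_len = 8192           # assumed context length in tokens
kv_bytes_per_token = 500e3   # assumed KV-cache bytes per token (model dependent)

E, P = 64, 4                 # expert-parallel and pipeline extents from the discussion

total_bytes = (n_total_params * bytes_per_param
               + batch_size * context_len * kv_bytes_per_token)
per_gpu_bytes = total_bytes / (E * P)

print(f"per-GPU memory: {per_gpu_bytes / 1e9:.1f} GB")  # compare against one GPU's HBM capacity
```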