
Reiner Pope

👤 Speaker
1157 total appearances

[Chart: Appearances Over Time]

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

And so what is the minimum work you can do per batch after amortizing everything else away?

You can just solve for that, actually.

And it's not even particularly sensitive to model architecture.

So let's go ahead and do that.

So what we're talking about is: we're going to set the memory time equal to the compute time.

That's what that question is.
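
In symbols (notation mine, not from the transcript):

```latex
T_{\text{memory}} \;=\; T_{\text{compute}}
```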

For now, I'm going to discard the... Because we're focused on what the batch size is, and the real question is when the weights are amortized over the multiplies, I'm going to focus on comparing the weight fetch time to the weight multiply time.

I'm going to disregard the KV fetch term just to simplify the analysis, so we can get a clean answer out.

So we're going to equate these two times.

So writing that out, we get N, the number of total parameters, over the memory bandwidth is equal to the batch size times the number of active parameters divided by the compute performance.
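
Written as an equation (a reconstruction of the sentence above; the symbol names are my own), where N_total is the total parameter count, BW the memory bandwidth in parameters per second, B the batch size, N_active the active parameter count, and C the compute performance in multiplies per second:

```latex
\frac{N_{\text{total}}}{BW} \;=\; \frac{B \cdot N_{\text{active}}}{C}
```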

So looking over here, everything on the top, these are model parameters.

Everything on the bottom, these are hardware parameters.

It turns out to be nice to rearrange them such that we have the hardware parameters on one side.

So this is equivalent to the compute performance divided by the memory bandwidth being equal to the batch size times the number of active parameters divided by the number of total parameters.
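
Rearranged with the hardware terms on one side, and solved for the batch size (same assumed notation as above):

```latex
\frac{C}{BW} \;=\; \frac{B \cdot N_{\text{active}}}{N_{\text{total}}}
\qquad\Longrightarrow\qquad
B \;=\; \frac{C}{BW} \cdot \frac{N_{\text{total}}}{N_{\text{active}}}
```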

So this is a hardware parameter.

Actually, this ends up being a dimensionless constant.

If you look in terms of flops, what are the dimensions of this?

This is multiplies per second.
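
A minimal numeric sketch of this solve in Python. The hardware numbers are illustrative assumptions (roughly H100-class figures, not values from the transcript), and the function name is mine:

```python
# Solve for the critical batch size at which weight-fetch time equals
# weight-multiply time, per the equation above. All hardware numbers are
# illustrative assumptions (roughly H100-class), not from the transcript.

COMPUTE_MULTIPLIES_PER_S = 989e12 / 2  # ~989 TFLOP/s bf16 dense; 1 multiply-add = 2 flops
MEM_BANDWIDTH_BYTES_PER_S = 3.35e12    # ~3.35 TB/s HBM
BYTES_PER_PARAM = 2                    # bf16 weights

def critical_batch_size(n_total_params: float, n_active_params: float) -> float:
    """Batch size where memory time equals compute time: B = (C/BW) * (N_total/N_active)."""
    params_fetched_per_s = MEM_BANDWIDTH_BYTES_PER_S / BYTES_PER_PARAM
    hardware_ratio = COMPUTE_MULTIPLIES_PER_S / params_fetched_per_s  # dimensionless
    return hardware_ratio * n_total_params / n_active_params

# Dense model: all parameters are active, so N_total / N_active = 1.
print(critical_batch_size(70e9, 70e9))      # ~295 tokens per batch
# Sparse (MoE) model with 1/8 of parameters active per token: ~8x larger.
print(critical_batch_size(70e9, 70e9 / 8))  # ~2360 tokens per batch
```

With matched units (multiplies per second over parameters per second), the hardware ratio is dimensionless, which is the point made above; the answer then scales only with the total-to-active parameter ratio, not the rest of the model architecture.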