
Reiner Pope

👤 Speaker
1157 total appearances

[Chart: Appearances Over Time]

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

…which is linear in batch size.
So it looks like that.
So the sum of this plus this, maxed with this.
So let's at least first draw the sum.
So the two memory times together end up looking like this curved slope.
And then the overall maximum, I'll draw it a little thicker here, is the maximum of these two curves.
Make sense?
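
To make the picture being drawn concrete, here is a minimal sketch of the model in Python. All hardware and model numbers below are made-up assumptions (a hypothetical 70B-parameter model on a hypothetical chip), not figures from the talk:

```python
# Per-decode-step latency: memory time maxed with compute time.
# Every number here is an assumption chosen only to illustrate the shape.

PARAM_COUNT = 70e9              # hypothetical 70B-parameter model
PARAM_BYTES = PARAM_COUNT * 2   # weights stored in bf16, 2 bytes each
KV_BYTES_PER_TOKEN = 0.33e6     # hypothetical KV-cache bytes per token per sequence
CONTEXT_LEN = 1024              # hypothetical context length
HBM_BANDWIDTH = 3.35e12         # hypothetical ~3.35 TB/s memory bandwidth
PEAK_FLOPS = 1e15               # hypothetical ~1 PFLOP/s matmul throughput

def step_latency(batch_size: int) -> float:
    # Memory time is the sum of two parts: reading the weights (independent
    # of batch size) and reading the KV cache (linear in batch size).
    weight_time = PARAM_BYTES / HBM_BANDWIDTH
    kv_time = batch_size * CONTEXT_LEN * KV_BYTES_PER_TOKEN / HBM_BANDWIDTH
    memory_time = weight_time + kv_time
    # Compute time: roughly 2 FLOPs per parameter per token, linear in batch size.
    compute_time = 2 * PARAM_COUNT * batch_size / PEAK_FLOPS
    # "The sum of this plus this, maxed with this."
    return max(memory_time, compute_time)

for b in (1, 64, 512, 2048):
    print(f"batch {b:4d}: {step_latency(b) * 1e3:7.1f} ms/step")
```

At small batch sizes the weight read dominates and the curve is nearly flat; as the batch grows, the linear KV and compute terms take over and the curve bends upward.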

Okay, so what does this mean actually?
So this is a latency plot.
So if I grow my batch size, initially the latency doesn't depend very strongly on batch size.
And so there's some lower bound on latency here, a latency lower bound.
So this already partially answers the question.
For a given hardware configuration (we can talk later about varying it), there is a lower bound on latency, which is simply that I need to read all of my parameters from memory into the chips.
And that takes a certain amount of time.
If I use all of my memory bandwidth, I can't do any better than that.
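
With the same made-up numbers as the sketch above, that floor is just the weight bytes divided by the memory bandwidth:

```python
# Every decode step has to stream all the weights from memory at least once,
# so even at batch size 1 a step can't beat weight_bytes / bandwidth.
PARAM_BYTES = 70e9 * 2      # hypothetical 70B-parameter model in bf16
HBM_BANDWIDTH = 3.35e12     # hypothetical ~3.35 TB/s memory bandwidth

latency_floor = PARAM_BYTES / HBM_BANDWIDTH
print(f"latency floor: {latency_floor * 1e3:.1f} ms/token")  # ~41.8 ms
```

Varying the hardware configuration moves the floor: sharding the weights across N chips splits the weight read across N memory systems, dividing this bound by roughly N.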

Yeah, this is really sensitive to the context length.
So I think we should come back and explore this.
As you vary the context length, the KV fetch time will go up and up.
And so that'll cause a transition from compute limited to memory limited.
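
Here is a sketch of that transition with the same assumed numbers: hold the batch size fixed and grow the context length, and the KV fetch term rises until memory time overtakes compute time:

```python
# Same made-up model and chip numbers as above.
PARAM_COUNT = 70e9
PARAM_BYTES = PARAM_COUNT * 2   # bf16 weights
KV_BYTES_PER_TOKEN = 0.33e6     # hypothetical KV-cache bytes per token per sequence
HBM_BANDWIDTH = 3.35e12
PEAK_FLOPS = 1e15
BATCH = 2048  # hypothetical batch, large enough to start out compute-limited

for context_len in (256, 1024, 4096, 16384):
    # KV bytes fetched per step grow linearly with context length.
    memory_time = (PARAM_BYTES + BATCH * context_len * KV_BYTES_PER_TOKEN) / HBM_BANDWIDTH
    compute_time = 2 * PARAM_COUNT * BATCH / PEAK_FLOPS
    regime = "memory-limited" if memory_time > compute_time else "compute-limited"
    print(f"context {context_len:6d}: {regime}")
```

Under these assumptions the crossover lands between 1k and 4k tokens of context; different model or chip numbers move the crossover point, but the direction of the transition is the same.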