
Reiner Pope

👤 Speaker
1157 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Like we have to imagine dividing each of these three curves by B, so multiplying by this reciprocal.

And so what we end up with is: the compute curve was linear, we divide by B, and that makes it a constant here.

This is T_compute.

The KV fetch was linear, now it becomes a constant as well.

And then the weight fetch was constant, and now we divide by B, and so it becomes this hyperbola.
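The division step above can be sketched numerically. A minimal sketch, assuming illustrative constants c, k, w for the compute, KV-fetch, and weight-fetch costs (these numbers are not from the talk):

```python
# Hypothetical per-step costs as functions of batch size B (constants c, k, w
# are illustrative assumptions, not figures from the episode):
#   compute      ~ c * B  (linear in B)
#   KV fetch     ~ k * B  (linear in B: each batch element has its own KV cache)
#   weight fetch ~ w      (constant: the weights are read once per step)
c, k, w = 1.0, 0.5, 100.0

batch_sizes = [1, 2, 4, 8, 16, 32, 64]

# Dividing each curve by B gives the per-token cost:
compute_per_tok = [(c * B) / B for B in batch_sizes]  # constant c
kv_per_tok      = [(k * B) / B for B in batch_sizes]  # constant k
weight_per_tok  = [w / B for B in batch_sizes]        # hyperbola w / B
```

The two linear curves flatten into constants, while the constant weight-fetch curve turns into the hyperbola w / B.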

And so again, we're going to compute the max of the sum.

So the sum of these two terms shifts the hyperbola up.

The sum of the KV fetch and the weight fetch gives us a higher hyperbola that's like this.

And then we're going to take the max with the compute here.

So we end up with this being the overall shape that we care about.
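The "max of the sum" construction described here can be sketched as one function. The constants and the name per_token_cost are illustrative assumptions, not the talk's actual numbers:

```python
def per_token_cost(B, c=1.0, k=0.5, w=100.0):
    """Toy per-token cost model: max of the compute term and the
    sum of the two memory terms, after dividing everything by B."""
    compute = c           # was linear in B, now a constant
    kv_fetch = k          # was linear in B, now a constant
    weight_fetch = w / B  # was constant, now a hyperbola
    # Sum the two memory-bound terms, then take the max with compute:
    return max(compute, kv_fetch + weight_fetch)
```

At small B the weight-fetch hyperbola dominates the max; at large B the constant compute term takes over.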

So again, we see some limiting behavior.

The cost initially starts very high at batch size of one.

Actually, it almost goes to infinity.

It's because we've got so many weight fetches which are not amortized over a large batch size.

But then as we increase the batch size, the weight fetches become amortized over so many different batch elements that their cost grows very small.

And eventually, the compute time ends up driving the cost.

So there is a limiting lower bound on cost, which is this one here.
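Both limits can be checked numerically with the same kind of toy constants (assumed for illustration, not taken from the episode):

```python
c, k, w = 1.0, 0.5, 100.0  # illustrative compute / KV-fetch / weight-fetch constants

# Per-token cost = max(compute, KV fetch + weight fetch / B):
cost = {B: max(c, k + w / B) for B in (1, 10, 100, 1000)}

# At B = 1 the weight fetches are not amortized, so the cost is very high;
# as B grows, the weight term shrinks toward zero and the compute term c
# becomes the lower bound on cost.
```

This reproduces the shape in the discussion: a near-vertical spike at batch size one, falling toward a compute-set floor as the batch grows.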

Yeah, they're unique per batch.

The compute is also unique per batch.