Dwarkesh Patel

Speaker
15267 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

The compute time?

But suppose it's like...

This is a very simple algebra problem, but suppose the optimal is 100k context length.

And you go to 200k context length.

Does your MFU go down to like 50%?

Does it have a humongous impact on MFU?

Yeah, it does.

To be, like, slightly outside of the optimal context-length range, the Goldilocks zone.

Got it.

And is sparse attention what everybody uses in practice?

So Claude Code slow or Codex slow or whatever would just live on this line, and it wouldn't help much because you're not able to amortize the KV values over a much bigger batch.

So this point where you are no longer memory bandwidth bound,

How big a batch do you need?

How big are the batches practically for frontier models?

Sorry, has that ratio changed over time as we've gone from model generation to model generation, where the FLOPs keep increasing?

Okay, so basically it's like 2,000 to 3,000 tokens per batch.

But then if you included the KV cache, the implication would be that the optimal batch size should grow larger.

This seems incredibly small.

Like a batch, this would be like less than one sequence, right?

A single forward pass, yeah, on these sequences. This is like... do you think of the batch as the number of sequences, rather than, like... That's right. Okay, cool, yeah.