Reiner Pope

👤 Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

This is bytes per second. So that's not quite dimensionless. But what you do is say multiplies per second times, let's say I'm doing FP4: so I take how many FP4 multiplies per second, times the fact that each FP4 value is half a byte. And so I can actually make this end up being dimensionless. And this ends up being around 300 on most GPUs. Somewhere around 300. So this is a hardware parameter.
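
As a worked version of that ratio (a sketch only; the throughput and bandwidth figures are rough Blackwell-class numbers, used as assumptions rather than exact specs):

    \[
    R \;=\; \frac{(\text{FP4 multiplies/s}) \times (0.5\ \text{bytes per FP4 value})}{\text{memory bytes/s}}
    \;\approx\; \frac{(4.5 \times 10^{15}) \times 0.5}{8 \times 10^{12}}
    \;\approx\; 280
    \]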

To what extent has the hardware changed?

So from A100 to H100 to B100, the FLOPS have increased substantially. The memory bandwidth has also increased substantially, and the ratio between the two has remained reasonably stable.
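
A minimal sketch of that trend, using approximate public spec figures as assumptions (B200 stands in for Blackwell here; exact numbers vary by SKU and precision):

    # Flops-to-bandwidth ratio across GPU generations.
    # All spec numbers below are approximate public figures, treated as assumptions.
    # Note: one multiply-accumulate counts as 2 FLOPs, so multiplies/s = FLOPS / 2.
    specs = {
        # name: (multiplies/s at the narrowest matmul precision,
        #        bytes per value, memory bytes/s)
        "A100": (156e12, 2.0, 2.00e12),  # ~312 TFLOPS BF16 dense, ~2 TB/s HBM2e
        "H100": (495e12, 2.0, 3.35e12),  # ~989 TFLOPS BF16 dense, ~3.35 TB/s HBM3
        "B200": (4.5e15, 0.5, 8.00e12),  # ~9 PFLOPS FP4 dense, ~8 TB/s HBM3e
    }

    for name, (mults, bytes_per_value, mem_bw) in specs.items():
        ratio = mults * bytes_per_value / mem_bw  # dimensionless
        print(f"{name}: ratio ~ {ratio:.0f}")

    # Prints roughly 156, 295, 281: numerator and denominator both grew a lot,
    # while the ratio stayed in the low hundreds.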

And we can express this one as well. This is a sparsity parameter. And I might even phrase it slightly differently: let's solve for batch size in total. So we're just moving this back over to the other side, and we end up with: batch size needs to be bigger than approximately 300 times sparsity.
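
One way to see where that inequality comes from (a sketch of the roofline reasoning; the parameter count P is introduced here just for illustration): pushing a batch of B tokens through a weight matrix with P parameters takes about B times P multiplies while moving about P weight values from memory, and with a sparsity factor s each expert only sees roughly B / s of the tokens. Staying compute-bound then requires:

    \[
    \frac{B}{s} \cdot \frac{P}{\text{multiplies/s}} \;\ge\; \frac{P \times \text{bytes per value}}{\text{memory bytes/s}}
    \quad\Longrightarrow\quad
    B \;\gtrsim\; R\,s \;\approx\; 300\,s
    \]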

So for example, in DeepSeek, I activate 32 out of 256 experts. So this would be like eight for DeepSeek.
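
Plugging in those numbers:

    \[
    s \;=\; \frac{256}{32} \;=\; 8
    \qquad\Longrightarrow\qquad
    B \;\gtrsim\; 300 \times 8 \;=\; 2400\ \text{tokens}
    \]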