Reiner Pope
This is bytes per second.
So that's not quite dimensionless.
But what you do is you say multiplies per second times, let's say I'm doing FP4.
So I take how many FP4 multiplies per second
times the fact that each FP4 value is half a byte.
And so I can actually make this end up being dimensionless.
And this ends up being on most GPUs around 300.
Somewhere around 300.
So this is a hardware parameter.
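To make the spoken derivation concrete, here is one way to write the ratio being described; the symbol $R$ and this exact formalization are my own notation, not from the conversation:
$$
R \;=\; \frac{(\text{multiplies per second}) \times (\text{bytes per element})}{\text{memory bandwidth in bytes per second}} \;\approx\; 300
$$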
To what extent has the hardware changed?
So from like A100 to H100 to B100, the flops have increased substantially.
The memory bandwidth has also increased substantially, and the ratio has remained reasonably stable.
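A minimal sketch of that calculation in Python, using approximate public spec numbers that I'm supplying for illustration (the specific GPUs and figures are my assumption, not quoted in the conversation):

```python
def ops_to_byte_ratio(multiplies_per_s: float,
                      bytes_per_element: float,
                      mem_bw_bytes_per_s: float) -> float:
    """Dimensionless hardware ratio: (multiplies/s * bytes per element) / (bytes/s)."""
    return multiplies_per_s * bytes_per_element / mem_bw_bytes_per_s

# Approximate, illustrative spec numbers (dense, no structured sparsity):
# H100 SXM: ~989 TFLOPS BF16 -> ~494e12 multiplies/s, 2 bytes per BF16, ~3.35 TB/s HBM
print(ops_to_byte_ratio(494e12, 2.0, 3.35e12))   # ~295
# B200: ~9 PFLOPS FP4 -> ~4.5e15 multiplies/s, 0.5 bytes per FP4, ~8 TB/s HBM
print(ops_to_byte_ratio(4.5e15, 0.5, 8e12))      # ~281
```

Both land in the neighborhood of 300, which is the stability being described.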
And we can express this one as well.
This is a sparsity parameter.
And I might even phrase it slightly differently.
Let's solve for batch size in total.
So we're just moving this back over to the other side,
and we end up with: batch size needs to be bigger than approximately 300 times sparsity.
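Written out as an inequality (my formalization of the spoken step), with $B$ the batch size, $s$ the sparsity factor, and $R \approx 300$ the hardware ratio from above:
$$
\frac{B}{s} \;\gtrsim\; R
\quad\Longrightarrow\quad
B \;\gtrsim\; R \cdot s \;\approx\; 300 \, s
$$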
So for example, in DeepSeek, I activate 32 out of 256 experts.
So this would be like eight for DeepSeek.
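As a quick plug-in of those numbers as stated here (256 total experts with 32 active per token; the ~300 threshold is the hardware ratio from above, and the exact figures are illustrative):

```python
total_experts = 256
active_experts = 32                          # as stated in the conversation
sparsity = total_experts / active_experts    # 8.0

hardware_ratio = 300                         # the ~300 ops-to-byte ratio from above
min_batch = hardware_ratio * sparsity
print(sparsity, min_batch)                   # 8.0, roughly 2400 tokens to stay compute-bound
```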