Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So there are really two things I need to do in the compute. I need to multiply by all of the active parameters, and then I need to do some work on the attention. So for multiplying by all the active parameters: I have a certain batch size that I'm running, and then I've got a number of active parameters in my model. And then I'm just going to divide this by the compute throughput, which is the FLOPS of the chip. So this is a hardware constant. So this actually accounts for all of the compute time for all of the weight matrix multiplies.
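
A minimal sketch of that compute-time expression (the factor of 2 counts one multiply and one add per parameter; the chip numbers are illustrative assumptions, not measurements):

```python
# A minimal sketch of the compute-time model described above; the chip
# numbers below are illustrative assumptions, not measured values.

def compute_time_s(batch_size: int, active_params: float, chip_flops: float) -> float:
    """Lower bound on per-step time for the weight matrix multiplies.

    Each generated token touches every active parameter once, and each
    parameter costs roughly 2 FLOPs (one multiply plus one add).
    """
    return 2 * batch_size * active_params / chip_flops

# Example: batch of 64, 37B active parameters, a ~1 PFLOPS chip.
print(compute_time_s(64, 37e9, 1e15))  # ~0.0047 s (~4.7 ms) per decode step
```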


There's a little caveat here. We've sort of ignored the time to do any of the attention computation, but that in general will be quite small in comparison to this. So we'll ignore it.
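
A rough back-of-the-envelope check of that claim, using illustrative architecture numbers (the layer count, model dimension, and context length here are assumptions, not any particular model's config):

```python
# Rough per-token FLOP comparison: attention vs. weight matmuls.
# All architecture numbers here are illustrative assumptions.
n_layers, d_model, context_len = 60, 7168, 4096
active_params = 37e9

# QK^T scores plus attention-weighted values: roughly 4 * T * d per layer.
attn_flops = 4 * n_layers * context_len * d_model
matmul_flops = 2 * active_params  # weight multiplies, as above

print(attn_flops / matmul_flops)  # ~0.1: attention is a modest correction
```

At much longer contexts the attention term grows linearly and stops being negligible, which is consistent with reading the compute formula as a lower bound.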


Yeah, so I can motivate the batch at least a little bit. We will see exactly why batching is such a favorable optimization. But what will turn out to be the case is that if you do not batch together many users, the cost and the economics you get can be something like a thousand times worse than if you do batch many users together. And we'll be able to see that quite explicitly.


And then the number of active parameters: this is saying, for example, that if I look at a DeepSeek model, the DeepSeek V3 model has about 37 billion active parameters and roughly 700 billion total parameters. So we're focusing on just the ones that are active for a single token.
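
To make the active-versus-total distinction concrete, here is a small sketch applying the same formula (batch size and chip FLOPS are the same illustrative assumptions as before) to show what a dense model of the same total size would cost per step:

```python
# Only the *active* parameters enter the compute term. Using the rough
# DeepSeek V3 numbers from above, with illustrative hardware as before:
active_params, total_params = 37e9, 700e9
batch_size, chip_flops = 64, 1e15

moe_step = 2 * batch_size * active_params / chip_flops
dense_step = 2 * batch_size * total_params / chip_flops  # if all params were active
print(moe_step, dense_step)  # ~4.7 ms vs ~90 ms per decode step
```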


OK, so we modeled compute performance. I'm going to keep writing equals, but in all of these cases you can think of this time as being at least this much, and maybe there'll be some terms we've ignored.


On the memory side, what do we need to do with memory? We need to fetch all of the weights.
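
A hedged sketch of where this leads: per decode step you pay at least the larger of the compute time and the weight-fetch time, and the weight fetch is amortized over the whole batch. The bandwidth, precision, and chip numbers below are illustrative assumptions, but they make the roughly thousand-fold batching claim above explicit:

```python
# Pairing the memory-fetch time with the compute time above. All
# hardware numbers are illustrative assumptions, not measurements.

def step_time_s(batch_size, active_params, total_params,
                chip_flops, mem_bandwidth, bytes_per_param=2):
    """Per decode step you pay at least max(compute, weight fetch)."""
    compute = 2 * batch_size * active_params / chip_flops
    weight_fetch = total_params * bytes_per_param / mem_bandwidth
    return max(compute, weight_fetch)

# Per-user, per-token time: the weight fetch is shared across the batch.
for batch in (1, 64, 1024):
    t = step_time_s(batch, 37e9, 700e9, chip_flops=1e15, mem_bandwidth=3e12)
    print(batch, t / batch)
# batch=1:    ~0.47 s per token per user (memory bound)
# batch=1024: ~0.00046 s per token per user -- about 1000x cheaper
```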