Reiner Pope
So there are really two things I need to do on the compute side.
I need to multiply by all of the active parameters, and then I need to do some work on the attention.
So for multiplying by all the active parameters: I have a certain batch size that I'm running, times the number of active parameters in my model, and then I'm just going to divide that by the compute throughput, which is the FLOPS of the chip.
So this is a hardware constant.
So this actually accounts for all of the compute time for all of the weight matrix multiplies.
There's a little caveat here.
We've sort of ignored the time to do any of the attention computation, but that in general will be quite small in comparison to this.
So we'll ignore this.
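As a rough sketch of what's being described here (my notation and hardware numbers, not the speaker's), the compute side is just batch size times active parameters divided by the chip's throughput; a common convention counts 2 FLOPs per parameter per token, one for the multiply and one for the add.

```python
# Minimal back-of-envelope sketch of the compute-time estimate.
# The hardware constant below is an assumed example, not a quoted figure.

def compute_time_s(batch_size, active_params, chip_flops):
    """Time for the weight matrix multiplies in one forward step.

    Uses the common convention of 2 FLOPs (multiply + add) per active
    parameter per token in the batch.
    """
    total_flops = 2 * batch_size * active_params
    return total_flops / chip_flops

# Assumed example: ~1e15 FLOP/s of dense low-precision throughput,
# roughly the order of a modern datacenter accelerator.
CHIP_FLOPS = 1e15
```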
Yeah, so I can motivate the batch at least a little bit.
So, I mean, we will see exactly why batching is such a favorable optimization.
But what will turn out to be the case is that if you do not batch together many users, the cost and the economics you get can be like a thousand times worse than if you do batch many users together.
And we'll be able to see that quite explicitly.
And then the number of active parameters, this is saying: if I look at, for example, a DeepSeek model, the DeepSeek V3 model has about 37 billion active parameters out of roughly 700 billion total parameters.
So we're focusing on just the ones that are active for a single token.
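Plugging those numbers into the sketch above (with the assumed 1e15 FLOP/s chip): at batch size 1, that's 2 × 37e9 ≈ 74 GFLOPs per token, or only about 74 microseconds of compute per step.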
OK, so we modeled compute performance.
I'm going to keep writing equals, but in all of these cases, you can think of this time as being at least this much.
And maybe there'll be some terms we ignored.
On the memory side, what do we need to do with memory?
We need to fetch all of the weights.
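A matching sketch for the memory side, under assumed numbers (the precision, bandwidth, and parameter count here are illustrative, not from the talk): every weight has to be streamed from memory once per forward step, and the batch size does not appear in this term, which is the effect foreshadowed above that makes batching so favorable.

```python
# Minimal sketch of the memory side: all weights are fetched from HBM
# once per forward step, regardless of batch size. Numbers are assumed.

def memory_time_s(total_params, bytes_per_param, memory_bandwidth):
    """Time to stream all weights from memory for one forward step."""
    total_bytes = total_params * bytes_per_param
    return total_bytes / memory_bandwidth

# Assumed example: ~700e9 weights at 1 byte each (8-bit weights) and
# ~3e12 bytes/s of HBM bandwidth.
print(memory_time_s(700e9, 1, 3e12))  # ~0.23 s per step, independent of batch
```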