
Reiner Pope

👤 Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

So I'll just make that claim and move on. So we're going to say the total cost is the cost of training plus the cost of inference. We want to equalize these.
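In symbols (my notation, not the speaker's), the quantity being set up is:

```latex
C_{\text{total}} = C_{\text{train}} + C_{\text{inference}},
\qquad \text{aiming for } C_{\text{train}} \approx C_{\text{inference}}
```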

We'll do pre-training only first because it's a little... Well, actually, we can do all of it in general. So actually, we'll customize.

Cost of pre-training: the number of active params times the amount of pre-training data. So that's the cost of pre-training. There's a factor of six out here, which gives the number of FLOPs. This is the famous 6ND formula.
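A quick sketch of the 6ND rule; the function name and the example numbers are mine, purely illustrative:

```python
def pretrain_flops(n_active_params: float, n_tokens: float) -> float:
    """Approximate pre-training compute with the 6ND rule:
    roughly 6 FLOPs per active parameter per training token
    (2 for the forward pass, 4 for the backward pass)."""
    return 6 * n_active_params * n_tokens

# Illustrative numbers (my assumption, not from the talk):
# 70B active parameters trained on 15T tokens.
cost = pretrain_flops(70e9, 15e12)
print(f"{cost:.2e} FLOPs")  # 6.30e+24 FLOPs
```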

And then in RL, we have approximately the same thing. We've got the same number of active parameters, but now the amount of data is the RL data. There's this extra efficiency multiplier, or rather, an inefficiency multiplier. Well, yeah, there's that. And then the other, perhaps even bigger inefficiency is that this involves a substantial amount of decode, and decode often runs at lower MFU than training.

This could be somewhere... it would at least be 2. Yeah, somewhere in the range of 2 to 6. So we'll just say somewhere in the range of 2 to 6 and leave it at that.
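Putting the two terms together as a sketch: the RL term has the same 6ND shape, scaled by the 2-to-6 inefficiency multiplier above. The function name, defaults, and example numbers are my assumptions:

```python
def total_train_flops(n_active_params: float, d_pretrain: float,
                      d_rl: float, inefficiency: float = 2.0) -> float:
    """Total training compute: 6ND for pre-training, plus the same
    6ND term for RL data scaled by an inefficiency multiplier
    (roughly 2 to 6), since RL involves a lot of decode and decode
    runs at lower MFU than training."""
    pretrain = 6 * n_active_params * d_pretrain
    rl = 6 * n_active_params * d_rl * inefficiency
    return pretrain + rl

# Illustrative numbers (my assumptions): 70B active params,
# 15T pre-training tokens, 1T RL tokens.
low = total_train_flops(70e9, 15e12, 1e12, inefficiency=2.0)   # ~7.1e24
high = total_train_flops(70e9, 15e12, 1e12, inefficiency=6.0)  # ~8.8e24
```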

Yeah.