Reiner Pope

Speaker
1157 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

...1.3B dense, yeah. Across these results you end up seeing that, for example, in this case the 64-expert, 370-million-activated-parameter model is as good as a dense 1.3 billion model. So in some sense it's actually not amazing returns, where you need to increase total parameters a hundredfold to get the equivalent of...
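A hedged back-of-envelope sketch of that activated-versus-total comparison. The attention/FFN split, top-1 routing, and the specific numbers below are illustrative assumptions, not figures from the episode:

```python
# Back-of-envelope MoE parameter accounting (illustrative numbers, not from the episode).
# Assume a model's parameters split into attention params (shared across experts)
# and FFN params (replicated per expert). With E experts and top-k routing:
#   activated = attn + k * ffn_per_expert
#   total     = attn + E * ffn_per_expert

def moe_param_counts(attn_params, ffn_per_expert, num_experts, top_k=1):
    activated = attn_params + top_k * ffn_per_expert
    total = attn_params + num_experts * ffn_per_expert
    return activated, total

# Hypothetical split for a ~370M-activated model: assume roughly 2/3 of the
# activated parameters sit in the FFN and routing is top-1 (Switch-style).
attn, ffn = 120e6, 250e6
act, tot = moe_param_counts(attn, ffn, num_experts=64, top_k=1)
print(f"activated ~ {act / 1e6:.0f}M, total ~ {tot / 1e9:.1f}B, ratio ~ {tot / act:.0f}x")
# Under these assumptions the 64-expert model carries ~16B total parameters,
# tens of times its activated count, while matching roughly a dense 1.3B model.
```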

Yeah, I mean, actually, even more so, yeah.

It's a huge increase in parameter count for a modest increase in... Yeah, so in this case, actually, what is it, 4x?

64x for 4x.

So is that good or bad, actually?

Even from a memory point of view, keep in mind you are doubling this portion of the memory fetches, which is amortized by batch.
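A sketch of that memory-bandwidth argument, assuming decode cost is dominated by fetching weights from HBM once per step and sharing that read across the batch (the numbers are illustrative, not from the episode):

```python
# Memory-bandwidth view of decode (a sketch, not the speakers' exact model):
# each step, the weights that get used must be read from HBM once, and that
# read is shared by every token in the batch that uses them.

def weight_bytes_per_token(params_read, batch_size, bytes_per_param=2):
    """HBM weight traffic attributed to one token, amortized over the batch."""
    return params_read * bytes_per_param / batch_size

# Illustrative numbers: a dense 1.3B model versus a hypothetical ~16B-total MoE,
# assuming the batch is large enough that every expert is hit each step.
dense = weight_bytes_per_token(1.3e9, batch_size=256)
moe = weight_bytes_per_token(16e9, batch_size=256)
print(f"dense: {dense / 1e6:.1f} MB/token, MoE: {moe / 1e6:.1f} MB/token")
# Doubling the parameters fetched doubles this term, but running a proportionally
# larger batch brings the per-token cost straight back down.
```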

And so you just keep running at a larger batch size.

From the point of view of the analysis we've done here, this is pure win.

Keep doing it.

Keep doing it until you run out of available users, basically.

So there's actually this equivalence between...

If I want to go sparse, or if I have a lot of users, I can go to a much sparser model.
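A minimal sketch of that equivalence, assuming per-token serving cost in the memory-bound regime scales as total parameters divided by batch size:

```python
# Sparsity <-> batch-size equivalence in the memory-bound regime (sketch,
# assuming per-token cost ~ total_params / batch_size).

def per_token_weight_cost(total_params, batch_size):
    return total_params / batch_size

base = per_token_weight_cost(total_params=1.3e9, batch_size=64)
scaled = per_token_weight_cost(total_params=1.3e9 * 8, batch_size=64 * 8)
assert base == scaled  # 8x the total parameters at 8x the batch: same per-token fetch cost
# So with enough users (a large enough batch), a much sparser model serves at
# the same per-token weight-fetch cost as the smaller dense one.
```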

So from that point of view, it's a reasonable trade-off.

The other trade-off that shows up here is memory capacity: so far we've only reasoned about memory bandwidth, but it also consumes memory capacity.
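A sketch of the capacity side, assuming 80 GB of HBM per GPU and 16-bit weights (both assumptions for illustration, not from the episode):

```python
# Memory-capacity side of the trade-off (sketch; 80 GB HBM and 16-bit weights
# are assumptions for illustration).

def hbm_fraction(total_params, bytes_per_param=2, hbm_bytes=80e9):
    """Fraction of one GPU's HBM consumed just by storing the weights."""
    return total_params * bytes_per_param / hbm_bytes

print(f"dense 1.3B: {hbm_fraction(1.3e9):.0%} of an 80 GB GPU")
print(f"MoE ~16B total: {hbm_fraction(16e9):.0%} of an 80 GB GPU")
# Even though the MoE activates few parameters per token, all of its experts
# still have to sit in HBM, squeezing the capacity left for KV cache and batch.
```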

Yeah, so, I mean, maybe this would be a good point to actually talk about how a mixture-of-experts layer is typically laid out on a rack of GPUs or something like that.

Yeah, yeah, makes sense.

Yeah, where were we?

Sparse mixture of experts.

Yes.

Maybe how we lay that out on a GPU.
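As a hedged sketch ahead of that discussion, one common layout (not necessarily the one described in the episode) is expert parallelism, where each GPU owns a slice of the experts and tokens are routed to whichever GPU holds their expert; the expert and GPU counts below are illustrative assumptions:

```python
# Expert parallelism, sketched: each GPU owns a contiguous slice of the experts,
# and tokens are sent (all-to-all) to whichever GPU holds their routed expert.
# Expert and GPU counts are illustrative assumptions.

NUM_EXPERTS = 64
NUM_GPUS = 8
EXPERTS_PER_GPU = NUM_EXPERTS // NUM_GPUS

def expert_to_gpu(expert_id: int) -> int:
    """Contiguous sharding: experts 0-7 on GPU 0, 8-15 on GPU 1, and so on."""
    return expert_id // EXPERTS_PER_GPU

assignment = {e: expert_to_gpu(e) for e in range(NUM_EXPERTS)}
print(assignment[13], assignment[63])  # expert 13 -> GPU 1, expert 63 -> GPU 7
```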