Reiner Pope

👤 Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

And so all of those are competing; modern racks are pushing all of them to very extreme physical limits.

Yeah.

Deploying larger scale-up domains is a huge unlock.

I mean, I've drawn here the sort of NVIDIA Blackwell deployment.

The Google deployment has actually had very large scale-up domains.

Not having been there at the time, I'm not sure how much is coming from successfully deploying higher sparsity ratios, which it could be.
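
For context, the sparsity ratio here can be read as total parameters over active parameters (one common definition). A quick illustrative calculation, with assumed numbers rather than figures from the episode:

```python
# Sparsity ratio of a mixture-of-experts model (illustrative numbers).
total_params = 640e9    # assumed total parameter count
active_params = 40e9    # assumed parameters activated per token
print(f"sparsity ratio: {total_params / active_params:.0f}x")  # -> 16x
```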

It could also be... I mean, there's a whole bunch of actual modeling things, like...

Specifically, how do you do the mixture of experts?

We've seen with the DeepSeek mixture of experts that actually activating more, but finer-grained, experts was a big innovation.
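
A rough sketch of the fine-grained idea (illustrative sizes, not DeepSeek's actual configuration): splitting each expert into smaller pieces and activating proportionally more of them keeps the active parameter count fixed while giving the router far more expert combinations to choose from.

```python
# Sketch of "finer-grained experts" (hypothetical sizes, not DeepSeek's config).
from math import comb

d_model, d_ff = 4096, 16384  # assumed hidden and FFN widths

def moe_config(n_experts, top_k, granularity):
    """Split each expert into `granularity` smaller experts and activate
    `granularity * top_k` of them, so active parameters stay constant."""
    n = n_experts * granularity
    k = top_k * granularity
    expert_params = 2 * d_model * (d_ff // granularity)  # up + down projection
    return dict(experts=n, active_experts=k,
                active_params=k * expert_params,
                routing_choices=comb(n, k))

print(moe_config(n_experts=16, top_k=2, granularity=1))  # coarse experts
print(moe_config(n_experts=16, top_k=2, granularity=4))  # finer-grained
```

Both configurations activate the same number of parameters per token, but the fine-grained one has vastly more ways to combine experts.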

I'm sure that there are many other innovations on the model architecture as well as on the training data.

It's kind of hard to disentangle all of them, but what shows up in terms of the limits of what you can do:

The active parameters, as we saw, are limited by the compute cost.

And then the total parameters are limited by the scale-up size.
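
A back-of-the-envelope sketch of those two limits (all numbers are assumptions for illustration, not figures from the episode):

```python
# Two limits on MoE scaling (illustrative, assumed numbers).

# Limit 1: active parameters set the compute cost. A decoder forward pass
# costs roughly 2 FLOPs per active parameter per token.
active_params = 40e9                       # assumed active parameters
flops_per_token = 2 * active_params        # ~80 GFLOPs per generated token

# Limit 2: total parameters must fit in the memory of one scale-up domain
# (ignoring KV cache and activations for simplicity).
gpus_in_domain = 72                        # e.g. an NVL72-style rack (assumed)
hbm_per_gpu_gb = 192                       # assumed HBM per GPU
bytes_per_param = 1                        # FP8 weights (assumed)
max_total_params = gpus_in_domain * hbm_per_gpu_gb * 1e9 / bytes_per_param

print(f"compute per token: {flops_per_token / 1e9:.0f} GFLOPs")
print(f"total-parameter ceiling: {max_total_params / 1e12:.1f}T params")
```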

Yeah, really interesting.

Okay, so to answer that question, we're going to need to talk about the communication patterns.

So we've talked about the mixture-of-experts communication pattern.

That is this all-to-all.

This all-to-all.

All-to-all.

All-to-all very strongly favors full connectivity, which is what we've kind of just shown here, and favors being within one rack.
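
A minimal sketch of why the all-to-all favors a fully connected, single-rack domain, assuming uniform routing and illustrative sizes: every device exchanges token activations with every other device on every MoE layer, once to dispatch tokens to their experts and once to combine the results.

```python
# Per-device all-to-all traffic in MoE dispatch/combine (assumed numbers).
devices = 64                  # devices sharing the experts (assumed)
tokens_per_device = 8192      # local batch of tokens (assumed)
d_model = 4096                # activation width (assumed)
bytes_per_act = 2             # bf16 activations
top_k = 8                     # experts activated per token (assumed)

# Each routed copy of a token is d_model activations; with uniform routing,
# a fraction (devices - 1) / devices of the copies leave the local device.
routed_bytes = tokens_per_device * top_k * d_model * bytes_per_act
off_device = routed_bytes * (devices - 1) / devices

# x2: once for dispatch (tokens -> experts), once for combine (results back).
print(f"all-to-all bytes per device per MoE layer: {2 * off_device / 1e9:.2f} GB")
```

Because this traffic flows between every pair of devices at once, it rewards the full, uniform bandwidth of a single rack's scale-up fabric rather than an oversubscribed multi-rack network.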