Reiner Pope
And so all of those are competing; modern racks are pushing all of them to very extreme physical limits.
Yeah.
Deploying larger scale-up domains is a huge unlock.
I mean, I've drawn here the sort of NVIDIA Blackwell deployment.
The Google deployment has actually had very large scale-up domains.
Not having been there at the time, I'm not sure how much is coming from successfully deploying higher sparsity ratios, which it could be.
It could also be... I mean, there's a whole bunch of actual modeling things, like...
Specifically, how do you do the mixture of experts?
We've seen with the DeepSeek mixture of experts that activating more but finer-grained experts was a big innovation.
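To make that concrete, here is a minimal sketch (not DeepSeek's actual code) of what "more but finer-grained experts" means for routing: top-k gating over many small experts keeps the per-token active compute roughly constant while growing total capacity. All sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def route(tokens, num_experts, k, expert_params):
    """Top-k gating: pick k experts per token from a softmax over expert logits."""
    w_gate = rng.normal(size=(tokens.shape[1], num_experts))
    logits = tokens @ w_gate
    gates = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    topk = np.argsort(gates, axis=-1)[:, -k:]        # expert ids chosen per token
    return topk, k * expert_params, num_experts * expert_params

tokens = rng.normal(size=(4, 16))                    # 4 tokens, d_model = 16

# Coarse: top-2 of 8 large experts. Fine-grained: top-8 of 64 experts,
# each a quarter the size. Active params per token are identical; total
# capacity (and routing flexibility) is larger in the fine-grained case.
_, active_c, total_c = route(tokens, num_experts=8,  k=2, expert_params=1024)
_, active_f, total_f = route(tokens, num_experts=64, k=8, expert_params=256)
print(active_c, total_c)   # 2048 8192
print(active_f, total_f)   # 2048 16384
```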
I'm sure that there are many other innovations on the model architecture as well as on the training data.
It's kind of hard to disentangle all of them, but here's what shows up in terms of the limits of what you can do.
The active parameters, as we saw, are limited by the compute cost.
And then the total parameters are limited by the scale-up size.
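As a back-of-envelope illustration of those two limits, here is a sketch with assumed hardware numbers (chip FLOP/s, HBM capacity, domain size); none of these are real deployment figures.

```python
# Limit 1: active parameters vs. compute cost.
# Decode FLOPs per token ~= 2 * active_params (standard approximation).
chip_flops   = 1e15          # ~1 PFLOP/s per chip (assumed)
num_chips    = 72            # one rack-scale domain (assumed)
tokens_per_s = 2e5           # target aggregate decode throughput (assumed)
max_active = chip_flops * num_chips / (2 * tokens_per_s)
print(f"active params bound: {max_active:.2e}")   # ~1.8e11

# Limit 2: total parameters vs. scale-up size.
# All weights must fit in the HBM of one scale-up domain.
hbm_per_chip = 192e9         # bytes of HBM per chip (assumed)
bytes_per_p  = 2             # bf16 weights
max_total = num_chips * hbm_per_chip / bytes_per_p
print(f"total params bound:  {max_total:.2e}")    # ~6.9e12
```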
Yeah, really interesting.
Okay, so to answer that question, we're going to need to talk about the communication patterns.
So we've talked about the mixture-of-experts communication pattern.
That is this all-to-all.
All-to-all very strongly favors full connectivity, which is what we've just shown here, and it favors being within one rack.
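Here is a toy simulation of why the pattern wants full connectivity: with experts sharded across devices, each device ends up sending tokens to essentially every other device on every MoE layer. The setup (one expert shard per device, uniform routing) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_devices = 8
tokens_per_device = 16

# Expert id per token; experts are sharded one-per-device here (assumed).
routing = rng.integers(0, num_devices, size=(num_devices, tokens_per_device))

# traffic[src, dst] = number of tokens device `src` must send to device `dst`
traffic = np.zeros((num_devices, num_devices), dtype=int)
for src in range(num_devices):
    for dst in range(num_devices):
        traffic[src, dst] = np.sum(routing[src] == dst)

print(traffic)
# Most entries are nonzero: each device talks to nearly every other device
# on every layer, so the pattern rewards a fully connected scale-up domain.
print("pairs with traffic:", np.count_nonzero(traffic), "of", num_devices**2)
```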