Reiner Pope
The KV cache becomes the dominant term.
Yeah, you only need to store like one layer rather than two layers of KVs, right?
Yeah.
So it helps from that perspective.
Yeah.
What's competing with that, though, is that you need to be keeping all of the racks usefully busy at a time.
And so the number of sequences that are in flight simultaneously has gone up.
Yeah, yeah, yeah.
Makes sense, makes sense, makes sense.
So those exactly cancel, and you end up not getting a saving per GPU.
Well, so first we said you can't amortize KV caches across batch size.
And now we're saying you also can't shard it across pipeline stages.
It sucks from both of those points of view.
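That cancellation can be sketched with a rough memory model (the sizes below are hypothetical, just to make the arithmetic concrete): each pipeline stage holds fewer layers, but keeping every stage busy means more sequences in flight, and the two effects multiply out to the same per-GPU KV footprint.

```python
def kv_bytes_per_gpu(pipeline_stages, layers=64, seq_len=4096,
                     kv_heads=8, head_dim=128, bytes_per_elem=2,
                     seqs_per_stage=32):
    """Rough per-GPU KV-cache footprint under pipeline parallelism.

    All model sizes here are illustrative placeholders, not any
    particular model's configuration.
    """
    # Each pipeline stage only holds a slice of the layers...
    layers_per_gpu = layers // pipeline_stages
    # ...but to keep all stages usefully busy, the number of
    # sequences in flight grows with the number of stages.
    in_flight = seqs_per_stage * pipeline_stages
    # K and V, per token, per layer.
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_elem
    return in_flight * seq_len * layers_per_gpu * per_token_per_layer

# Doubling pipeline depth halves the layers per GPU but doubles the
# sequences in flight: the per-GPU KV-cache bytes come out identical.
assert kv_bytes_per_gpu(4) == kv_bytes_per_gpu(8) == kv_bytes_per_gpu(16)
```

The `pipeline_stages` terms cancel exactly in the product, which is the point being made: pipelining doesn't shard the KV cache in any useful sense.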
Yeah, yeah, yeah.
Interesting.
Okay, so then what does that look like during inference?
So, I mean, the DeepSeek paper reports what they do, which is they just do a lot of expert parallelism.
In effect, you should increase your expert parallelism up to your scale-up domain size, and then do very little pipelining.
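A minimal sketch of that layout rule (hypothetical GPU counts and expert counts, not the DeepSeek configuration itself): fill the scale-up domain with expert parallelism, so the all-to-all expert dispatch stays on the fast interconnect, and use whatever factor remains for pipelining across domains.

```python
def parallelism_plan(total_gpus, scale_up_domain, num_experts):
    """Illustrative layout: expert parallelism first, pipelining last.

    Assumes expert parallelism is capped by the scale-up domain size,
    the expert count, and the total GPU count; the leftover factor
    becomes pipeline stages across domains.
    """
    expert_parallel = min(scale_up_domain, num_experts, total_gpus)
    pipeline_stages = max(1, total_gpus // expert_parallel)
    return {"expert_parallel": expert_parallel,
            "pipeline_stages": pipeline_stages}

# e.g. 128 GPUs in scale-up domains of 64, with 256 routed experts:
plan = parallelism_plan(128, 64, 256)
# expert parallelism fills the domain (64-way); only 2 pipeline stages remain
```

The design choice this encodes is the one from the conversation: since pipelining buys you no per-GPU KV-cache saving anyway, push the parallelism that does help (expert parallelism) as far as the fast interconnect reaches, and keep the pipeline shallow.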