Reiner Pope
Maybe none at all, maybe two, just enough to make the weight storage not too big of an issue.
Those are the only two parallelisms that really make sense.
In the past, there was tensor parallelism, which was cutting up within an expert, but the experts are so small now that that is not a profitable optimization.
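(A back-of-envelope sketch of why slicing within an expert buys little once experts are small. All sizes here are hypothetical, not figures from the conversation.)

```python
# Why tensor parallelism within an expert stops paying off when experts
# are small. Every number below is a hypothetical illustration.

BYTES_PER_PARAM = 2                      # e.g. bf16 weights
d_model = 8192                           # hypothetical model width
d_ff = 4 * d_model                       # hypothetical expert FFN width
params_per_expert = 2 * d_model * d_ff   # up- and down-projection matrices

expert_bytes = params_per_expert * BYTES_PER_PARAM
chip_hbm_bytes = 96e9                    # hypothetical per-chip HBM

print(f"one expert: {expert_bytes / 1e9:.2f} GB")
print(f"fraction of one chip's HBM: {expert_bytes / chip_hbm_bytes:.2%}")

# If a whole expert occupies only ~1% of a chip's memory, slicing it
# further adds communication without relieving any real constraint,
# so placing whole experts per chip (expert parallelism) is the
# natural split.
```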
Yes.
Yeah, I mean, you can look at how it depends on model size.
Like, you could have a very large model, like one that exceeds the memory of a rack, and there you should be doing a bit of pipelining.
Maybe it's extremely sparse, for example, and that would be a reason to do it.
Actually, so pipelining doesn't help with context length.
It totally helps with model size.
And so because of the ability to do pipelining, at least a rack should not be a constraint on your ability to fit the model parameters.
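(A rough sketch of that claim: pipeline stages as the escape hatch when weights exceed one rack. All capacities below are assumptions for illustration.)

```python
import math

# Pipeline parallelism lets weight storage span multiple racks, so a
# single rack's memory need not bound model size. Hypothetical numbers.

model_params = 10e12             # hypothetical 10T-parameter model
bytes_per_param = 2              # bf16
model_bytes = model_params * bytes_per_param

chips_per_rack = 64              # hypothetical rack configuration
hbm_per_chip = 96e9
rack_bytes = chips_per_rack * hbm_per_chip

# Number of pipeline stages (racks) needed just to hold the weights.
stages = math.ceil(model_bytes / rack_bytes)
print(f"model: {model_bytes / 1e12:.1f} TB, rack: {rack_bytes / 1e12:.2f} TB")
print(f"pipeline stages needed: {stages}")

# Note: this only covers weight storage. KV-cache memory grows with
# context length and batch size, which pipelining does not address.
```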
I guess the other consideration you're asking about is: why hasn't it scaled up more, and why would bigger scale-up domains help?
So we talked through one aspect of that: we said it's not because of memory capacity.
Pipelining gives us a solution to memory capacity, at least with respect to model size, though not with respect to KV cache size.
The other issue that shows up is latency.
This is very much dependent on the hardware.
It's... I can't say with a lot of authority.
I think it's probably on the order of a few milliseconds, but I could be off by an order of magnitude.
Yeah.
Okay, so that's not that much.
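(A quick sanity check on "not that much": comparing a few milliseconds of latency against the per-token budget implied by an interactive generation speed. Both numbers are assumptions, not figures from the conversation.)

```python
# How a few milliseconds compares to the time budget per generated token.

tokens_per_second = 50           # hypothetical interactive decode speed
per_token_budget_ms = 1000 / tokens_per_second   # 20 ms per token

latency_ms = 3                   # "a few milliseconds", per the discussion
share = latency_ms / per_token_budget_ms

print(f"per-token budget: {per_token_budget_ms:.0f} ms")
print(f"latency share of budget: {share:.0%}")   # ~15% under these numbers
```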