Yeah, right, right.
More usefully, you can do the weight gradient step, but you can also mine Bitcoin.
In inference, the effect of pipelining on anything you care about, like batch size or latency, is actually neutral.
It doesn't improve it.
It doesn't make it worse.
So if you look at the latency of this inference when it's pipelined versus when it's all on one rack: if it were all on one rack, we would just slide all of the boxes down and still put them in a row, and the latency would be the same.
So pipelining is neither better nor worse for latency. But it does mean that you use less memory per rack, in terms of memory capacity, because now instead of needing the whole model, you only need a quarter of the model. So it's basically a no-brainer to use pipelining during inference.
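To make the latency-neutrality argument concrete, here's a minimal back-of-the-envelope sketch in Python. It assumes decode is memory-bandwidth-bound, so the time per stage is just the weight bytes streamed divided by memory bandwidth; the weight size, per-rack bandwidth, and pipeline depth below are illustrative placeholders, not real hardware specs.

```python
# Per-token decode latency with and without pipelining, assuming decode is
# memory-bandwidth-bound: time = weight bytes streamed / memory bandwidth.
# All numbers below are illustrative assumptions, not real hardware specs.

weight_bytes = 1e12        # ~1 TB of weights (e.g. ~1T params at 1 byte each)
rack_bandwidth = 500e12    # assumed aggregate HBM bandwidth of one rack, bytes/s
stages = 4                 # pipeline depth (number of racks)

# One rack: it streams all of the weights for every token.
latency_one_rack = weight_bytes / rack_bandwidth

# Pipelined: each rack streams only 1/stages of the weights, but a token
# passes through all stages in sequence, so the per-stage times add back up.
latency_pipelined = stages * (weight_bytes / stages) / rack_bandwidth

print(latency_one_rack, latency_pipelined)       # identical: latency is unchanged
print(weight_bytes / stages / 1e12, "TB/rack")   # but each rack holds only 1/stages of the weights
```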
That said, even in inference, it's actually not used a ton. It does reduce your memory capacity requirements, but there's actually already a huge surplus of memory capacity.
I think you're saying that a rack of Blackwell has many, many terabytes, maybe tens of terabytes.
That's much bigger than a trillion-parameter model. A trillion-parameter model only needs about one terabyte at one byte per parameter.
And so it already fits, in fact.
And so there's not a huge benefit from pipelining, because you're reducing a number that's already pretty small.
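As a sanity check on the "it already fits" claim, here's the arithmetic with assumed rack specs; the 72-GPU count and per-GPU HBM figure are placeholders in the spirit of a Blackwell NVL72-class rack, not quoted specifications.

```python
# Does a trillion-parameter model fit in one rack? Illustrative numbers only.

params = 1e12                 # 1T parameters
bytes_per_param = 1           # ~1 byte per param at 8-bit inference precision (assumed)
weight_bytes = params * bytes_per_param     # = 1 TB of weights

gpus_per_rack = 72            # assumed, NVL72-style rack
hbm_per_gpu = 192e9           # assumed ~192 GB of HBM per GPU
rack_hbm = gpus_per_rack * hbm_per_gpu      # ~13.8 TB per rack

print(weight_bytes / 1e12, "TB of weights")     # 1.0
print(rack_hbm / 1e12, "TB of HBM per rack")    # ~13.8 -> weights fit with a large surplus
```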
But it does suggest that theoretically you had too much memory, and maybe you could have built different hardware that has less memory.
If you were designing your hardware and you said, I actually didn't need that much memory, because I don't need the weights to fit in one rack, I can fit the weights across eight racks, then maybe I could have built hardware that didn't have so much HBM per GPU.
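Continuing the same assumed numbers, here's what pipelining across eight racks does to the per-GPU HBM that the weights alone demand; again a sketch, ignoring KV cache and activations.

```python
# Per-GPU HBM demanded by the weights alone, for 1 rack vs 8 racks.
# Continues the assumed numbers above; ignores KV cache and activations.

weight_bytes = 1e12     # ~1 TB of weights (assumed)
gpus_per_rack = 72      # assumed

for racks in (1, 8):
    per_gpu = weight_bytes / (racks * gpus_per_rack)
    print(f"{racks} rack(s): {per_gpu / 1e9:.1f} GB of weights per GPU")

# 1 rack(s): 13.9 GB of weights per GPU
# 8 rack(s): 1.7 GB of weights per GPU -> far below the ~192 GB assumed per GPU
```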
Yeah, so in the equations we had here before we erased them, we were looking at memory time and compute time, so memory bandwidth and compute bandwidth.
Let's now start looking at memory capacity.