Reiner Pope

👤 Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yeah, right, right.

More usefully, you can do the weight gradient step, but you can also mine Bitcoin.

In inference, the effect of pipelining on anything you care about, like batch size or latency, is actually neutral. It doesn't improve it, and it doesn't make it worse.

So compare the latency of this inference running pipelined versus all on one rack: if it were all on one rack, we would just slide all of the boxes down and still put them in a row, and the latency would be the same.
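
A minimal sketch of that "slide the boxes down" picture, in Python with hypothetical layer counts and times (none of the numbers are from the episode): per-token latency is just the sum of per-layer times, and that sum doesn't change when the layers are split across pipeline stages.

    # Per-token latency is the sum of per-layer times, whether the layers
    # sit on one rack or are split across stages (ignoring inter-stage hops).
    layer_times_us = [500] * 96   # hypothetical: 96 layers at 500 microseconds each

    one_rack_latency = sum(layer_times_us)

    stages = 4
    per_stage = len(layer_times_us) // stages
    pipelined_latency = sum(sum(layer_times_us[s * per_stage:(s + 1) * per_stage])
                            for s in range(stages))

    assert one_rack_latency == pipelined_latency   # neither better nor worse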

So pipelining is neither better nor worse for latency, but it does mean you use less memory capacity per rack, because instead of needing the whole model you only need a quarter of the model. So it's basically a no-brainer to use pipelining during inference.
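
And a sketch of the memory side, assuming a hypothetical trillion-parameter model at one byte per parameter (FP8) split into four pipeline stages; the function and the numbers are illustrative, not from the episode.

    # Each pipeline stage holds only its slice of the layers, so the
    # weight footprint per rack shrinks by the number of stages.
    def weight_bytes_per_rack(n_params, bytes_per_param, pipeline_stages):
        return n_params * bytes_per_param / pipeline_stages

    whole_model = weight_bytes_per_rack(1e12, 1, 1)   # one rack holds 1.00 TB
    pipelined = weight_bytes_per_rack(1e12, 1, 4)     # four stages: 0.25 TB per rack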

So even in inference, it's actually not used a ton.

It reduces your memory capacity requirements.

There's actually a huge surplus.

I think you're saying that a rack of Blackwell has many, many terabytes, maybe tens of terabytes.

That's much bigger than a trillion-parameter model. A trillion-parameter model only needs one terabyte. And so it already fits, in fact.
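
The arithmetic behind that, assuming one byte per parameter (FP8 precision is an assumption here; the episode only gives the totals):

    10^12 parameters × 1 byte/parameter = 10^12 bytes = 1 TB

so a rack with tens of terabytes of HBM holds the weights many times over.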

And so there's not a huge benefit from pipelining, because you're reducing a number that's already pretty small.

But it does say that, theoretically, maybe you had too much memory, and maybe you could have built different hardware with less memory.

If you were designing your hardware and you said, I actually didn't need that much memory, because I don't need the weights to fit in one rack, I can fit the weights across eight racks, then maybe you could have built hardware that didn't have so much HBM per GPU.
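
A sketch of that trade-off with hypothetical numbers (the GPUs-per-rack figure and the precision are assumptions, and this counts weights only, not KV cache or activations):

    # Minimum HBM per GPU needed just for the weights, as a function of
    # how many racks the model is pipelined across.
    N_PARAMS = 1e12        # trillion-parameter model
    BYTES_PER_PARAM = 1    # FP8 (assumption)
    GPUS_PER_RACK = 72     # hypothetical rack size

    for racks in (1, 8):
        per_gpu = N_PARAMS * BYTES_PER_PARAM / (racks * GPUS_PER_RACK)
        print(f"{racks} rack(s): {per_gpu / 1e9:.1f} GB of weights per GPU")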

Yeah, so in the equations we had here before we erased them, we were doing memory time and compute time, so memory bandwidth and compute bandwidth.

Let's now start looking at memory capacity.
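
For reference, the bandwidth-side quantities being set aside here, as a minimal roofline-style sketch (this is the standard decomposition; the exact equations from earlier in the conversation aren't in this excerpt):

    # A step is limited by whichever is slower: streaming bytes at memory
    # bandwidth or doing FLOPs at compute bandwidth (assuming full overlap).
    def step_time(bytes_moved, flops, mem_bw, compute_bw):
        memory_time = bytes_moved / mem_bw      # seconds to move the data
        compute_time = flops / compute_bw       # seconds to do the math
        return max(memory_time, compute_time)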