Reiner Pope
Thanks.
You can think of this as a schedule for the train.
A new train departs every 20 milliseconds.
Any passengers who are ready board the train.
If the train is full, then they wait for the next train.
If the train is not full, the train's going to go anyway.
And so in terms of what that means for queuing latency, it means that the worst case is that a request arrives just after the train departed.
It has to wait for the next train, so that's up to 20 milliseconds, and then it has to wait for that train to complete.
And so the worst case latency is 40 milliseconds.
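As a rough sketch of that worst case, assuming the fixed 20-millisecond departure window and a 20-millisecond batch duration described above (the function name and parameters here are just illustrative):

```python
# Worst-case queuing latency under a fixed "train schedule" batching model.
# Assumes a batch (train) departs every `window` seconds and each batch takes
# `batch_time` seconds to complete; names are illustrative, not from the talk.

def worst_case_latency(window: float, batch_time: float) -> float:
    # A request that arrives just after a departure waits almost a full
    # window for the next train, then waits for that train to finish.
    return window + batch_time

if __name__ == "__main__":
    # 0.020 s wait for the next train + 0.020 s for it to complete = 0.040 s
    print(worst_case_latency(window=0.020, batch_time=0.020))  # 0.04 (40 ms)
```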
So how is the 20 milliseconds derived?
I mean, it's a rule of thumb, and I haven't fully explained where it comes from yet, but so far we've focused on memory bandwidth and compute time.
When we look at memory, the other consideration is that we want to use all of the memory capacity we have.
And so generally we're going to use all of that memory capacity to store the weights or the KVs.
And so, in the time of doing a forward pass, we want to read all of that memory capacity into the chip.
And so that is capacity divided by bandwidth.
That tends to be 20 milliseconds on many different generations of HBM.
Yeah.
So for example, I mean, on I think the Rubin generation, it is something like 288 gigabytes divided by 20 terabytes per second.
And this looks like it comes out to about 15 milliseconds.
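Working through that arithmetic with the approximate figures quoted above (288 GB of HBM at 20 TB/s; these are the speaker's rough numbers, not official specs):

```python
# Time to stream all of HBM through the chip once: capacity / bandwidth.
# Figures are the approximate ones quoted above and are illustrative only.

capacity_bytes = 288e9         # ~288 GB of HBM capacity
bandwidth_bytes_per_s = 20e12  # ~20 TB/s of HBM bandwidth

read_time_s = capacity_bytes / bandwidth_bytes_per_s
print(f"{read_time_s * 1e3:.1f} ms")  # ~14.4 ms, i.e. roughly 15 ms
```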