Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Thanks.

You can think of this as a schedule for the train. A new train departs every 20 milliseconds, and any passengers who are ready board it. If the train is full, they wait for the next train. If the train is not full, the train is going to go anyway.
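A minimal sketch of that scheduling policy, with purely illustrative numbers (the batch size and the simulation itself are assumptions for illustration, not anything specified in the conversation):

```python
DEPARTURE_INTERVAL_MS = 20  # a new train (batch) departs every 20 ms
MAX_BATCH_SIZE = 8          # hypothetical train capacity

def next_departure(arrival_ms: float) -> int:
    # A passenger boards the first train departing strictly after arrival.
    return (int(arrival_ms // DEPARTURE_INTERVAL_MS) + 1) * DEPARTURE_INTERVAL_MS

def serve(arrivals_ms):
    # Assign each arrival (in time order) to a departure slot. If a train
    # is full, the passenger waits for the next one; trains depart on
    # schedule whether full or not.
    loads = {}
    departures = []
    for t in sorted(arrivals_ms):
        dep = next_departure(t)
        while loads.get(dep, 0) >= MAX_BATCH_SIZE:
            dep += DEPARTURE_INTERVAL_MS  # full train: wait for the next one
        loads[dep] = loads.get(dep, 0) + 1
        departures.append(dep)
    return departures

# Arrivals at 1 ms and 19.9 ms catch the t=20 ms train; 20.1 ms waits for t=40 ms.
print(serve([1.0, 19.9, 20.1]))  # [20, 20, 40]
```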

In terms of what that means for queuing latency, the worst case is that a request arrives just after a train has departed. It has to wait up to 20 milliseconds for the next train, and then it has to wait for that train to complete. So the worst-case latency is 40 milliseconds.
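As a worked check of that arithmetic, assuming (as the rest of the discussion implies) that one forward pass also takes about one departure interval:

```python
DEPARTURE_INTERVAL_MS = 20
queueing_wait_ms = DEPARTURE_INTERVAL_MS  # arrive just after departure: wait a full interval
service_time_ms = DEPARTURE_INTERVAL_MS   # the train's own forward pass
print(queueing_wait_ms + service_time_ms)  # 40 ms worst case
```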

So how is the 20 milliseconds derived?

I mean, it's a rule of thumb, and where it comes from isn't fully explained yet. So far we've focused on memory bandwidth and compute time, but when we look at memory, the other consideration is that we want to use all of the memory capacity we have. Generally we're going to use all of that capacity to store the weights or the KV caches, and so in the time of doing a forward pass, we want to read all of the memory capacity into the chip. That time is capacity divided by bandwidth, and it tends to be about 20 milliseconds on many different generations of HBM.

Yeah.

So for example, on the Rubin generation, I think it's something like 288 gigabytes divided by 20 terabytes per second, and that comes out to about 15 milliseconds.
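The capacity-over-bandwidth rule of thumb with the Rubin-generation numbers quoted above, as a quick calculation (the helper function is just for illustration):

```python
def hbm_read_time_ms(capacity_gb: float, bandwidth_tb_per_s: float) -> float:
    # Time to stream the entire HBM capacity through the chip once.
    gb_per_ms = bandwidth_tb_per_s  # 1 TB/s = 1000 GB/s = 1 GB/ms
    return capacity_gb / gb_per_ms

print(hbm_read_time_ms(288, 20))  # 14.4 ms, i.e. "about 15 milliseconds"
```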