Reiner Pope

Speaker
1157 total appearances

Podcast Appearances

Dwarkesh Podcast
Reiner Pope – The math behind how LLMs are trained and served

Yeah, there's sort of two scenarios.

Why don't we pick a latency that is bigger than 15 milliseconds?

And if I think about what that means, it means I actually have time to read the HBM, like, twice.

Yep.

By the way, most of the HBM accesses are reads, not writes.

It's like, almost all reads, because the weight matrices are read-only, and then almost all of the KV cache accesses are reads.

So, let's say I run at 30 milliseconds, I can read all of HBM twice.
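A rough sanity check of that claim, as a minimal sketch. The HBM capacity and bandwidth below are illustrative assumptions, not figures quoted in the episode; only the 15 ms and 30 ms latency budgets come from the discussion.

```python
# Back-of-the-envelope: how many full sweeps of HBM fit in a decode-step
# latency budget? Capacity and bandwidth are assumed example values.

hbm_capacity_gb = 16       # assumed HBM size per chip (illustrative)
hbm_bandwidth_gbps = 1000  # assumed HBM bandwidth in GB/s (~1 TB/s, illustrative)

def hbm_sweeps(latency_ms: float) -> float:
    """Number of times the full HBM contents can be read within latency_ms."""
    readable_gb = hbm_bandwidth_gbps * latency_ms / 1000
    return readable_gb / hbm_capacity_gb

for latency_ms in (15, 30):
    print(f"{latency_ms} ms budget -> ~{hbm_sweeps(latency_ms):.1f} full HBM reads")
# With these assumed numbers: ~0.9 sweeps at 15 ms, ~1.9 sweeps at 30 ms,
# i.e. roughly one full read at 15 ms and two at 30 ms.
```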

But what's the point of that?

Like, I don't want to read the weight matrices twice.

I don't want to read the KVs twice.

I mean, sparsity shows up in model size, but beyond that, it only depends on sparsity, not on scale.

Yeah.

We can do a bit of analysis on this. You can think of it in terms of number of users, but maybe a more productive way to think of it is in terms of number of tokens per second.

So what does this batch size mean in terms of tokens per second of the system?

So...

Tokens per second is going to be equal to the batch size, because we run a batch-many tokens, and we do that every time step t, which, let's say, is equal to this 15 milliseconds, 20 milliseconds number. So this ends up being the batch size itself times...

Uh, about 60, so, like, 64 times B. And so this ends up being around 2,000 times 64.

So, like, 128K tokens per second.
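Putting that arithmetic into a minimal sketch: the batch size of roughly 2,000 follows the discussion above, and the step latency is chosen here so that 1/t is about 64 steps per second, matching the "64 times B" figure in the quote.

```python
# Throughput sketch: each decode step produces one token per sequence in the
# batch, so system throughput is batch_size / step_latency.
# Values follow the discussion above (batch ~2,000, step time ~15 ms).

batch_size = 2000         # concurrent sequences per decode step (from the discussion)
step_latency_s = 0.0156   # ~15.6 ms per step, chosen so 1/t is ~64 steps/s

steps_per_second = 1 / step_latency_s            # ~64
tokens_per_second = batch_size * steps_per_second

print(f"~{steps_per_second:.0f} steps/s, ~{tokens_per_second / 1000:.0f}K tokens/s")
# -> ~64 steps/s, ~128K tokens/s
```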

So this is sort of in more digestible units.

Like, it's hard to reason about concurrent users, but what is the global traffic for a system?