Dwarkesh Podcast

Reiner Pope – The math behind how LLMs are trained and served

29 Apr 2026

Transcription

Chapter 1: What is the format of the interview with Reiner Pope?

0.031 - 17.207 Dwarkesh Patel

Today, I'm interviewing Reiner Pope, who is CEO of MatX, which is a new chip startup. Previously, he was doing TPU architecture and many other things at Google. This is a very different format from my usual interviews. This is going to be a blackboard lecture. We're going to get up in a second. We, in fact, built this whole new studio with specifically this format in mind.

17.809 - 24.861 Dwarkesh Patel

And so it's a pleasure to get to inaugurate it with you. We're going to be talking about model architecture, ML, and many other things.

24.881 - 45.21 Dwarkesh Patel

And the reason I think it's an important topic is because once you actually understand how training and inference actually work in a cluster, a lot of things start making sense: why AI is the way it is, why AI architectures are the way they are, why API prices are the way they are, and fundamentally why AI progress is the way it is.

45.23 - 61.459 Dwarkesh Patel

And you need to understand the details to get there. And you need a blackboard to understand the details. So Reiner, thank you so much for doing this. Yeah, very happy to be here. Just a heads up, this is a lecture with graphs and equations and all that stuff. So if you can, I would really recommend watching it on a video platform like YouTube.

61.439 - 83.258 Dwarkesh Patel

Okay, full disclosure, I am an angel investor in MatX, but that's unrelated to this podcast. Reiner, maybe to kick us off, I'll ask this question. So, we have a couple of companies like Claude and Codex and Cursor offering something like Fast Mode, where for 6x the price, they'll stream you tokens at 2.5x the speed. Mechanically, I'm curious what's going on here.

Chapter 2: Why is understanding model architecture important?

83.618 - 105.582 Dwarkesh Patel

Why is it the case that you can pay more to get lower latency? And two, could you keep going? Could you pay 100x more and somehow get even faster speeds, or much, much faster speeds? And three, could you go the other way? Could you have something like Claude Code slow mode where, if you are willing to wait for minutes on end, you could get even cheaper prices?

106.082 - 109.225 Dwarkesh Patel

So maybe this will help motivate the kind of analysis that you'll be doing through the lecture.

109.466 - 124.673 Reiner Pope

Great. I mean, a little bit to jump to the conclusion, the big effect is batch size, but what we're going to do now is quantify exactly what that looks like and what its implications are on latency and cost. There's going to be another effect, which is, you can call it speculative decoding or multi-token prediction.

125.214 - 147.522 Reiner Pope

We can maybe come back to that later, but I think the first thing that we'll talk through is batch size. So what I'd like to introduce is sort of the two principles of analysis. Firstly, we're going to look at a roofline analysis of how I run a transformer model on a cluster of chips. We'll take a sort of, let's say, a Blackwell NVL72 cluster, so a rack of 72 GPUs.

147.502 - 166.103 Reiner Pope

And so the roofline analysis means we look at memory bandwidth and compute performance. And then the other side of that is that we're going to look at just two simple factors of the model, which are the time to operate on the weights and then the time to operate on the context, the KV cache. So let's jump in.

166.643 - 187.664 Reiner Pope

What we're going to try and do is we're going to try and estimate the time that it takes to run an inference of a certain shape. Now, we're not perfect here. We can't exactly predict the time. And so instead, we're going to approximate. And so we're going to say that the time must be greater than or equal to a certain quantity. And so we're going to consider two different aspects.

187.684 - 210.99 Reiner Pope

We're going to look at the time it takes to do the memory fetches and then the time it takes to do the compute. And it'll turn out that this actually gives us very strong predictive power, even with such a simple model. So one by one, what is the time that it takes to do the compute? So there are really two things I need to do in the compute.

211.03 - 235.913 Reiner Pope

I need to multiply by all of the active parameters, and then I need to do some work on the attention. So multiplying by all the active parameters, I have a certain batch size that I'm running, and then I've got a number of active parameters in my model. And then I'm just going to divide this by the compute throughput, which is the flops of the chip. So this is a hardware constant.

237.614 - 252.937 Reiner Pope

So this actually accounts for all of the compute time for all of the weight matrix multiplies. There's a little caveat here. We've sort of ignored the time to do any of the attention computation, but that in general will be quite small in comparison to this. So we'll ignore this.
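
To make that concrete, here's a minimal Python sketch of the compute roofline just described. The chip and model numbers are illustrative assumptions, not figures from the conversation, and the factor of 2 (a multiply plus an add per active parameter) is an assumption about how the FLOPs are counted.

```python
def compute_time(batch_size: int, active_params: float, chip_flops: float) -> float:
    """Lower bound (in seconds) on one decode step from the weight matrix multiplies:
    every token in the batch is multiplied by every active parameter."""
    return 2 * batch_size * active_params / chip_flops  # 2 FLOPs per multiply-add

# Hypothetical example: 30B active parameters on hardware doing 2e15 FLOP/s.
print(compute_time(batch_size=256, active_params=30e9, chip_flops=2e15))  # ~0.0077 s
```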

Chapter 3: How does batch size affect token cost and speed?

319.224 - 349.473 Reiner Pope

We need to fetch all of the weights. And so there is some time to fetch all of the total number of parameters, not just the active parameters. So there's weight fetch time. And then in addition, there's a KV-cache fetch time. So this actually depends on batch size. So for every element of the batch, we have to fetch an entire context length's worth of tokens, and then there's a size per token.

349.513 - 356.781 Reiner Pope

So, like, bytes for one token. And so that's a model parameter.
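
Putting the memory side into the same form, here's a sketch under assumed numbers; the per-token KV size, weight precision, and bandwidth below are placeholders, not values from the episode.

```python
def memory_time(total_params: float, bytes_per_param: float,
                batch_size: int, context_len: int, kv_bytes_per_token: float,
                hbm_bandwidth: float) -> float:
    """Lower bound (in seconds) on one decode step from HBM reads:
    every weight is fetched once, plus each request's full KV cache."""
    weight_bytes = total_params * bytes_per_param
    kv_bytes = batch_size * context_len * kv_bytes_per_token
    return (weight_bytes + kv_bytes) / hbm_bandwidth

# Hypothetical example: 600B total params at 1 byte each, 32k-token contexts,
# ~100 kB of KV per token, and 576 TB/s of aggregate HBM bandwidth in the rack.
print(memory_time(600e9, 1, batch_size=256, context_len=32_000,
                  kv_bytes_per_token=100e3, hbm_bandwidth=576e12))  # ~0.0025 s
```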

356.801 - 361.346 Dwarkesh Patel

And maybe just backing up, let's just explain what the KV cache is real quick.

361.466 - 385.307 Reiner Pope

Yeah. So when I do a forward pass, let me draw actually how the autoregressive inference works. So this is during decode. So I have a bunch of tokens of text. I'm drawing a tensor because ultimately the tokens are represented as some tensor in some embedding dimension. And then in this direction, I have the sequence length.

387.818 - 411.042 Reiner Pope

The work of running a decode is I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers. And in general, I'm going to have to do that work over all of these tokens. But then one step of decode is actually to produce just this one additional token out here.

411.376 - 434.294 Reiner Pope

And so what I'm going to do there is I'm going to run a full forwards pass of multiplying by all of the weight matrices in the entire model. But then I've got this attention mechanism where this token sort of, it's like looking at all of the past tokens in this way. And what is it looking at specifically? It is looking at some internal representation that the model has produced of the tokens.

434.775 - 461.633 Reiner Pope

And we call that the KV cache. So this process of attending, this single token attending to all of the history of tokens, that's attention. It is mostly dominated by memory fetches rather than matrix multiplies. So we've got the amount of memory that we're fetching shown over here. And then there's, of course, just then divided by the memory bandwidth. So the memory bytes per second.

464.566 - 489.352 Reiner Pope

So in fact, these equations here are actually enough for us to now draw some fit lines. And so the things that we'd like to look at are sensitivity to batch size, and then also, which we'll draw separately, to context length. So we said that the big effect you can get is like some trade-off in latency versus cost via batch size. So let's draw them out.

489.813 - 518.879 Reiner Pope

I think there's just really two graphs we want to draw. We'll first just draw batch size versus time here. So when we look at the shape of this, we've got a maximum of the sum and then another term. So let's look at these terms one by one and how they scale the time for compute and memory and how they show up. So let's first look at this compute time.
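
Combining the two rooflines gives the batch-size-versus-time curve he's about to draw. A self-contained sketch, with the same illustrative (not measured) numbers as above baked in as defaults:

```python
def step_time(batch_size,
              active_params=30e9, total_params=600e9, bytes_per_param=1,
              context_len=32_000, kv_bytes_per_token=100e3,
              chip_flops=2e15, hbm_bandwidth=576e12):
    """Decode step time >= max(compute roofline, memory roofline).
    All default numbers are illustrative assumptions."""
    t_compute = 2 * batch_size * active_params / chip_flops
    t_memory = (total_params * bytes_per_param
                + batch_size * context_len * kv_bytes_per_token) / hbm_bandwidth
    return max(t_compute, t_memory)

for b in (1, 8, 64, 512):
    t = step_time(b)
    # Step latency grows with batch size, but hardware-time per token shrinks:
    # that's the latency-versus-cost trade-off behind fast and slow modes.
    print(f"batch={b:4d}  step={t*1e3:7.2f} ms  time/token={t/b*1e6:9.2f} us")
```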

Chapter 4: What is the relationship between batch size and model performance?

1361.398 - 1387.68 Reiner Pope

I know that it's going to take, for example, maybe something like 20 milliseconds; that is a common place to end up landing. What I'm going to produce is a timeline of what is running on the GPU. It's going to start a new batch every 20 milliseconds, regardless. And so each of these is 20, this is 40. You can think of this as a schedule for the train.

1387.8 - 1406.918 Reiner Pope

A new train departs every 20 milliseconds. Any passengers who are ready board the train. If the train is full, then they wait for the next train. If the train is not full, the train is going to go anyway. And so in terms of what that means for queuing latency, it means that the worst case is that a request arrives just after the train departed.

1407.358 - 1429.583 Reiner Pope

It has to wait for the next train, so that's up to 20 milliseconds, and then it has to wait for that train to complete. And so the worst-case latency is 40 milliseconds. So how is the 20 milliseconds derived? I mean, it's a rule of thumb, but where it comes from is not fully explained yet. So far we've focused on memory bandwidth and compute time.
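
Before getting to where the 20 milliseconds comes from, here's a tiny sketch of the dispatch rule just described, to make the worst-case arithmetic explicit; the 20 ms figures are the rule-of-thumb numbers from the conversation.

```python
DISPATCH_INTERVAL = 0.020  # a new "train" (batch) departs every 20 ms
STEP_TIME = 0.020          # and each train takes about 20 ms to run

def total_latency(time_since_last_departure: float) -> float:
    """Queueing plus service time for a request that arrives some time
    after the previous departure: wait for the next train, then ride it."""
    wait = DISPATCH_INTERVAL - time_since_last_departure
    return wait + STEP_TIME

print(total_latency(0.0199))  # arrived just before a departure: ~20 ms
print(total_latency(0.0001))  # arrived just after a departure: ~40 ms, the worst case
```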

1430.144 - 1453.001 Reiner Pope

When we look at memory, the other consideration is that we want to use all of the memory capacity we have. And so generally we're going to use all of that memory capacity to store the weights or the KVs. And so, in the time of doing a forward pass, maybe we want to read all of the memory capacity into the chip. And so that is capacity divided by bandwidth.

1453.022 - 1456.489 Reiner Pope

That tends to be 20 milliseconds on many different generations of HBM.

1457.05 - 1462.174 Dwarkesh Patel

The units make sense. You would have... bytes divided by bytes per second.

1462.194 - 1479.432 Reiner Pope

Yeah. So for example, I mean, on I think the Rubin generation, it is something like 288 gigabytes divided by 20 terabytes per second. And this looks like it comes out to about 15 milliseconds.
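
Checking that arithmetic, with the capacity and bandwidth figures exactly as quoted in the conversation:

```python
# Forward-pass time target ~ HBM capacity / HBM bandwidth,
# i.e. the time to read the whole memory once.
hbm_capacity = 288e9    # bytes   (288 GB, as quoted)
hbm_bandwidth = 20e12   # bytes/s (20 TB/s, as quoted)
print(hbm_capacity / hbm_bandwidth)  # 0.0144 s, i.e. roughly 15 ms
```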

1482.162 - 1512.03 Dwarkesh Patel

Let me just make sure I understand what it's saying. I mean, I understand why the units cancel, the sort of unit analysis. But what it's saying is, we can evacuate and replace the HBM's contents in this amount of time. And so we don't want to be in a situation where the HBM is not big enough that we're not actually able to write everything we want to it or take everything out of it.

1512.151 - 1517.52 Dwarkesh Patel

Or we don't want to be in a situation where our ability to write back and forth is so big, or sorry, so small compared.

Chapter 5: How does batch size impact token cost and speed?

4034.725 - 4055.367 Reiner Pope

And somehow we're going to arrange, I'll hand wave exactly how, somehow we can arrange the same perfect sharding of the contexts across GPUs in a rack and based on layer across racks. And sorry, four is the number of racks. Yeah, for example. So...

4056.123 - 4074.785 Reiner Pope

This is the place where we actually need to go back and analyze this batch size B. And you were making this comment that there's micro-batching versus global batching. So let's come back to this pipelining diagram here. We've got one batch going forward here. And then as I drew it, it kind of just like disappeared. That's not really correct.

4074.925 - 4091.915 Reiner Pope

If you think about how decode is working, I have a bunch of tokens that I have generated already. I do one forwards pass where I generate a new token. And then I write that to my KV cache, and then I do another forwards pass that generates the next token.

Chapter 6: What is the significance of micro-batching in model training?

4092.696 - 4108.793 Reiner Pope

So I'm actually going to be running this batch zero in a loop. So in fact, I go forwards. Once I finish, I can start the next iteration of the loop up here. Yeah. So we'll just fill this in. We'll have the.

Chapter 7: Why is pipeline parallelism important for model efficiency?

4114.291 - 4157.394 Reiner Pope

Nice. Yeah, so we've got the two and three. So let's split this batch. This batch will be the global batch size. So B is going to be the number of micro-batches times the batch size per micro-batch. So how many micro-batches do we need? The number of micro-batches in this diagram is four: zero, one, two, three. And then the micro-batch size, this is still this, like, 2000-ish number.

4158.395 - 4170.393 Reiner Pope

This is the one that is, like... This is the, like, 2000 times sparsity. Sorry, no, this is the 300 times sparsity. 300 times sparsity.

4171.054 - 4174.099 Dwarkesh Patel

This is how big the train that departs every 20 milliseconds is.

4174.433 - 4195.677 Reiner Pope

Right, yes. This is going to be the 20-millisecond train. So the global batch size is the number of micro-batches times the local batch size. Local batch size is set by this hardware parameter. The number of micro-batches, well, the number of micro-batches is as small as possible such that we can wrap around and not leave any idle time when we wrap around.

4196.318 - 4214.305 Reiner Pope

So if we had fewer, we would have this idle time when we wrap around. And so you can sort of just visually see that it is equal to the number of pipeline stages. I mean, sort of proof by visual here, like it is four and it's four this way as well, but you can sort of look and see that it goes along here and then it wraps around a number of pipeline stages.
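
As a quick sketch of that bookkeeping: the pipeline-stage count is the four from the diagram, and the per-micro-batch size is the illustrative "2000-ish" number from the board, not a measured value.

```python
pipeline_stages = 4                    # number of pipeline stages in the diagram
num_micro_batches = pipeline_stages    # smallest count that leaves no idle bubble
batch_per_micro_batch = 2000           # the illustrative "2000-ish" local batch size

global_batch = num_micro_batches * batch_per_micro_batch
print(global_batch)  # 8000: the global batch B scales with pipeline depth
```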

4214.325 - 4223.762 Dwarkesh Patel

Yeah, and sorry, a very basic question. This is what is actually done? Mm-hmm. Okay, like as in a Frontier model today will actually have, during inference, have pipeline...

4223.742 - 4253.057 Reiner Pope

For sure, during massive-scale training, this is done. It can be done for inference. I'm actually going to make the case for why it is less attractive. It is useful for weights, but not so useful for KVs. The big challenge is, so let's fill this in. The number of micro-batches here ends up being equal to the number of pipeline stages. When we go back and substitute this, all of that into here,

4258.555 - 4290.656 Reiner Pope

We get a number of pipeline stages times this little b showing up in here. And then when we factor this out, I'm going to split this plus into two terms. We get the full division by E times P over here. We still have division by E times P over here, but the P's cancel, this P and this P. They cancel.

4292.018 - 4306.03 Reiner Pope

And so what we find, if you increase the number of pipeline stages, the memory footprint for the number of weights keeps going down and down and down. Of course. But the memory footprint for the number of activations stays constant. So it doesn't actually work.
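
Written out as a formula, with E the expert-parallel degree, P the number of pipeline stages, b the per-micro-batch size, L_ctx the context length, and W_bytes the total weight bytes; this is a reconstruction of the board algebra, and the symbol names are editorial choices rather than ones spelled out in the audio:

$$
\text{bytes per GPU} \;\ge\; \frac{W_{\text{bytes}}}{E\,P} \;+\; \frac{P \cdot b \cdot L_{\text{ctx}} \cdot \text{KV}_{\text{bytes/token}}}{E\,P} \;=\; \frac{W_{\text{bytes}}}{E\,P} \;+\; \frac{b \cdot L_{\text{ctx}} \cdot \text{KV}_{\text{bytes/token}}}{E}
$$

So the weight term shrinks as P grows, while the KV term does not, which is exactly the constraint just described.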

Chapter 8: Which parallelism strategies make sense for inference?

4361.563 - 4366.491 Dwarkesh Patel

This is going back fundamentally to the point of you're not able to amortize across KV caches.

4366.825 - 4390.048 Reiner Pope

Well, so first we said you can't amortize KV caches across batch size. And now we're saying you also can't shard it across pipeline stages. It sucks from both of those points of view. Yeah, yeah, yeah. Interesting. Okay, so then what is done during inference? So, I mean, the DeepSeek paper reports what they do, which is they just do a lot of expert parallelism. You should...

4390.298 - 4407.436 Reiner Pope

In effect, you should increase your expert parallelism up to your scale-up domain size, and then do very little pipelining. Maybe none at all, maybe two, just enough to make the weight storage not too big of an issue. Those are the only two parallelisms that really make sense.

4407.597 - 4418.348 Reiner Pope

In the past, there was tensor parallelism, which was cutting up within an expert, but the experts are so small now that that is not a profitable optimization.

4418.8 - 4424.507 Dwarkesh Patel

So this goes back to the question, does that mean that frontier labs, when they're doing inference, are just basically within a single scale-up?

4425.348 - 4444.993 Reiner Pope

Yes. Yeah, I mean, you can look at how it depends on model size. Like, you could have a very large model, like one that exceeds the memory of a rack, and there you should be doing a bit of pipelining. Maybe it's extremely sparse, for example, and that would be a reason to do it.

4445.108 - 4466.196 Dwarkesh Patel

So I guess this goes back to the question about, this goes back to the promise at the beginning of the lecture, which was, this will actually tell you about AI progress as well. To the extent it is the case that model size scaling has been slow until recently because, let me make sure I understand the claim. The claim would not be, you could have trained across more racks.

4466.876 - 4473.445 Dwarkesh Patel

It was just that it would not have made sense before, like we didn't have the ability to do inference for a bigger model easily.

4473.83 - 4494.497 Reiner Pope

Actually, so pipelining doesn't help with context length. It totally helps with model size. And so because of the ability to do pipelining, at least a rack should not be a constraint on your ability to fit the model parameters. I guess the other consideration you're asking like, why hasn't it scaled up more and why did bigger scale-up domains help?
