Dwarkesh Patel
๐ค SpeakerAppearances Over Time
Podcast Appearances
Okay, like as in a Frontier model today will actually have, during inference, have pipeline...
I know this is wrong.
I'm just trying to think out why my train of logic here is wrong.
If you have many different, you're pipelining through many different stages, the KV values are not shared between layers.
So why would it not help to be pipelining across multiple layers?
Because then you don't have to store.
You're right.
Right.
This is going back fundamentally to the point of you're not able to amortize across KV caches.
So this goes back to the question, does that mean that frontier labs, when they're doing inference, are just basically within a single scale-up?
So I guess this goes back to the question about, this goes back to the promise at the beginning of the lecture, which was, this will actually tell you about AI progress as well.
To the extent it is the case that model size scaling has been slow until recently because, let me make sure I understand the claim.
The claim would not be, you could have trained across more racks.
It was just that it would not have made sense before, like we didn't have the ability to do inference for a bigger model easily.
I was just about to ask.
So what is the going from rack to rack?
What is the latency cost per hop?
Is four a realistic number of how many pipelining stages you might have?
Wait, I guess it's 10 milliseconds per token.
That's right.