Dwarkesh Patel
Podcast Appearances
So if you can, I would really recommend watching it on a video platform like YouTube.
Okay, full disclosure, I am an angel investor in MatX, but that's unrelated to this podcast.
Reiner, maybe to kick us off, I'll ask this question.
So, we have a couple of companies like Claude and Codex and Cursor offering something like Fast Mode, where for 6x the price, they'll stream you tokens at 2.5x the speed.
Mechanically, I'm curious what's going on here.
Why is it the case that you can pay more to get lower latency?
And two, could you keep going?
Could you pay 100x more and somehow get even faster speeds or much, much faster speeds?
And three, could you go the other way?
Could you have something like a Claude Code slow mode where, if you are willing to wait for minutes on end, you could get even cheaper prices?
So maybe this will help motivate the kind of analysis that you'll be doing through the lecture.
Maybe I'll just interrupt from time to time to ask some very naive questions or to clarify some basic points, but...
Just for the audience, you're not serving one user at a time.
The batch refers to the fact that you're serving many different users at the same time.
Yeah.
And that's a whole batch.
And maybe just backing up, let's explain what the KV cache is real quick.
It seems like the way you've drawn the slopes for compute time, for how the KV cache grows, and for what implication the KV cache has on memory time... what if this were above or below? Is that necessarily the case? Because if this is always true, then as batch size grows, compute always dominates the KV cache, which suggests that if you have a big enough batch size, maybe memory is never an issue.
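[Editor's sketch: the question above turns on comparing two lines that both grow with batch size. The snippet below is a minimal illustration under a simple roofline-style assumption, not the model drawn in the lecture. Every number and name in it (parameter count, KV cache bytes per sequence, FLOP/s, memory bandwidth, `step_times`, `crossover_batch`) is hypothetical. It suggests that whether compute ever overtakes memory depends on which slope is steeper, not only on how large the batch is.]

```python
# Hypothetical per-decode-step cost model (all defaults are made-up numbers).
def step_times(batch, params=70e9, bytes_per_param=2,
               kv_bytes_per_seq=1e9, flops=1e15, bandwidth=3e12):
    """Return (compute_s, weights_s, kv_s) for one decode step."""
    compute_s = 2 * params * batch / flops          # grows with batch
    weights_s = params * bytes_per_param / bandwidth  # fixed, shared by the batch
    kv_s = batch * kv_bytes_per_seq / bandwidth     # grows with batch
    return compute_s, weights_s, kv_s


def crossover_batch(params=70e9, bytes_per_param=2,
                    kv_bytes_per_seq=1e9, flops=1e15, bandwidth=3e12):
    """Batch size where compute time equals total memory time.

    Returns None when the KV slope is at least as steep as the compute
    slope, in which case memory time stays ahead at every batch size.
    """
    compute_slope = 2 * params / flops
    kv_slope = kv_bytes_per_seq / bandwidth
    if kv_slope >= compute_slope:
        return None
    weights_s = params * bytes_per_param / bandwidth
    return weights_s / (compute_slope - kv_slope)


if __name__ == "__main__":
    # Long context (large KV cache per sequence): memory never stops dominating.
    print(crossover_batch(kv_bytes_per_seq=1e9))
    # Short context (small KV cache per sequence): compute takes over eventually.
    print(crossover_batch(kv_bytes_per_seq=1e8))
```

[In this toy model the fixed cost of reading the weights is amortized across the batch, so a larger batch helps only when the KV cache line is shallower than the compute line.]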
And is there something especially significant about the slope being exactly the slope of the...