Dwarkesh Patel
So, as the length of the prefix goes up, your memory bandwidth time per token declines, and that means that to the extent that you were bottlenecked on memory bandwidth before, you can avoid being bottlenecked on memory bandwidth. The fact that they are charging 5x less for
prefill than decode does suggest that they are bottlenecked on memory bandwidth to quite a degree, for them at least, because T is equivalent to cost, right?
It's the cost of renting the compute.
This is actually like, this would be at one and this would be at five.
That's right.
Yeah.
So it is in fact tremendously memory bandwidth bottlenecked.
The real graph looks something like that.
So, yeah, let me do it this way.
Yeah, that's right.
And then this is the gap on decode between the memory and the compute time.
Yeah, yeah.
Interesting.
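The argument above can be sketched as a toy roofline model. This is a minimal illustration, not the speakers' actual graph: it assumes normalized time units, a fixed memory cost per forward pass that is amortized over the tokens processed in that pass, and constants chosen only so that decode comes out 5x more expensive than prefill, matching the pricing ratio mentioned. The names `COMPUTE_TIME_PER_TOKEN`, `MEMORY_TIME_PER_PASS`, and `time_per_token` are hypothetical.

```python
# Toy roofline model in normalized units. Both constants are illustrative
# assumptions, picked so that single-token decode costs 5x long-prefix prefill.
COMPUTE_TIME_PER_TOKEN = 1.0  # time to do one token's arithmetic at peak throughput
MEMORY_TIME_PER_PASS = 5.0    # time to stream weights from HBM once per forward pass


def memory_time_per_token(tokens_per_pass: int) -> float:
    """Memory time is paid once per pass, so it is amortized over the tokens in the pass."""
    return MEMORY_TIME_PER_PASS / tokens_per_pass


def time_per_token(tokens_per_pass: int) -> float:
    """Per-token time T is set by whichever resource is the bottleneck."""
    return max(COMPUTE_TIME_PER_TOKEN, memory_time_per_token(tokens_per_pass))


for n in (1, 2, 5, 100, 1000):
    bound = "memory" if memory_time_per_token(n) > COMPUTE_TIME_PER_TOKEN else "compute"
    print(f"tokens per pass = {n:5d}  T per token = {time_per_token(n):.3f}  ({bound}-bound)")
```

With one token per pass (decode), the sketch is memory-bound at T = 5; with a long prefix (prefill), the memory term shrinks below the compute term and T flattens at 1. If T is taken as a proxy for cost, that gap is the 5x price difference. The model ignores batching across requests and KV cache reads, both of which would change the real curve.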
Another interesting one would be why cache hits are so much cheaper.
So I think, if I remember correctly, cache hits are like 10x cheaper.
It's more expensive to write to cache according to the pricing on all these models.
But if you do hit a cache, it's 10x cheaper.
So what is going on with... Presumably, this is the cost of keeping something in HBM rather than just evicting it.
But if you do keep it in HBM, then it's cheaper to load again?