Reiner Pope
Okay.
So if we're just storing it in HBM, it has this sort of cost profile.
And then if we store in DDR, it's actually going to take some time.
So we get the same thing here, but it's bytes per token over DDR capacity times DDR cost per second.
But now this has a cost to retrieve that is higher than for HBM, because we need to copy it into HBM.
And so this is bytes per token over DDR bandwidth.
And then this consumes some amount of the DDR as well.
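As a rough sketch of the two DDR terms just described, the snippet below writes them as functions. The names (`Tier`, `bytes_per_token`, `storage_cost_per_s`, `retrieve_time_s`) and every number are assumptions made up for illustration, not figures from the talk; `bytes_per_token` stands for the per-token size of the stored data.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """A memory tier, described by hypothetical parameters."""
    name: str
    capacity_bytes: float         # total capacity of the tier
    bandwidth_bytes_per_s: float  # bandwidth for copying from this tier into HBM
    cost_per_s: float             # cost of owning the whole tier, per second


def storage_cost_per_s(bytes_per_token: float, tier: Tier) -> float:
    """Fraction of the tier one token occupies, times the tier's cost per second."""
    return bytes_per_token / tier.capacity_bytes * tier.cost_per_s


def retrieve_time_s(bytes_per_token: float, tier: Tier) -> float:
    """Time to copy one token's data out of the tier into HBM."""
    return bytes_per_token / tier.bandwidth_bytes_per_s


# Illustrative values only (assumed, not measured).
ddr = Tier(name="DDR", capacity_bytes=1e12, bandwidth_bytes_per_s=1e11, cost_per_s=1e-3)
print(storage_cost_per_s(1e5, ddr), retrieve_time_s(1e5, ddr))
```

The storage term scales with how long the data sits in the tier, while the retrieve term is paid each time the data is copied into HBM.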
And every scale-up has DDR and Flash?
That's really a deployment question, and so you can choose that.
NVIDIA does deploy in this form.
It has both.
Yeah, I mean, it depends on what you define a retrieve to be.
Here I'm defining retrieve to be move it into HBM so that you can start actually doing inference on it.
Sort of by definition.
So these are three things, and I guess I ordered them wrong.
In general, if you're balancing two costs and you've got different tiers in the memory hierarchy, you should expect as this cost goes up, this cost should go down.
So you can kind of see where the zeros are, and I should have ordered them this one first, this one second, and this one third.
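To make the ordering argument concrete, here is a minimal sketch that sorts tiers by retrieve cost and checks that storage cost falls as retrieve cost rises. The tuple layout, the function names, and the numbers are all hypothetical and in arbitrary units. The one value taken from the discussion is that HBM has zero retrieve cost, since retrieve is defined as moving the data into HBM.

```python
# Each tier is (name, retrieve_cost, storage_cost) in arbitrary, assumed units.
TIERS = [
    ("Flash", 5.0, 0.5),
    ("HBM", 0.0, 10.0),  # retrieve cost is zero by definition of "retrieve"
    ("DDR", 1.0, 3.0),
]


def ordered_by_retrieve_cost(tiers):
    """Return tiers sorted from cheapest to most expensive to retrieve from."""
    return sorted(tiers, key=lambda t: t[1])


def is_sensible_hierarchy(tiers):
    """True if storage cost never increases as retrieve cost goes up."""
    ordered = ordered_by_retrieve_cost(tiers)
    storage = [t[2] for t in ordered]
    return all(a >= b for a, b in zip(storage, storage[1:]))


print([name for name, _, _ in ordered_by_retrieve_cost(TIERS)])
print(is_sensible_hierarchy(TIERS))
```

A tier that is both more expensive to store in and more expensive to retrieve from than another would fail this check, which is the sense in which the two costs should trade off against each other.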
So...