Jeff Dean
But I mean, it's not that you would need that human understanding to figure out how to work the thing at runtime, because you just have some sort of learned router that's doing that.
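(A minimal sketch of what such a learned router might look like, assuming a standard top-k gating setup; the shapes, expert count, and function names are illustrative, not from the conversation.)

```python
import numpy as np

def topk_router(x, w_router, k=2):
    """Toy learned router: score each expert, keep the top-k per token.

    x:        (batch, d_model) token activations
    w_router: (d_model, n_experts) learned routing weights
    Returns expert indices and normalized gate weights per token.
    """
    logits = x @ w_router                                      # (batch, n_experts)
    topk_idx = np.argsort(logits, axis=-1)[:, -k:]             # k highest-scoring experts
    topk_logits = np.take_along_axis(logits, topk_idx, axis=-1)
    gates = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)                 # softmax over the chosen k
    return topk_idx, gates

# Illustrative shapes only: 8 tokens, d_model = 64, 4 experts.
rng = np.random.default_rng(0)
idx, gates = topk_router(rng.standard_normal((8, 64)),
                         rng.standard_normal((64, 4)), k=2)
```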
I mean, even for any sort of existing mixture of experts, you want the whole thing in memory.
I mean, basically, I guess there's kind of this misconception running around with mixture of experts that, okay, the benefit is that you don't even have to go through those weights in the model if some expert is unused. But that doesn't mean you don't have to retrieve that memory.
Because really, in order to be efficient, you're serving at very large batch sizes.
Of independent requests.
Right, of independent requests.
So it's not really the case that, OK, at this step, you're either looking at this expert or you're not looking at this expert.
Because if that were the case, then when you did look at the expert, you would be running it at batch size one, which is massively
Inefficient.
Like, you've got modern hardware where the operational intensities are, whatever, in the hundreds.
So that's not what's happening.
It's that you are looking at all the experts, but you only have to send a small fraction of the batch through each one.
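(A rough sketch of that serving pattern, assuming a toy setup with one dense matrix per expert and a router that has already assigned one expert per token; everything here is illustrative rather than a description of any production system.)

```python
import numpy as np

def moe_forward(x, expert_weights, expert_ids):
    """Run a whole batch through an MoE layer, one expert per token.

    x:              (batch, d_model) activations for the full batch
    expert_weights: list of (d_model, d_ff) matrices, one per expert
    expert_ids:     (batch,) index of the expert chosen for each token

    Every expert's weights are read, but each expert only multiplies
    the slice of the batch that was routed to it.
    """
    out = np.zeros((x.shape[0], expert_weights[0].shape[1]))
    for e, w in enumerate(expert_weights):
        mask = expert_ids == e
        if mask.any():
            out[mask] = x[mask] @ w   # sub-batch through expert e
    return out

# Illustrative usage: a big batch of independent requests, 4 experts.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64))
experts = [rng.standard_normal((64, 128)) for _ in range(4)]
ids = rng.integers(0, 4, size=256)    # pretend this came from the learned router
y = moe_forward(x, experts, ids)      # each expert sees ~64 tokens, not 1
```

Roughly, pushing a sub-batch of b tokens through a d_model by d_ff expert takes about 2 x b x d_model x d_ff FLOPs while reading d_model x d_ff weights, so the operational intensity grows with the sub-batch size; at b = 1 you get on the order of 2 FLOPs per weight read, far below the hundreds that modern accelerators need to stay compute-bound.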
You definitely want to have at least enough HBM to fit your whole model.
So depending on the size of your model, most likely that's how much...
That's how much HBM you'd want to have at a minimum.
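(As a back-of-the-envelope sketch; the parameter count and precision below are made-up examples, not figures from the conversation.)

```python
def min_hbm_gb(n_params, bytes_per_param=2):
    """Rough floor on the HBM needed just to hold the model weights.

    Ignores KV cache, activations, and any replication across chips,
    all of which push the real requirement higher.
    """
    return n_params * bytes_per_param / 1e9

# Hypothetical example: a 500B-parameter MoE served in bf16 (2 bytes/param).
print(min_hbm_gb(500e9))  # -> 1000.0 GB, spread across however many chips serve the model
```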
I mean, yeah.
I mean, even the data-control modularity stuff seems really cool, because then you could have, like, a piece of the model that's just trained for me.
So it knows all my private data.
We're going to need like a million automated researchers to invent all of this stuff.