Azeem Azhar
And you know exactly what that is like because you have used these tools.
Word after word after word, one at a time, like a slow teletype from the old days.
The model is producing tokens one at a time, each depending on the previous one.
This can't be parallelized.
It's structurally sequential.
The bottleneck here is no longer raw compute.
It's memory bandwidth.
How fast can you stream the model's weights from memory for each individual token step?
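A quick back-of-envelope calculation makes this concrete. The numbers below are illustrative assumptions (a hypothetical 70B-parameter model in FP16, roughly H100-class memory bandwidth), not vendor specifications; the shape of the arithmetic is what matters.

```python
# Roofline sketch: single-stream decode speed is capped by how fast the
# weights can be streamed from memory, not by arithmetic throughput.
params = 70e9            # hypothetical 70B-parameter model
bytes_per_param = 2      # FP16 weights
weight_bytes = params * bytes_per_param   # 140 GB of weights

hbm_bandwidth = 3.35e12  # ~3.35 TB/s, roughly an H100-class GPU (assumed)

# Each token step must stream essentially all weights from memory once,
# so the ceiling on tokens/second for a single sequence is:
max_tokens_per_sec = hbm_bandwidth / weight_bytes
print(round(max_tokens_per_sec, 1))  # ~23.9 tokens/sec, regardless of FLOPs
```

Batching many requests together amortizes the weight reads, which is how serving systems claw back efficiency, but a single sequence stays bandwidth-bound.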
So now you have lots of GPU cores sitting largely idle, waiting on memory reads just to produce one token at a time.
So GPUs are not fantastic here.
They weren't designed for this.
And as we move from the training era to the inference era, well, the workflows are shifting from building models to running them constantly at scale for billions of users and lots of agents.
That efficiency becomes a serious problem.
That's why you acquire Groq,
or do whatever deal you did with Groq. It's a fast move, a high-conviction move by an incumbent to ensure it can serve that changing market. A quick note: if you want to support us in bringing more of these conversations to the world, please consider subscribing to the show.
So NVIDIA is going to make some new chips or systems with Groq technology embedded later this year.
The point is that the combined architecture uses NVIDIA's homegrown Vera Rubin GPUs and Groq's processing units, and it will result in a 35-fold improvement in throughput per megawatt of power versus NVIDIA's current Blackwells, which are the posh chips of the moment.
This isn't the first time NVIDIA has done something like this.
They acquired Mellanox, a networking company, and it turned into a real advantage for NVIDIA.