Stefano Ermon
I think if you try our website and watch the animation of how the diffusion model works,
you're going to see that it constantly changes the answer, and it's not one token at a time.
Many things get changed at the same time.
And that's what makes it more parallel, and much more suitable to GPUs.
GPUs are built to process many things in parallel.
They effectively apply the same computation across different data points.
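As an illustrative sketch (not from the conversation itself), applying one computation across many data points at once is the data-parallel pattern being described; here it's shown with NumPy vectorization standing in for GPU execution, with made-up shapes:

```python
import numpy as np

# One function, applied to every row of a batch in a single call --
# the "same computation across different data points" pattern.
def f(x, w, b):
    return np.maximum(x @ w + b, 0.0)  # affine transform + ReLU

rng = np.random.default_rng(0)
batch = rng.standard_normal((1024, 64))  # 1024 independent data points
w = rng.standard_normal((64, 32))
b = np.zeros(32)

out = f(batch, w, b)  # all 1024 points processed together
print(out.shape)      # (1024, 32)
```

On a GPU the same shape of computation maps onto thousands of threads, each handling a slice of the batch.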
And the kind of computation we do
when you sample from an autoregressive model does not map well to a GPU at all.
It's a very memory-bound kind of computation, where you spend most of your time moving weights from slow memory to fast memory, where you can actually do the computation.
So the arithmetic intensity of the kind of inference workloads that we have today with an autoregressive model is very poor.
The utilization is very low, and that's why people are building massive data centers, or even building custom AI inference chips that are better suited to that kind of workload.
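The arithmetic-intensity point can be made with a back-of-envelope calculation. This is a hedged sketch with assumed, illustrative numbers (a 7B-parameter model, fp16 weights, batch size 1, and rough peak specs for a modern accelerator), not measurements:

```python
# Back-of-envelope arithmetic intensity of batch-1 autoregressive decoding.
# All numbers below are illustrative assumptions.
params = 7e9             # assumed 7B-parameter model
bytes_per_weight = 2     # fp16

flops_per_token = 2 * params                  # ~2 FLOPs per weight (mul + add)
bytes_per_token = params * bytes_per_weight   # every weight read once per step

intensity = flops_per_token / bytes_per_token  # FLOPs per byte moved
print(f"arithmetic intensity ~ {intensity:.1f} FLOPs/byte")

# An accelerator is only compute-bound above its ridge point
# (peak FLOPs / memory bandwidth) -- on the order of hundreds of FLOPs/byte.
ridge_point = 989e12 / 3.35e12  # illustrative peak-compute / bandwidth ratio
print(f"compute utilization upper bound ~ {intensity / ridge_point:.1%}")
```

With ~1 FLOP per byte against a ridge point in the hundreds, the arithmetic units sit idle most of the time, which is the low utilization described above.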
Because it's a sequential computation.
You cannot generate the third token until you've generated the first and the second.
And so it's just a structural bottleneck.
There is no way to parallelize it, because there are sequential dependencies across the computation.
And so you can't process something into the future until you've generated everything before it.
And so there's just no way to parallelize that.
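The structural difference can be sketched with a toy contrast (not a real model; `next_token` and the refinement rule are placeholders): the autoregressive loop cannot produce position `i` before positions `0..i-1`, while a diffusion-style sampler updates every position on each step, so the work within a step parallelizes across positions.

```python
def next_token(prefix):
    # Stand-in for a model's next-token rule; depends on the whole prefix.
    return sum(prefix) % 10

def autoregressive(n):
    seq = [1]
    for _ in range(n - 1):   # token i requires tokens 0..i-1 first
        seq.append(next_token(seq))
    return seq

def diffusion_style(n, steps=4):
    seq = [0] * n            # start from a blank "noisy" draft
    for s in range(steps):   # each step rewrites ALL positions at once
        seq = [(tok + i + s) % 10 for i, tok in enumerate(seq)]
    return seq

print(autoregressive(6))
print(diffusion_style(6))
```

In the first function the loop iterations form a chain and must run one after another; in the second, the list comprehension inside each step has no cross-position dependency, which is what a GPU can spread across its cores.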
Exactly.
It's shifting from a memory-bound