
Stefano Ermon

👤 Speaker
359 total appearances


Podcast Appearances

The Neuron: AI Explained
Diffusion for Text: Why Mercury Could Make LLMs 10x Faster

I think if you try our website and see the animation of how the diffusion model works, you're going to see that it constantly changes the answer, and it's not one token at a time.

Many things get changed at the same time.

And that's what makes it more parallel.
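As a toy illustration of that idea (this is a hypothetical sketch, not Mercury's actual algorithm): a diffusion-style text sampler starts from a fully masked sequence and re-predicts every position at each step, committing a few more positions per round, so the whole sequence is updated in parallel rather than one token at a time.

```python
import random

# Toy "diffusion-style" sampler (illustrative only, not the Mercury model):
# every masked position is re-proposed in the SAME step, and a few proposals
# are committed each round. The per-step work is parallel across positions.

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def denoise_step(seq, n_commit, rng):
    """Re-sample all masked positions at once, then commit n_commit of them."""
    proposals = {i: rng.choice(VOCAB) for i, t in enumerate(seq) if t == MASK}
    for i in list(proposals)[:n_commit]:  # finalize a few positions this round
        seq[i] = proposals[i]
    return seq

rng = random.Random(0)
seq = [MASK] * 8          # start from pure noise: everything masked
steps = 0
while MASK in seq:
    seq = denoise_step(seq, n_commit=2, rng=rng)
    steps += 1

print(steps)  # 8 positions / 2 committed per step = 4 steps, vs 8 autoregressive steps
```

Committing two positions per step halves the number of sequential rounds here; a real model decides dynamically which positions it is confident enough to finalize.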

It makes it much more suitable to GPUs.

GPUs are built to process many things in parallel.

They apply the same computation across different data points, effectively.

And the kind of computation that you do when you sample from an autoregressive model does not map well at all to a GPU.

It's a very memory-bound kind of computation: you spend most of your time moving weights from slow memory to the fast memory where you can actually do the computation.

So the arithmetic intensity of the inference workloads that we have today with an autoregressive model is very low.
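A back-of-envelope calculation shows why. All numbers below are illustrative assumptions (a hypothetical 7B-parameter model in 16-bit weights, and rough accelerator specs), not measurements from Mercury or any specific GPU:

```python
# Arithmetic intensity of batch-1 autoregressive decoding (illustrative numbers).
params = 7e9                  # assumed 7B-parameter model
bytes_per_param = 2           # fp16/bf16 weights

flops_per_token = 2 * params                 # ~2 FLOPs per parameter per token
bytes_per_token = params * bytes_per_param   # every weight read once per token

intensity = flops_per_token / bytes_per_token  # FLOPs per byte moved
print(intensity)              # 1.0 FLOP/byte

# An accelerator that sustains ~1e15 FLOP/s with ~3e12 B/s of memory
# bandwidth (assumed, order-of-magnitude) needs ~333 FLOPs per byte moved
# to keep its compute units busy.
gpu_flops, gpu_bw = 1e15, 3e12
needed = gpu_flops / gpu_bw
print(round(needed))          # ~333: decode sits far below this, hence memory bound
```

With roughly 1 FLOP of useful work per byte of weights moved, against a machine balance of hundreds of FLOPs per byte, the compute units sit idle waiting on memory, which is the low utilization described above.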

The utilization is very low, and that's why people are building massive data centers, or even building custom AI inference chips that are better suited for that kind of workload.

Because it's a sequential computation.

You cannot generate the third token until you've generated the first and the second.
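The dependency is easy to see in code. In this toy sketch (the `next_token` function is a made-up stand-in for a full model forward pass), each step consumes the entire prefix, so step t cannot even begin until step t-1 has finished:

```python
# Toy autoregressive loop: a hard serial dependency chain.
def next_token(prefix):
    # Hypothetical stand-in for a forward pass over the whole prefix.
    return sum(prefix) % 7

tokens = [1]                  # the prompt
for _ in range(5):
    # Each call needs ALL previously generated tokens as input,
    # so these five steps can never run in parallel.
    tokens.append(next_token(tokens))

print(tokens)  # [1, 1, 2, 4, 1, 2]
```

No scheduler or extra hardware can break this chain: the input to each step is the output of the previous one.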

And so it's just a structural bottleneck.

There is no way to parallelize it because there are sequential dependencies across the computation.

And so you can't process something into the future until you've generated everything before it.

And so there's just no way to parallelize that.

Exactly.

It's shifting from a memory bound