Stefano Ermon
I think if you try our website and watch the animation of how the diffusion model works,
you're going to see that it constantly changes the answer, and it's not one token at a time.
Many things get changed at the same time.
And that's what makes it more parallel, and much more suitable to GPUs.
GPUs are built to process many things in parallel.
They effectively apply the same computation across different data points.
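As an illustrative sketch (not from the conversation itself), applying one computation across many data points at once is the data-parallel pattern being described; here it's shown with NumPy vectorization standing in for GPU execution, with made-up shapes:

```python
import numpy as np

# One function, applied to every row of a batch in a single call --
# the "same computation across different data points" pattern.
def f(x, w, b):
    return np.maximum(x @ w + b, 0.0)  # affine transform + ReLU

rng = np.random.default_rng(0)
batch = rng.standard_normal((1024, 64))  # 1024 independent data points
w = rng.standard_normal((64, 32))
b = np.zeros(32)

out = f(batch, w, b)  # all 1024 points processed together
print(out.shape)      # (1024, 32)
```

On a GPU the same shape of computation maps onto thousands of threads, each handling a slice of the batch.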
And the kind of computation we do
when you sample from an autoregressive model does not map well to a GPU at all.
It's a very memory-bound kind of computation, where you spend most of your time moving weights from slow memory to fast memory, where you can actually do the computation.
So the arithmetic intensity of the kind of inference workloads that we have today with an autoregressive model is very poor.
The utilization is very low, and that's why people are building massive data centers, or even building custom AI inference chips that are better suited to that kind of workload.
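The arithmetic-intensity point can be made with a back-of-envelope calculation. This is a hedged sketch with assumed, illustrative numbers (a 7B-parameter model, fp16 weights, batch size 1, and rough peak specs for a modern accelerator), not measurements:

```python
# Back-of-envelope arithmetic intensity of batch-1 autoregressive decoding.
# All numbers below are illustrative assumptions.
params = 7e9             # assumed 7B-parameter model
bytes_per_weight = 2     # fp16

flops_per_token = 2 * params                  # ~2 FLOPs per weight (mul + add)
bytes_per_token = params * bytes_per_weight   # every weight read once per step

intensity = flops_per_token / bytes_per_token  # FLOPs per byte moved
print(f"arithmetic intensity ~ {intensity:.1f} FLOPs/byte")

# An accelerator is only compute-bound above its ridge point
# (peak FLOPs / memory bandwidth) -- on the order of hundreds of FLOPs/byte.
ridge_point = 989e12 / 3.35e12  # illustrative peak-compute / bandwidth ratio
print(f"compute utilization upper bound ~ {intensity / ridge_point:.1%}")
```

With ~1 FLOP per byte against a ridge point in the hundreds, the arithmetic units sit idle most of the time, which is the low utilization described above.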
Because it's a sequential computation.
You cannot generate the third token until you've generated the first and the second.
And so it's just a structural bottleneck.
There is no way to parallelize it, because there are sequential dependencies across the computation.
And so you can't process something into the future until you've generated everything before it.
And so there's just no way to parallelize that.
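The structural difference can be sketched with a toy contrast (not a real model; `next_token` and the refinement rule are placeholders): the autoregressive loop cannot produce position `i` before positions `0..i-1`, while a diffusion-style sampler updates every position on each step, so the work within a step parallelizes across positions.

```python
def next_token(prefix):
    # Stand-in for a model's next-token rule; depends on the whole prefix.
    return sum(prefix) % 10

def autoregressive(n):
    seq = [1]
    for _ in range(n - 1):   # token i requires tokens 0..i-1 first
        seq.append(next_token(seq))
    return seq

def diffusion_style(n, steps=4):
    seq = [0] * n            # start from a blank "noisy" draft
    for s in range(steps):   # each step rewrites ALL positions at once
        seq = [(tok + i + s) % 10 for i, tok in enumerate(seq)]
    return seq

print(autoregressive(6))
print(diffusion_style(6))
```

In the first function the loop iterations form a chain and must run one after another; in the second, the list comprehension inside each step has no cross-position dependency, which is what a GPU can spread across its cores.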
Exactly.
It's shifting from a memory-bound