Stefano Ermon
regime to a compute-bound regime, where you're basically bounded by the number of FLOPs you have available on the GPU, which is a much easier quantity to scale.
Like if you were a chip manufacturer, it's a lot easier to add FLOPs than to increase memory bandwidth.
Yeah, some of the labs are very entrenched in a certain stack.
And so there's a big cost if they were to switch to something different.
There is quite a bit of secret sauce involved in terms of like, what is the right way to train these models?
What is the right way to...
you know, even just sample from them.
Like it's not as obvious as, okay, generate one token, append it, generate the next one, append it.
There is not much you can do on the inference side if you have a traditional autoregressive model.
But on a diffusion model, the design space for inference algorithms is much broader.
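To make the contrast concrete, here is a toy sketch (my illustration, not code from the speaker): autoregressive decoding is locked into one-token-at-a-time appending, while a masked-diffusion-style sampler exposes extra inference knobs, such as how many denoising steps to run and how many positions to commit per step. The function names and the `denoise`/`predict_next` interfaces are hypothetical.

```python
# Hypothetical sketch: autoregressive decoding fixes the inference recipe --
# one forward pass per token, appended left to right.
def autoregressive_decode(predict_next, prompt, n_tokens):
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(predict_next(seq))  # exactly one token per step
    return seq

# A masked-diffusion-style sampler (assumed interface) has a broader design
# space: number of denoising steps, and how many masked positions to fill
# in parallel at each step, are both tunable.
def diffusion_decode(denoise, length, n_steps):
    MASK = None
    seq = [MASK] * length                      # start fully masked
    per_step = max(1, length // n_steps)       # positions committed per step
    for _ in range(n_steps):
        masked = [i for i, t in enumerate(seq) if t is MASK]
        if not masked:
            break
        # commit several positions in parallel -- a design choice with no
        # analogue in the autoregressive loop above
        for i, tok in denoise(seq, masked[:per_step]):
            seq[i] = tok
    return seq
```

With a real model, `per_step` and `n_steps` become quality/throughput dials; the autoregressive loop has no comparable dials on the sampling side.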
And then there is also the issue at the ML systems level. If you think about actually serving production workloads, there is a decent amount of open source and, of course, closed source solutions for autoregressive models.
Things like vLLM, SGLang, TensorRT. There are pretty mature serving stacks for autoregressive models.
For diffusion models, it's much earlier. We have our own stack, but it takes a significant amount of work to figure out how to actually make things efficient in practice on real-world GPUs, and there are all kinds of optimizations you can do on the systems side.
Yeah.
Basically, if you can generate more tokens per second, what this means is that for the same amount of hardware, for the same number of GPUs, you can produce more tokens.
And so the cost per token is going to go down.
And that's why we're able to serve our models much more cheaply than what you would get if you were to use traditional autoregressive models, because we make better use of the existing hardware.
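The cost argument above is simple division, which a few lines make explicit. All numbers here are made up for illustration, not the speaker's actual figures:

```python
# Illustrative arithmetic: cost per token = (GPU cost per second) / (tokens
# per second). Hypothetical $2/GPU-hour rate; not a quoted price.
gpu_cost_per_hour = 2.00
gpu_cost_per_sec = gpu_cost_per_hour / 3600

def cost_per_million_tokens(tokens_per_sec):
    return gpu_cost_per_sec / tokens_per_sec * 1_000_000

# Same hardware, higher throughput -> proportionally lower cost per token.
slow = cost_per_million_tokens(100)    # e.g. sequential decoding
fast = cost_per_million_tokens(1000)   # e.g. more parallel decoding
# A 10x throughput gain cuts the per-token cost by exactly 10x.
```

The point is just that throughput and cost per token are inversely proportional at fixed hardware spend.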