Stefano Ermon

👤 Speaker
359 total appearances

Podcast Appearances

The Neuron: AI Explained
Diffusion for Text: Why Mercury Could Make LLMs 10x Faster

...regime to a compute-bound regime, where you basically are bounded by the number of FLOPs that you have available on the GPU, which is a much easier quantity to scale.

Like if you were a chip manufacturer, it's a lot easier to add FLOPs than to increase memory bandwidth.
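
The memory-bound versus compute-bound distinction can be made concrete with a back-of-the-envelope roofline calculation. All the numbers below (roughly H100-class bandwidth and FLOPs, a hypothetical 7B-parameter model) are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope roofline sketch: autoregressive decoding is
# memory-bound because every generated token re-reads all the weights.
# All numbers are illustrative assumptions, not measurements.

MEM_BANDWIDTH = 3.35e12   # bytes/s, assumed HBM bandwidth
PEAK_FLOPS = 1.0e15       # FLOP/s, assumed dense bf16 peak

params = 7e9              # hypothetical 7B-parameter model
bytes_per_param = 2       # bf16 weights

# One batch-1 decode step streams all weights once and does about
# 2 FLOPs per parameter (a multiply and an add per weight).
time_memory = params * bytes_per_param / MEM_BANDWIDTH
time_compute = 2 * params / PEAK_FLOPS

print(f"time to stream weights: {time_memory * 1e3:.2f} ms")
print(f"time to do the math:    {time_compute * 1e3:.3f} ms")
print("one-token step is",
      "memory" if time_memory > time_compute else "compute", "bound")

# Producing many tokens per forward pass (as a diffusion model can)
# amortizes the same weight traffic over more tokens. With these
# numbers the crossover is around 300 tokens per step.
tokens_per_step = 1024
time_compute_parallel = tokens_per_step * 2 * params / PEAK_FLOPS
print("1024-token step is",
      "memory" if time_memory > time_compute_parallel else "compute", "bound")
```

With these assumed numbers, streaming the weights takes hundreds of times longer than the arithmetic for a single token, which is why adding FLOPs alone doesn't speed up one-token-at-a-time decoding.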

Yeah, some of the labs are very entrenched in a certain stack.

And so there's a big cost if they were to switch to something different.

There is quite a bit of secret sauce involved in terms of like, what is the right way to train these models?

What is the right way to, you know, even just sample from them?

Like it's not as obvious as, okay, generate one token, append it, generate the next one, append it.

There is not much you can do on the inference side if you have a traditional autoregressive model.

But on a diffusion model, the design space for inference algorithms is much broader.
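
A minimal sketch of why the design space is broader: a masked-diffusion-style decoder proposes tokens for all positions at once, and the inference algorithm is free to choose how many positions to commit per step and in what order. The confidence-based unmasking rule and toy denoiser below are illustrative assumptions, not Mercury's actual method:

```python
import random

MASK = -1
rng = random.Random(0)

def toy_denoiser(tokens):
    # Hypothetical stand-in for a text diffusion model: for every masked
    # position, propose a (token, confidence) pair, all in parallel.
    return {i: (rng.randrange(1000), rng.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length, tokens_per_step):
    # Start from an all-masked sequence and iteratively unmask it.
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        proposals = toy_denoiser(tokens)
        # Inference-time design choice: commit the k most confident
        # positions per step. Other schedules (left-to-right, adaptive k,
        # remasking low-confidence tokens) fit in the same loop.
        best = sorted(proposals, key=lambda i: proposals[i][1],
                      reverse=True)[:tokens_per_step]
        for i in best:
            tokens[i] = proposals[i][0]
        steps += 1
    return tokens, steps

tokens, steps = diffusion_decode(length=16, tokens_per_step=4)
print(f"decoded {len(tokens)} tokens in {steps} denoising steps")
```

The same trained model supports many such schedules, which is exactly the extra inference-side freedom an autoregressive model lacks.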

And then there is also the issue, kind of at the ML systems level, if you think about actually serving production workloads: there is a decent amount of open-source and, of course, closed-source solutions for autoregressive models.

Things like vLLM, SGLang, TensorRT: there are pretty mature serving stacks for autoregressive models.

For diffusion models, it's much more, you know, it's much earlier. We have our own stack, but it takes a significant amount of work if you want to figure out how to actually make things efficient in practice on real-world GPUs, and there's all kinds of optimizations that you can do on the systems.

Yeah.

Basically, if you can generate more tokens per second, what this means is that, you know, for the same price, the same amount of hardware, the same number of GPUs, you can produce more tokens.

And so the cost per token is going to go down.
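
The cost argument is simple arithmetic: fix the price of the hardware, raise the throughput, and the per-token cost falls by the same factor. The prices and throughputs below are made-up illustrative numbers, not Mercury's actual figures:

```python
# Cost-per-token arithmetic: same hardware price, higher throughput,
# proportionally lower cost per token. All numbers are made up.

gpu_hour_price = 4.00          # $/GPU-hour, assumed
ar_tokens_per_sec = 100        # autoregressive throughput, assumed
diff_tokens_per_sec = 1000     # diffusion throughput (10x), assumed

def cost_per_million_tokens(tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hour_price / tokens_per_hour * 1e6

ar_cost = cost_per_million_tokens(ar_tokens_per_sec)
diff_cost = cost_per_million_tokens(diff_tokens_per_sec)
print(f"autoregressive: ${ar_cost:.2f} per million tokens")
print(f"diffusion     : ${diff_cost:.2f} per million tokens")
```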

And that's why we're able to serve our models much more cheaply than what you would get if you were to use traditional autoregressive models, because we make better use of the existing hardware.