Stefano Ermon
regime to a compute-bound regime, where you're basically bounded by the number of FLOPs you have available on the GPU, which is a much easier quantity to scale.
Like if you were a chip manufacturer, it's a lot easier to add FLOPs than to increase memory bandwidth.
Yeah, some of the labs are very entrenched in a certain stack.
And so there's a big cost if they were to switch to something different.
There is quite a bit of secret sauce involved in terms of like, what is the right way to train these models?
What is the right way to...
you know, even just sample from them.
Like it's not as obvious as, okay, generate one token, append it, generate the next one, append it.
There is not much you can do on the inference side if you have a traditional autoregressive model.
But on a diffusion model, the design space for inference algorithms is much broader.
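To make the contrast concrete, here is a toy sketch (my illustration, not code from the speaker): autoregressive decoding is locked into one-token-at-a-time appending, while a masked-diffusion-style sampler exposes extra inference knobs, such as how many denoising steps to run and how many positions to commit per step. The function names and the `denoise`/`predict_next` interfaces are hypothetical.

```python
# Hypothetical sketch: autoregressive decoding fixes the inference recipe --
# one forward pass per token, appended left to right.
def autoregressive_decode(predict_next, prompt, n_tokens):
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(predict_next(seq))  # exactly one token per step
    return seq

# A masked-diffusion-style sampler (assumed interface) has a broader design
# space: number of denoising steps, and how many masked positions to fill
# in parallel at each step, are both tunable.
def diffusion_decode(denoise, length, n_steps):
    MASK = None
    seq = [MASK] * length                      # start fully masked
    per_step = max(1, length // n_steps)       # positions committed per step
    for _ in range(n_steps):
        masked = [i for i, t in enumerate(seq) if t is MASK]
        if not masked:
            break
        # commit several positions in parallel -- a design choice with no
        # analogue in the autoregressive loop above
        for i, tok in denoise(seq, masked[:per_step]):
            seq[i] = tok
    return seq
```

With a real model, `per_step` and `n_steps` become quality/throughput dials; the autoregressive loop has no comparable dials on the sampling side.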
And then there is also the issue at the ML systems level. If you think about actually serving production workloads, there is a decent amount of open source and, of course, closed source solutions for autoregressive models.
Things like vLLM, SGLang, TensorRT. There are pretty mature serving stacks for autoregressive models.
For diffusion models, it's much earlier. We have our own stack, but it takes a significant amount of work to figure out how to actually make things efficient in practice on real-world GPUs, and there are all kinds of optimizations you can do on the systems side.
Yeah.
Basically, if you can generate more tokens per second, what this means is that for the same amount of hardware, for the same number of GPUs, you can produce more tokens.
And so the cost per token is going to go down.
And that's why we're able to serve our models much more cheaply than what you would get if you were to use traditional autoregressive models, because we make better use of the existing hardware.
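The cost argument above is simple division, which a few lines make explicit. All numbers here are made up for illustration, not the speaker's actual figures:

```python
# Illustrative arithmetic: cost per token = (GPU cost per second) / (tokens
# per second). Hypothetical $2/GPU-hour rate; not a quoted price.
gpu_cost_per_hour = 2.00
gpu_cost_per_sec = gpu_cost_per_hour / 3600

def cost_per_million_tokens(tokens_per_sec):
    return gpu_cost_per_sec / tokens_per_sec * 1_000_000

# Same hardware, higher throughput -> proportionally lower cost per token.
slow = cost_per_million_tokens(100)    # e.g. sequential decoding
fast = cost_per_million_tokens(1000)   # e.g. more parallel decoding
# A 10x throughput gain cuts the per-token cost by exactly 10x.
```

The point is just that throughput and cost per token are inversely proportional at fixed hardware spend.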