Stefano Ermon
Yeah, I think it changes the design space a little bit, in terms of how many flops you have access to and how memory-bound you are, but not fundamentally.
You still need to process, and you still need to be able to look at all the context, all the past information, to generate good-quality answers, whether you do them one token at a time or in parallel.
You kind of have to look at the past.
And so there is something pretty fundamental there.
Yeah, so essentially there is an element of error correction: the models are trained to fix mistakes, and they constantly revise the answer.
And so initially the answers are not coherent, and then they get increasingly better as you throw more compute at them.
You can also think of it as another dimension over which you can scale compute at test time.
So test-time inference, test-time compute, to trade off quality for speed or cost.
And so that's kind of like the fundamental trade-off that is exposed by a diffusion language model.
It just provides you an axis to control quality versus speed.
And there is a fundamental trade-off.
Like if you don't want to do too many denoising steps, too many passes over the output, the quality is not going to be as good as what you would get if you refined it many, many times.
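The refinement loop described above can be sketched in a few lines. This is a toy illustration only, not an actual diffusion language model: the "model" here is a stand-in that corrects a couple of positions per denoising step, and the `target`, `fix_per_step`, and `quality` names are hypothetical. It just shows the knob being discussed: more denoising steps, higher quality, higher cost.

```python
import random

def toy_denoise(target, steps, fix_per_step=2, seed=0):
    """Toy sketch of diffusion-style iterative refinement.

    `target` stands in for the ideal answer. We start from pure noise
    and, at each denoising step, the (hypothetical) model revises a few
    positions. More steps -> a draft closer to the target, at the cost
    of more compute. Real models predict tokens; this sketch just copies
    from `target` to make the steps/quality trade-off visible.
    """
    rng = random.Random(seed)
    vocab = sorted(set(target))
    draft = [rng.choice(vocab) for _ in target]  # fully noisy initial draft
    for _ in range(steps):
        wrong = [i for i, (d, t) in enumerate(zip(draft, target)) if d != t]
        for i in rng.sample(wrong, min(fix_per_step, len(wrong))):
            draft[i] = target[i]  # "model" revises this position
    return draft

def quality(draft, target):
    """Fraction of positions that match the target."""
    return sum(d == t for d, t in zip(draft, target)) / len(target)
```

With a fixed seed, quality is non-decreasing in the number of steps, which is exactly the quality-versus-speed axis being described: stopping early is cheap but rough, running many passes converges on a clean answer.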
That's exactly the same.
Exactly, exactly.
So, you know, even in the context of image and video generation, you can usually control the number of denoising steps.
The more denoising steps you take, the higher the quality, but of course the more expensive it becomes and the more time it takes.
So it's related in the sense that it needs to output something.
And then if there is a tool call involved, then you need to essentially wait until the result of the tool call comes back.