Stefano Ermon
And so the costs are actually significantly lower.
Yeah, so that really depends more on the architecture than whether it's a diffusion model or an autoregressive model.
Right now, as I mentioned, we're still using self-attention, which unfortunately scales pretty poorly with the context length.
So I would say there is no difference.
It's neither better nor worse than an autoregressive model as you think about longer contexts.
Our models support roughly 100K tokens of context length.
We could potentially scale that up more.
Again, it's not something that is very different.
If you think about an autoregressive model versus a diffusion model, it's more a function of the underlying architecture.
And in fact, we can actually use alternative architectures that scale better with respect to the context like state-space models or other attention variants that are more efficient.
We have some preliminary results showing that everything is compatible with different kinds of backbones, but nothing is in production at the moment.
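The scaling contrast described here can be sketched with a back-of-the-envelope cost model. This is purely illustrative (the cost functions and the hidden dimension are assumptions, not Ermon's numbers): per layer, self-attention compares every token with every other token, so its cost grows quadratically with context length, while a state-space-style recurrence updates a fixed-size state once per token and grows only linearly.

```python
def self_attention_cost(n, d):
    # Self-attention scores every pair of tokens:
    # roughly O(n^2 * d) multiply-adds per layer.
    return n * n * d

def state_space_cost(n, d):
    # A state-space / linear-recurrence layer updates a fixed-size
    # state once per token: roughly O(n * d^2) multiply-adds.
    return n * d * d

d = 1024  # hidden dimension (illustrative choice)
for n in (1_000, 10_000, 100_000):  # context lengths up to ~100K tokens
    ratio = self_attention_cost(n, d) / state_space_cost(n, d)
    print(f"n={n:>7}: attention costs ~{ratio:.0f}x the recurrent layer")
```

At 1K tokens the two are comparable; at 100K tokens the quadratic term dominates by roughly two orders of magnitude, which is why longer contexts push toward state-space models or more efficient attention variants regardless of whether the model is diffusion-based or autoregressive.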
Nothing particularly.
I think it's just a fundamental problem for which it's going to be hard to get a real breakthrough.
There are just inherent trade-offs.
I think of them in terms of sufficient statistics: what do you store about your past, and how do you keep track of it? You want to remember the things that are useful and discard the things that are not.
And that's just fundamentally a hard problem.
There is always some kind of no-free-lunch involved: ahead of time, you don't know what you should remember and what you should discard.
And some things are going to be useful for one task and not useful for another.
And so I think it's a fundamentally very difficult problem where you have to make trade-offs.
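The sufficient-statistics trade-off can be made concrete with a toy example (my illustration, not anything from the interview): a fixed-size running summary, here an exponential moving average over a stream of values, keeps memory constant no matter how long the past is, but it irreversibly discards detail. Whether that was the right thing to keep depends on which question you ask afterwards.

```python
def ema_summary(stream, alpha=0.1):
    # A fixed-size "sufficient statistic" for the past: one number.
    # Recent values are weighted more heavily; old detail is lost.
    state = 0.0
    for x in stream:
        state = (1 - alpha) * state + alpha * x
    return state

# A stream with a regime shift halfway through.
stream = [1.0] * 50 + [5.0] * 50

summary = ema_summary(stream)
print(summary)  # close to 5.0: tracks the recent regime well

# But a different downstream question, e.g. the overall range of the
# stream, cannot be answered from the EMA alone: that information was
# discarded when the state was compressed.
print(max(stream) - min(stream))
```

The EMA is a good summary if the question is "what is happening now" and a useless one if the question is "how much did things vary," which is exactly the sense in which no fixed choice of what to remember is right for everything.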