Stefano Ermon
So what we found is that that kind of architecture actually works pretty well as a backbone for a diffusion language model, too.
And so we've used what we knew worked well and is well supported by existing frameworks and open-source code.
And so, yeah, the neural networks are not that different.
It's just that they're trained in a different way and used in a different way at inference time.
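A toy sketch of that point (purely illustrative; the `backbone` function below is a hypothetical stand-in for a transformer, not anyone's actual model): the same network can be paired with a next-token-prediction objective, as in autoregressive training, or with a mask-and-recover objective of the kind used in (simplified) diffusion-style training.

```python
# Toy illustration: one backbone, two training objectives.
MASK = -1  # sentinel for a corrupted/masked token

def backbone(tokens):
    # Hypothetical stand-in for a transformer forward pass: returns one
    # "predicted" token per position. Real models return logits over a
    # vocabulary; the rule used here is arbitrary.
    return [(t + 1) % 10 if t != MASK else 0 for t in tokens]

def autoregressive_targets(seq):
    # Autoregressive objective: predict token i+1 from the prefix up to i.
    # Returns (input, target) pairs for the loss.
    return list(zip(seq[:-1], seq[1:]))

def diffusion_targets(seq, masked_positions):
    # Diffusion-style (masked denoising) objective: corrupt some positions,
    # then train the network to recover the original tokens there.
    corrupted = [MASK if i in masked_positions else t
                 for i, t in enumerate(seq)]
    targets = {i: seq[i] for i in masked_positions}
    return corrupted, targets

seq = [1, 2, 3, 4, 5]
print(autoregressive_targets(seq))     # [(1, 2), (2, 3), (3, 4), (4, 5)]
print(diffusion_targets(seq, {1, 3}))  # masked input plus what to recover
```

The backbone itself is untouched in both cases; only the corruption of the input and the prediction target change, which is the sense in which "the neural networks are not that different."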
That's right.
Yes.
And I think perhaps that's actually suboptimal.
I mean, people have kind of converged on transformers as being a really good architecture for autoregressive models.
These days people also use them for diffusion models, like diffusion transformers.
So it's an architecture that's widely used across different modalities and across different kinds of generative models.
But it's possible that there are better architectures that shine once the generative model is no longer autoregressive.
So I think the design space is different.
I think there's a lot of room for R&D and further improvements just by matching the neural network architecture to the training objective and to the computations we do at inference.
Right.
That makes sense.
Very cool.
Yeah, basically what's parallelized is that the network can modify multiple tokens at the same time.
Wow.
And so that's kind of like what you were seeing.
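A minimal sketch of the parallelism being described (toy code only; `backbone` is a hypothetical placeholder, and the update rule is made up for illustration): an autoregressive sampler calls the network once per new token, while a diffusion-style sampler fills in several masked positions with each call.

```python
# Toy comparison: network calls needed to generate 8 new tokens.
MASK = -1  # sentinel for a not-yet-generated token

def backbone(tokens):
    # Hypothetical stand-in for a transformer: proposes a token for every
    # position. Made-up rule: continue counting from the nearest filled
    # token to the left. Real models predict from full context.
    out, last = [], 0
    for t in tokens:
        if t != MASK:
            last = t
        else:
            last = (last + 1) % 10
        out.append(last)
    return out

def autoregressive(prompt, n_new):
    # One network call per generated token, strictly left to right.
    seq, calls = list(prompt), 0
    for _ in range(n_new):
        seq.append(backbone(seq + [MASK])[-1])
        calls += 1
    return seq, calls

def diffusion(prompt, n_new, tokens_per_step=4):
    # Start fully masked; each network call fills several positions at once.
    seq, calls = list(prompt) + [MASK] * n_new, 0
    while MASK in seq:
        preds = backbone(seq)
        calls += 1
        filled = 0
        for i, t in enumerate(seq):
            if t == MASK and filled < tokens_per_step:
                seq[i] = preds[i]  # multiple tokens updated per call
                filled += 1
    return seq, calls

print(autoregressive([1], 8))  # 8 network calls
print(diffusion([1], 8))       # 2 network calls
```

With 4 tokens unmasked per step, 8 new tokens cost 2 forward passes instead of 8, which is the speedup you see when multiple tokens change at the same time.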