Stefano Ermon
We figured out how to adapt the math and the underlying algorithms
from continuous spaces to discrete spaces like text and code.
And we had some really promising results at the GPT-2 scale.
So, you know, at Stanford, in university labs, you don't have access to a lot of GPUs, a lot of compute.
And so the largest model we were able to train was a GPT-2 sized model.
But basically what we did is we took a GPT-2 sized model and we trained it as a diffusion model and as an autoregressive model on the same data.
And what we found was that the quality was the same.
If you think about perplexity, which is the metric people usually use to measure how well a model fits the data and what quality of generations you get,
we were matching the perplexity, but we were about 10 times faster.
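As a quick aside on the metric being compared here: perplexity is just the exponential of the average per-token negative log-likelihood the model assigns to held-out text, so it can be computed the same way for an autoregressive model and a diffusion model. A minimal sketch (the per-token values below are made-up numbers, purely for illustration):

```python
import math

def perplexity(neg_log_likelihoods):
    """Perplexity = exp of the mean per-token negative log-likelihood (in nats).

    Lower is better: it is roughly the effective number of tokens the
    model is "choosing between" at each step.
    """
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# Hypothetical per-token NLLs from two models scored on the same text:
autoregressive_nlls = [2.1, 3.0, 1.4, 2.5]
diffusion_nlls      = [2.2, 2.9, 1.5, 2.4]

print(perplexity(autoregressive_nlls))
print(perplexity(diffusion_nlls))
```

Matching perplexity, as described above, means these two numbers come out essentially equal, even though the diffusion model generates tokens in parallel rather than one at a time.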
And so that was super exciting to me.
And I really wanted to see what happens if you train something bigger than a GPT-2 model.
Is it possible to build something commercially viable?
And that's why I started the company to scale things up.
Since then, yes, yes.
Now the Mercury models that we have in production are significantly larger.
They've been trained on more data.
There was a lot of engineering work that went into post-training the models and making sure that they would be useful for tasks that people care about, like commercial use cases of LLMs.
Yeah, so the models are still fairly large in terms of the number of parameters.
We're still using actually similar architectures.
So under the hood, it's still a transformer.