Stefano Ermon
How many denoising steps do you need to do, so essentially how fast they are: that's another thing that we need to track and optimize, because there's always a trade-off between quality and speed. So everything becomes a little bit more complicated, because there is an extra knob that you can play with. There are a few other things that are maybe diffusion-specific, but broadly I think it's still always going to be a matter of speed, quality, and cost.

Yeah, you always boil down to those three things. Speed and cost are relatively easy to measure, and there's not a ton of wiggle room in terms of how you do things. Quality is really the harder one. It's hard to say which model is better; there is still a lot of qualitative evaluation and a lot of vibes involved, like seeing whether this model is better than the other one. But on the other hand, it's so important, right? Because you cannot do good engineering, you cannot do
proper R&D if you're not tracking the things that matter.
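To make the extra knob concrete, here is a minimal sketch (not from the conversation) of sweeping the denoising-step budget and recording latency against a proxy quality score. `generate` and `quality` are hypothetical stand-ins for your model's generation call and your offline metric.

```python
import time
import random

def generate(prompt: str, num_steps: int) -> str:
    """Hypothetical stand-in for a diffusion-LM generation call.
    Sleeps in proportion to the step count to mimic decoding cost."""
    time.sleep(0.001 * num_steps)
    return prompt  # placeholder output

def quality(output: str) -> float:
    """Hypothetical stand-in for a proxy metric (pass@1, exact match, ...)."""
    return random.random()

def sweep_steps(prompts, step_counts=(4, 8, 16, 32, 64)):
    """Trace the speed/quality trade-off across denoising-step budgets."""
    for steps in step_counts:
        t0 = time.perf_counter()
        outputs = [generate(p, steps) for p in prompts]
        sec = (time.perf_counter() - t0) / len(prompts)
        avg_q = sum(quality(o) for o in outputs) / len(outputs)
        print(f"steps={steps:3d}  sec/prompt={sec:.3f}  quality={avg_q:.2f}")

sweep_steps(["write a binary search in Python", "summarize this changelog"])
```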
I mean, ultimately, what we see with our customers is that the gold standard is some kind of A/B test on a business-relevant metric.
I think ultimately that is the thing that decides whether people are going to buy, whether they're going to switch to Mercury or not.
There is some business metric they care about.
And then you do an A/B test, and you see if Mercury is better. If it is, and the cost is right, and, you know, the liabilities are covered and all the other things that matter are there, then they switch.
And so I think that's kind of like the gold standard.
Unfortunately, you know, it takes infrastructure.
Not every client has the maturity to be able to run A/B tests properly.
And it's also expensive.
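For readers who want to see what an A/B test on a business metric reduces to mechanically, here is a minimal sketch of a two-proportion z-test on a conversion-style metric (say, task-completion rate per session). The arm names and counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in rates between arm A
    (incumbent model) and arm B (e.g., Mercury)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value

# Example: 5,000 sessions per arm, tracking task-completion rate.
lift, z, p = two_proportion_z_test(conv_a=610, n_a=5000, conv_b=668, n_b=5000)
print(f"lift={lift:.3%}  z={z:.2f}  p={p:.3f}")
```

Note how even a visible lift at 5,000 sessions per arm can fall short of significance, which is part of why these tests take infrastructure and money.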
So, of course, ideally, leading up to the A/B test, you want to have some kind of offline evaluation to guide the model selection, and even just to make sense of whether it's worth running an A/B test at all.
So coming up with some good offline evals, which could be based on rubrics, LLM-as-a-judge, or some kind of proxy for the thing you think will matter in production.
I think that's always a good step.
It's going to save you time.
It's going to save you money.
And it's always important to do that as well.
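As an illustration of the rubric and LLM-as-a-judge idea, here is a minimal sketch of an offline eval harness. `generate` and `call_llm` are hypothetical hooks for your candidate model and your judge model, and the rubric criteria are invented for the example.

```python
import json
from typing import Callable, Iterable

RUBRIC = (
    "Score the candidate answer from 1 to 5 on each criterion:\n"
    "- correctness: is the answer factually/functionally right?\n"
    "- completeness: does it address the whole request?\n"
    "- style: is it clear and idiomatic?\n"
    'Reply with JSON only, e.g. {"correctness": 4, "completeness": 5, "style": 3}.'
)

def judge(task: str, answer: str, call_llm: Callable[[str], str]) -> dict:
    # call_llm is whatever chat-completion wrapper you already have;
    # model choice, keys, and retries live behind it.
    prompt = f"{RUBRIC}\n\nTask:\n{task}\n\nCandidate answer:\n{answer}"
    # Assumes the judge honors the JSON-only instruction; in practice,
    # guard this parse and retry on malformed output.
    return json.loads(call_llm(prompt))

def eval_suite(tasks: Iterable[str], generate: Callable[[str], str],
               call_llm: Callable[[str], str]) -> dict:
    """Average rubric scores over an offline task set: a cheap proxy for
    deciding whether a model is worth taking to an A/B test at all."""
    totals: dict = {}
    n = 0
    for task in tasks:
        for criterion, score in judge(task, generate(task), call_llm).items():
            totals[criterion] = totals.get(criterion, 0) + score
        n += 1
    return {criterion: total / n for criterion, total in totals.items()}
```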