Stefano Ermon
Yeah.
And so, you know, it does make a lot of sense.
One of the challenges is that, if you wanted to go straight from voice to voice, these kinds of interactions often still involve tool calls.
So if you're doing customer support, you might still need to be able to query a database, or check a calendar for availabilities, or look up the menu to get the prices.
And so there still needs to be some text, I think, some code involved, which makes it a little bit more tricky to develop.
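To make this concrete, here is a minimal sketch of the pattern being described: even in a voice pipeline, the model emits a tool call as structured text, the application runs it, and the text result is fed back for the model to speak. The function names (`check_calendar`, `handle_tool_call`) and the JSON shape are illustrative assumptions, not any specific product's API.

```python
# Sketch: a voice agent still routes through text for tool calls.
# All names and schemas here are hypothetical, for illustration only.
import json

def check_calendar(date: str) -> list[str]:
    """Stand-in for a real calendar lookup."""
    slots = {"2025-01-15": ["10:00", "14:30"]}
    return slots.get(date, [])

def handle_tool_call(call_json: str) -> str:
    """Dispatch a model-emitted tool call (JSON text) and return
    the result as text for the model to turn back into speech."""
    call = json.loads(call_json)
    if call["name"] == "check_calendar":
        result = check_calendar(call["arguments"]["date"])
    else:
        result = "unknown tool"
    return json.dumps(result)

print(handle_tool_call(
    '{"name": "check_calendar", "arguments": {"date": "2025-01-15"}}'
))
```

The point of the sketch is that the JSON in and out is plain text, which is why some text or code handling stays in the loop even for a voice-to-voice product.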
But we are very excited about eventually getting to something that is actually multimodal.
The existing Mercury models are just text only or code only.
But we know these kinds of models work really well for image, video, and music.
And so if we put everything together, we could get to something truly phenomenal that handles different kinds of modalities, a real world model that understands everything and puts together the learnings and the signals from all the different modalities. It's definitely something we want to do at some point.

Yeah, that would be so awesome. Would that be something that would be useful for, like, a robotic kind of situation? Or would it be more for, like, simulations that you can use to train robots, in your opinion?

It could be a mix. It could be decision making, like if you're using video or other kinds of sensors as input, and then use the model to make decisions or
kind of like analyze what's going on in the surroundings.
It's a very useful kind of application of this technology.
In fact, we've already heard it from some early adopters that they would love for our models to have image
inputs because they're building computer agents.
And so that's another space where you really need to be quick.
You need to be able to interact fast with whatever software, whatever application the agent interacts with.
But it's important not to just look at the text and the HTML code of a web page, let's say, but to actually see what's happening.
And so that would open up a lot of other applications, I think.
For computer use, no, it would be more like controlling the actions.