Demis Hassabis
So the reason we're building these kinds of models is that we've always felt this was necessary. We're obviously progressing on the normal language models with our Gemini model, but from the beginning we wanted Gemini to be multimodal, so it could take any kind of input, images, audio, video, and output anything.
And so we've been very interested in this because for an AI to be truly general, to build AGI, we feel that the AGI system needs to understand the world around us and the physical world around us, not just the abstract world of languages or mathematics.
And of course, that's what's critical for robotics to work.
It's probably what's missing from robotics today.
And also things like smart glasses, a smart glasses system that helps you in your everyday life.
It's got to understand the physical context that you're in and how the intuitive physics of the world works.
So we think that building these types of models, the Genie models and also Veo, the best text-to-video models, is an expression of us building world models that understand the dynamics of the world, the physics of the world.
If you can generate it, then that's an expression of your system understanding those dynamics.
Yeah, that's right.
So if you look at our Gemini Live version of Gemini, where you can hold up your phone to the world around you, I'd recommend any of you try it.
It's kind of magical what it already understands about the physical world.
You can think of the next step as incorporating that in some sort of more handy device like glasses.
And then it will be an everyday assistant.
It'll be able to recommend things to you as you're walking the streets, or we can embed it into Google Maps.
And then with robotics, we've built something called the Gemini Robotics models, which are essentially Gemini fine-tuned with extra robotics data.
And what's really cool about that, and we released some demos of this over the summer, is that we've got these tabletop setups of two robotic hands interacting with objects on a table.
And you can just talk to the robot.
So you can say, you know, put the yellow object into the red bucket or whatever it is, and it will interpret that instruction, that language instruction, into motor movements.
And that's the power of a multimodal model rather than a robotics-specific model: it brings real-world understanding to the way you interact with it.
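The instruction-to-action loop described above can be sketched as a toy interface. Everything here is illustrative, not Google's actual API: the real system grounds the instruction in camera input with a fine-tuned multimodal model, while this sketch stubs that step out with a trivial keyword matcher, keeping only the overall shape of "language in, motor commands out".

```python
from dataclasses import dataclass

@dataclass
class Action:
    """A structured pick-and-place action parsed from a language instruction."""
    verb: str
    obj: str
    target: str

# Toy stand-in for the model's visual grounding: in the real system a
# multimodal model would identify objects from the camera feed; here we
# just keyword-match against a fixed list (purely illustrative).
KNOWN_OBJECTS = ["yellow object", "red bucket", "blue block"]

def parse_instruction(text: str) -> Action:
    """Map a natural-language command to a structured Action (stubbed)."""
    text = text.lower()
    found = [o for o in KNOWN_OBJECTS if o in text]
    if len(found) < 2:
        raise ValueError("could not ground instruction in known objects")
    return Action(verb="put", obj=found[0], target=found[1])

def to_motor_commands(action: Action) -> list[str]:
    """Expand an Action into a command sequence; a real policy would
    emit joint-space trajectories, this sketch emits labeled steps."""
    return [f"grasp:{action.obj}", f"move_to:{action.target}", "release"]

cmds = to_motor_commands(
    parse_instruction("Put the yellow object into the red bucket"))
print(cmds)  # → ['grasp:yellow object', 'move_to:red bucket', 'release']
```

The point of the sketch is the interface boundary: the language model's job ends at a structured action, and a separate motor policy turns that into movement.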