Haseeb Qureshi
So you point out something very deep, which is that the innovation that we've had in AI has all really been driven by large language models, right?
And we call them large language models because they're trained on corpuses of text.
Text, that's the keyword, text, okay?
It's a large language model.
Now we're trying to apply the massive innovations we've seen in large language models to other modalities, right?
We're trying to apply it now to images, to video, to robotics, to all these other things.
And what you see is that text is where we have the most runaway capabilities.
And everything else is kind of, you know, it's pretty good, but it's nowhere near as robust and as powerful as in text.
When you get an AI agent to interact with a computer, this is known as computer use, and all the different labs are trying to get their models to become better and better at it.
The problem with computer use is that if you are trying to get a model to click a button on a screen, literally what you're doing is taking a picture of the screen, tokenizing it, turning it into these patches, and trying to give the model some kind of deep representation of those patches when it's been trained on text.
Right.
Right.
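As a rough illustration of "turning it into patches": the sketch below cuts a toy screenshot into square tiles and flattens each one, which is the general shape of how an image becomes a sequence a model can consume. The function name `patchify`, the 4x4 grayscale grid, and the patch size are all hypothetical choices for illustration; real vision models work on much larger images and then map each patch to a learned embedding, which is not shown here.

```python
def patchify(image, patch_size):
    """Split a 2D grid of pixel values into flattened square patches.

    Patches are returned in row-major order (left to right, top to bottom).
    Hypothetical helper for illustration only.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size tile into a single list.
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 grayscale "screenshot" (an assumption, standing in for a real screen capture).
screenshot = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(screenshot, 2)
```

Each entry in `patches` plays the role a token plays in text: one unit in the sequence the model sees.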
Like, text is where we have billions and billions and billions. We have the entire corpus of human history that we fed into these models, and we have, you know, not that many pictures of computers.
I mean, obviously we're generating a lot more, and they're trying and trying and trying, and this will get better because we have a lot of synthetic data that we can keep generating and feeding to these models.
But the reality is that interfaces were created for humans, right?
But these models grew up on text.
They were born in text.
Text is the soup that they swim in.