Sergey Levine
Yeah.
And I think we're seeing the beginnings of that with multimodal models.
But I think that multimodality has so much more to it than just like image plus text.
And I think that that's a place where there's a lot of room for really exciting innovation.
Yeah: how we represent both context, meaning what happened in the past, and also plans, or reasoning as you might call it in our world, meaning what we would like to happen in the future, or intermediate processing stages in solving a task.
I think doing that in a variety of modalities, including potentially learned modalities that are suitable for the job, is something that has, I think, enormous potential to overcome some of these challenges.
Interesting.
Yeah, that's a really good question.
So I definitely don't know the answer to this.
I am not by any means well-versed in neuroscience.
But if I had to guess and also provide an answer that leans more on things I know, it's something like this, that the brain is extremely parallel.
It kind of has to be just because of the biophysics.
But it's even more parallel than your GPU.
If you think about how a modern multimodal language model processes the input, if you give it some images and some text, first it reads in the images, then it reads in the text, and then proceeds one token at a time to generate the output.
It makes a lot more sense to me for an embodied system to have parallel processes.
Now, mathematically, you can actually make close equivalences between parallel and sequential stuff.
Like transformers aren't actually fundamentally sequential.
Like you kind of make them sequential by putting in position embeddings.
Transformers are fundamentally actually very parallelizable things.
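That last point can be checked numerically. Below is a minimal NumPy sketch (toy dimensions, random weights, not any real model): single-head self-attention with no position embeddings is permutation-equivariant, so shuffling the input tokens just shuffles the outputs the same way. The layer itself carries no notion of order; position embeddings are what inject it.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Plain single-head self-attention, no positional encoding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax over attention scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(5, d))                         # 5 tokens, dim 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(5)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the inputs permutes the outputs identically: the layer
# computes the same thing in parallel for every token, order-free.
assert np.allclose(out[perm], out_perm)
```

Since attention treats its inputs as an unordered set, the "sequential" character of a transformer comes entirely from position embeddings plus the autoregressive decoding loop, not from the architecture itself.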