Sergey Levine
LLMs started off being trained purely with next token prediction, and that provided an excellent starting point first for all sorts of synthetic data generation and then for RL.
So I think it makes total sense that we would expect basically any foundation model effort to follow that same trajectory where we first build out the foundation, essentially in like a somewhat brute force way.
And the stronger that foundation gets, the easier it is to then make it even better with much more accessible training.
I really hope that they will actually be the same.
And, you know, obviously I'm extremely biased.
I love robotics.
I think it's like it's very fundamental to AI.
But I think, optimistically, that it's actually the other way around: that the robotics element of the equation will make all the other stuff better.
There are two reasons for this that I could tell you about.
One has to do with representations and focus.
As I said before, with video prediction models, if you just want to predict everything that happens, it's very hard to figure out what's relevant.
If you have the focus that comes from actually trying to do a task, now that acts to structure how you see the world in a way that allows you to more fruitfully utilize the other signals.
That could be extremely powerful.
The second one is that understanding the physical world at a very deep fundamental level, at a level that goes beyond just what we can articulate with language, can actually help you solve other problems.
And we experience this all the time.
Like when we talk about abstract concepts, we say things like "this company has a lot of momentum."
We'll use social metaphors to describe inanimate objects, like "my computer hates me."
We experience the world in a particular way and our subjective experience shapes how we think about it in very profound ways.
And then we use that as a hammer to basically hit all sorts of other nails that are far too abstract to handle any other way.