Dwarkesh Patel
And therefore, there's no transfer learning at the level at which they're going through the model.
And obviously, this is super relevant to the work you're doing because your hope is that by training the model both on the visual data that the robot sees, visual data generally, maybe even from YouTube or whatever eventually, plus language information, plus action information from the robot itself, all of this together will make it generally robust.
And then you had a really interesting blog post about why video models aren't as robust as language models.
Sorry, this is not a super well-formed question.
I just wanted you to react to that.
By the way, the fact that video models aren't as robust, is that bearish for robotics?
Because so much of the data you will have to use will not... I guess you're saying a lot of it will be labeled, but ideally you just want to be able to throw in everything on YouTube, every video we've ever recorded, and have it learn how the physical world works and how to move about, et cetera, just see humans performing tasks and learn from that.
But, yeah, I guess you're saying it's hard to learn just from that, and it actually needs to practice the task itself.
Famously, LLMs have all these emergent capabilities that were never engineered in, because somewhere in internet text is the data to train it and give it the knowledge to do a certain kind of thing.
With robots, it seems like you are collecting all the data manually.
So there won't be this mysterious new capability that is somewhere in the dataset that you haven't purposefully collected, which seems like it should make it even harder to then have robust out-of-distribution capabilities.
And so I wonder if the trek over the next 5-10 years will just be like: for each subtask, you have to give it thousands of episodes, and then it's very hard to actually automate much work just by doing subtasks.
So if you think about what a barista does, what a waiter does, what a chef does, very little of it involves just sitting at one station and doing stuff, right?
You've got to move around, you've got to restock, you've got to fix the machine, et cetera, go between the counter and the cashier and the machine, et cetera.
So will there just be this long tail of skills that you have to keep adding episodes for manually, labeling them and seeing how well they did, etc.?
Or is there some reason to think that it will progress more generally than that?