Dwarkesh Patel
π€ SpeakerVoice Profile Active
This person's voice can be automatically recognized across podcast episodes using AI voice matching.
Appearances Over Time
Podcast Appearances
So it sounds like the first fun thing to do is probably to start looking at what an order book actually looks like.
If this sounds interesting to you, you should consider working at Hudson River Trading.
I was talking to this researcher, Sander, at GDM, and he works on video and audio models.
And he made the interesting point that the reason, in his view, we aren't seeing that much transfer learning between different modalities, that is to say, like training a language model on video and images, doesn't seem to necessarily make it that much better at textual learning.
questions and tasks, is that images are represented at a different semantic level than text.
And so his argument is that text has this high-level semantic representation within the model, whereas images and videos are just like compressed pixels.
There's not really a semantic... When they're embedded, they don't represent some high-level semantic information.
They're just like compressed pixels.
And therefore, there's...
there's no transfer learning at the level at which they're going through the model.
And obviously, this is super relevant to the work you're doing because your hope is that by training the model both on the visual data that the robot sees, visual data generally, maybe even from YouTube or whatever eventually, plus language information, plus action information from the robot itself, all of this together will make it generally robust.
And then you had a really interesting blog post about why video models aren't as robust as language models.
Sorry, this is not a super well-formed question.
I just wanted you to react to that.
By the way, the fact that video models aren't as robust, is that bearish for robotics?
Because it will, so much of the data you will have to use will not, I guess some of, you're saying a lot of it will be labeled, but like, ideally you just want to be able to like throw all of everything on YouTube, every video we ever recorded and have it learn how the physical world works and how to like move about, et cetera, just see humans performing tasks and learn from that.
But if, yeah, I guess you're saying like it's hard to learn just from that and it actually needs to practice the task itself.
famously LLMs have all these emergent capabilities that were never engineered in because somewhere in internet text is the data to train and to give it the knowledge to do a certain kind of thing.
With robots, it seems like you are collecting all the data manually.
So there won't be this mysterious new capability that like is somewhere in the data set that you haven't purposefully collected, which seems like it should make it even harder to then have