Dwarkesh Patel
Podcast Appearances
Right.
I had an example like this when I got a tour of the robots, by the way, at your office.
So it was folding shorts.
And I don't know if there was an episode like this in the... in the training set, but just for fun, I took one of the shorts and turned it inside out.
And then it was able to understand that it first needed to get... So first of all, the grippers are just like this, like two limbs, or just an opposable finger and thumb-like thing.
And it's actually shocking how much you can do with just that.
Yeah, it understood that it first needed to fold it inside out before folding it correctly.
I mean, what's especially surprising about that is...
It seems like this model only has one second of context.
So as compared to these language models, which can often see the entire code base, and they're observing hundreds of thousands of tokens and thinking about them before outputting, and they're observing their own chain of thought for thousands of tokens before making a plan about how to code something up, your model is seeing one image of what happened in the last second.
And it vaguely knows it's supposed to fold this short.
And it's seeing the image of what's happened in the last second.
And I guess it works.
It's crazy that it will just see the last thing that happened and then keep executing on the plan.
So fold it inside out, then fold it correctly.
But it's shocking that a second of context is enough to execute on a minute-long task.
Yeah, I'm curious why you made that choice in the first place and why it's possible to actually do tasks.
If a human could only think ahead a second of memory and had to do physical work, I feel like that would just be impossible.
And how physically will... so you have this like trilemma.