Dwarkesh Patel
π€ SpeakerVoice Profile Active
This person's voice can be automatically recognized across podcast episodes using AI voice matching.
Appearances Over Time
Podcast Appearances
It's a bit like saying to somebody pasteurizing milk, hey, you should stop boiling that milk because eventually you want to serve it cold.
Of course, but this is an intermediate step to facilitate the final output.
By the way, LLMs are clearly developing a deep representation of the world because their training process is incentivizing them to develop one.
I use LLMs to teach me about everything from biology to AI to history, and they are able to do so with remarkable flexibility and coherence.
Now, are LLMs specifically trained to model how their actions will affect the world?
No, they are not.
But if we're not allowed to call their representations a world model,
then we're defining the term world model by the process that we think is necessary to build one, rather than the obvious capabilities that this concept implies.
Okay, continual learning.
I'm sorry to bring up my hobby horse again.
I'm like a comedian who has only come up with one good bit, but I'm going to milk it for all it's worth.
An LLM that's being RL'd on outcome-based rewards learns on the order of one bit per episode, and an episode might be tens of thousands of tokens long.
Now, obviously, animals and humans are clearly extracting more information from interacting with our environment than just the reward signal at the end of an episode.
Conceptually, how should we think about what is happening with animals?
I think we're learning to model the world through observations.
This outer loop RL is incentivizing some other learning system to pick up maximum signal from the environment.
In Richard's oak architecture, he calls this the transition model.
And if we were trying to pigeonhole this feature spec into modern LLMs, what you do is fine tune on all your observed tokens.
From what I hear from my researcher friends, in practice, the most naive way of doing this actually doesn't work very well.
Now, being able to learn from the environment in a high throughput way is obviously necessary for true AGI.