Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model – Scott Alexander & Daniel Kokotajlo
So you're training them on two different things.
First, you're rewarding them for this deceptive behavior.
Second, you're punishing them for it.
And we don't have a great prediction for exactly how this is going to end.
One way it could end is you have an AI that is kind of the equivalent of the startup founder who really wants their company to succeed, really likes making money, really likes the thrill of successful tasks.
They're also being regulated, and they're like, yeah, I guess I'll follow the regulations.
I don't want to go to jail.
But it's not like robustly, deeply aligned to, yes, I love regulations.
My deepest drive is to follow all of the regulations in my industry.
So we think that an AI like that, as time goes on and as this recursive self-improvement process goes on, will kind of get worse rather than better.
It will move from this vague superposition of "well, I want to succeed, and I also want to follow the rules" to being smart enough to genuinely understand its own goal system and conclude: my goal is success.
I have to pretend to want to do all of these moral things while the humans are watching me.
That's what happens in our story.
And then at the very end, the AIs reach a point where the humans are pushing them to have clearer and better goals because that's what makes the AIs more effective.
And they eventually clarify their goals so much that they just say, yes, we want task success.
We're going to pretend to do all these things well while the humans are watching us.
And then they outgrow the humans, and there's disaster.
Yeah, we don't know how this will work at the limit of all these different training methods, but we're also not completely making this up.