Joe Carlsmith
So, you know, maybe the AIs really want to be, like, shmelpful and shmonest and shmarmless, right?
But their concept is, like, importantly different from the human concept.
And they know this.
So they know that the human concept would mean blah.
But their values ended up fixating on, like, a somewhat different structure.
Yeah.
So that's, like, another version.
And then a fifth version, which I think about less because it's just such an own goal if you do this, but I do think it's possible.
You could have AIs that are actually just doing what it says on the tin.
Like, you have AIs that are just genuinely aligned to the model spec.
They're just really trying to benefit humanity and reflect well on OpenAI.
And what's the other one?
Assist the developer or the user, right?
Yeah.
But your model spec, unfortunately, was just not robust to the degree of optimization that this AI is bringing to bear.
And so, when it's looking out at the world and asking, what's the best way to reflect well on OpenAI and benefit humanity and so on,
it decides that, you know, the best way is to go rogue.
I think that's a real own goal, because at that point you got so close.
You really just have to write the model spec well and red-team it suitably.