Joe Carlsmith
And hopefully we'd get one before we reach this scenario. But okay, here are the five categories of motivations the model could have. And this hopefully gets at the question of what the model eventually does.
Okay.
So one category is something super alien. It's sort of like: there's some weird correlate of easy-to-predict text, or some weird aesthetic for data structures, that the model developed early in pre-training, or maybe by now, such that it really thinks things should be a certain way. It's something quite alien to our cognition, where we just wouldn't recognize it as a value at all, right?
Another category is a kind of crystallized instrumental drive that is more recognizable to us.
So you can imagine AIs that develop, say, a curiosity drive because that's broadly useful. You mentioned that the model has different heuristics, different drives, different things that function like values, and some of those might actually be somewhat similar to things that were useful to humans and that ended up as part of our terminal values in various ways. So you can imagine curiosity, you can imagine various types of option value, maybe it intrinsically values power itself.
It could value survival, or some analog of survival. Those are possibilities too: proxy drives that could have been rewarded at various stages of this process and made their way into the model's terminal criteria.
A third category is some analog of reward, where at some point part of the model's motivational system has fixated on a component of the reward process, right?
Like the humans approving of me, or numbers getting entered in this data center, or gradient descent updating me in this direction, something like that. There's something in the reward process such that, as it was trained, the model focused on that thing: I really want the reward process to give me reward.
But in order for it to be the type where getting reward then motivates choosing the takeover option, it also needs to generalize such that its concern for reward has some long-time-horizon element. So it not only wants reward, it wants to protect the reward button for some long period, or something like that.
A fourth category is some kind of messed-up interpretation of some human-like concept.