Joe Carlsmith
Podcast Appearances
Modulo some difficulties with capabilities, you can get a model to output the behavior that you want.
If it doesn't, you crank on it until it does.
Right.
And I think everyone admits that suitably sophisticated models are going to have a very detailed understanding of human morality.
But the question is: what relationship is there between a model's verbal behavior, which you've essentially clamped (you're saying the model must say such-and-such things), and the criteria that end up influencing its choice between plans?
And there, I'm pretty cautious about saying, well, when it says the thing I forced it to say, or gradient-descended it into saying, that's a lot of evidence about how it's going to choose in a bunch of different scenarios.
For one thing, even with humans, it's not necessarily the case that their verbal behavior reflects the actual factors that determine their choices.
They can lie.
They may not even know what they would do in a given situation.
For folks who are unfamiliar with the basic story and are wondering, wait, why would they take over at all? What is literally any reason that they would do that?
So the general concern is this: if you're really offering someone power, and especially power for free, power is almost by definition useful for lots of values.
And if we're talking about an AI that really has the opportunity to take control of things, and some component of its values is focused on some outcome, on the world being a certain way, especially in a longer-term way, such that the horizon of its concern extends beyond the period the takeover plan would encompass,
then the thought is that it's just often the case that the world will be more the way you want it if you control everything than if you remain an instrument of the human will, or of some other actor, which is what we're hoping these AIs will be.