Joe Carlsmith
So that's a very specific scenario. If we're in a scenario where power is more distributed, and especially where we're doing decently on alignment, where we're giving the AIs some amount of inhibition about doing various things and maybe succeeding in shaping their values somewhat, then I think it's just a much more complicated calculus. You have to ask: what's the upside for the AI? What's the probability of success for this takeover path?
So why is alignment hard in general? Let's say we've got an AI, and let's again bracket the question of exactly how capable it will be and just talk about this extreme scenario where it really has the opportunity to take over. Maybe we just don't want to deal with having to build an AI that we're comfortable having in that position, but let's focus on it for the sake of simplicity, and then we can relax the assumption.
Okay, so you have some hope. It's like: I'm going to build an AI over here.
One issue is that you can't just test. You can't give the AI this literal situation, have it take over and kill everyone, and then go, oops, and update the weights. This is the thing Eliezer talks about: you care about its behavior in a specific scenario that you can't test directly. Now, we can talk about whether that's a problem, but one issue is that there's a sense in which this has to be somewhat off-distribution, and you have to be getting some kind of generalization from training the AI on a bunch of other scenarios. And then there's the question of how it's going to generalize to the scenario where it really has this option.
Yeah.