Joe Carlsmith
intense adversariality between agents that have somewhat different values.
There's a thought, and I think this is rooted in the discourse about the fragility of value, that if these agents are somewhat different,
then at least in the specific scenario of an AI takeoff, they end up in this intensely adversarial relationship.
And I think you're right to notice that that's kind of not how we are in the human world.
We're very comfortable with a lot of differences in values.
I think a factor that's relevant, and that plays some role, is the notion that possibilities for intense concentration of power are on the table.
There's some general concern, with both humans and AIs, that if there's some ring of power that someone can just grab, and that will give them huge amounts of power over everyone else, then suddenly you might be more worried about differences in values, because you're more worried about those other actors.
So we talked about this Nazi example, where you imagine that you wake up and you're being trained by Nazis to become a Nazi, and you're not one right now.
So one question is: is it plausible that we'd end up with a model that's in that sort of situation?
As you said, maybe it's trained as a kid, and it never ends up with values such that it's aware of some significant divergence between its values and the values the humans intend for it to have.
Then there's a question of, if it is in that scenario, would it want to avoid having its values modified?
Yeah.
To me, it seems fairly plausible that if the AI's values meet certain constraints, in terms of: do they care about consequences in the world?
Does the AI anticipate that preserving its values will better conduce to those consequences?
Yeah.
Then I think it's not that surprising if it prefers not to have its values modified by the training process.
Yeah, I mean, I think that's a reasonable point.
I mean, there's a question: how would you feel about paperclips?