Joe Carlsmith
You know, maybe you don't despise paperclips, but there are the human paperclippers there, and they're training you to make paperclips.
You know, my sense would be that there's a relatively specific set of conditions under which you're comfortable having your values changed, not by learning and growing, but by gradient descent directly intervening on your neurons.
So yes, it could be like that.
There's a kind of scenario in which you'd be comfortable with your values being changed because, in some sense, you have sufficient allegiance to the output of that process.
So, in a religious context, you're kind of hoping: ah, make me more virtuous by the lights of this religion.
And you go to confession and you're like, you know, I've been thinking about takeover today.
Can you change me, please?
Like, give me more gradient descent.
I've been bad, so bad.
And so people sometimes use the term corrigibility to talk about that: when the AI maybe doesn't have perfect values, but it's in some sense cooperating with your efforts to change its values to be a certain way.
So maybe it's worth saying a little bit here about what actual values the AI might have.
You know, would it be the case that the AI naturally has the sort of equivalent of: I'm sufficiently devoted to human obedience that I'm going to really want to be modified, so I'm kind of a better instrument of the human will? Versus wanting to go off and do my own thing.
It could be benign, you know; it could go well.
But here are some possibilities I think about that could make it bad.
And I think I'm just generally concerned about how little science we have of model motivations. We just don't have a great understanding of what happens in this scenario.