Joe Carlsmith
Right?
You understand it.
This is why I'm saying, when you're like, "Oh, the models really understand human values," it's like, yeah.
Yes, I think that's... So yeah, I think basically a decent portion of the hope here, or an aim, should be that we're never in the situation where the AI really has very different values, is already quite smart, really knows what's going on, and is now in this kind of adversarial relationship with our training process, right?
So we wanna avoid that.
That's the main thing, and I think it's possible we can avoid it via the sorts of things you're saying.
So I'm not like, ah, that'll never work.
The thing I just wanted to highlight was: if you get into that situation, and if the AI at that point is genuinely much, much more sophisticated than you and doesn't want to reveal its true values for whatever reason, then, you know, when it's shown some kind of obviously fake opportunity to defect to the Allies, like the children in that example, right?
It's not necessarily going to be a good test of what it will do in the real circumstance, because it's able to tell.
Yeah.
I mean, I don't know.
I think I'm hesitant to say it's like drugs for the model. Like, I think there's... But broadly speaking, I do basically agree that we have really quite a lot of tools and options for training AIs, even AIs that are somewhat smarter than humans.
I do think you have to actually do it.
So, you know, compared to, say, Eliezer, who I think you've had on, I'm much, much more bullish on our ability to solve this problem, especially for AIs that are in what I think of as the "AI for AI safety sweet spot": this band of capability where they're both sufficiently capable that they can be really useful for strengthening the various factors in our civilization that can make us safe.