Joe Carlsmith
One thing I'll just say off the bat: when I'm thinking about misaligned AIs, the type I'm worried about, I'm thinking about AIs that have a relatively specific set of properties related to agency, planning, and awareness and understanding of the world.
One is this capacity to plan: to make relatively sophisticated plans on the basis of models of the world, where those plans are being evaluated according to criteria.
That planning capability needs to be driving the model's behavior.
So there are models that are in some sense capable of planning, but when they give an output, it's not like that output was determined by some process of planning, like: here's what'll happen if I give this output, and do I want that to happen?
The model needs to really understand the world, right?
It needs to really be like, okay, here's what will happen.
Here I am, here's my situation, here's the politics of the situation.
It really needs that kind of situational awareness to be able to evaluate the consequences of different plans.
I think the other thing is that the verbal behavior of these models need bear no relation to that. When I talk about a model's values, I'm talking about the criteria that end up determining which plans the model pursues, right? And a model's verbal behavior, even if it has a planning process, which GPT-4, I think, doesn't in many cases, doesn't need to reflect those criteria.
Right. And so, you know, we know that we're going to be able to get models to say what we want to hear. That is the magic of gradient descent.