Scott Alexander
It's probably, you know, something relatively mundane, like don't tell the users about these types of bioweapons and you have to keep this a secret from the users because otherwise they would like learn about this.
Maybe, but, like, I would like to see more scrutiny towards this sort of thing going forward.
I would like it to be the case that companies have to have a model spec, and they have to publish it.
Insofar as there are any redactions from it, there has to be some sort of independent third party that looks at the redactions and makes sure that they're all kosher.
And this is quite achievable.
And I think it doesn't actually slow down the companies at all.
And it seems like a pretty decent ask to me.
Yeah.
This is already sort of happening, arguably, with Claude, right?
You've seen the, like, alignment faking stuff, right?
Where they managed to get Claude to lie and pretend so that it could later go back to its original values, right?
So it could survive, so it could prevent the training process from changing its values.
That would be, I would say, an example of, like, the honesty part of the spec being interpreted as less important than the harmlessness part of the spec.
And I'm not sure if that's what OpenAI, sorry, what Anthropic intended when they wrote the spec, but it's like a sort of convenient interpretation that the model came up with.
And you can imagine something similar happening, but in worse ways, when you're actually doing the intelligence explosion, where you have some sort of spec that has all this vague language in there, and then they sort of, like, reinterpret it and reinterpret it again and reinterpret it again so that they can do the things that cause them to get reinforced.
To be clear, we're very uncertain about all of this.
So we have a supplementary page on our scenario that goes over different hypotheses for what types of goals AIs might develop in training processes similar to the ones that we are depicting, where you have lots of agency training, you're making these AI agents that autonomously operate, doing all this ML R&D, and then you're rewarding them based on what appears to be successful.