John Schulman
You'd want to have some combination of: the model itself seems really well behaved and has an impeccable moral compass and everything, and you're pretty confident it's extremely resistant to any kind of takeover attempt or severe misuse; and then you'd also want really good monitoring on top of it, so you could detect any kind of trouble. What are you keeping track of while you're doing long-horizon RL, or when you eventually start doing it, so that you could notice this sort of discontinuous jump before you deployed these systems broadly?
I would say you would want to have a lot of evals that you're running during the training process.
And what specifically?
How would you notice something like that?
You'd want to be pretty careful when you do this kind of training if you see a lot of potentially scary capabilities, or if those seem close.
I mean, I would say it's not something we have to be scared of right now, because it's currently hard to get the models to do anything coherent.
But if they started to get really good, I think...
Yeah, I think we would have to take some of these questions seriously, and we would want to have a lot of evals that test the models for misbehavior, or I guess for the alignment of the models.
We would want to check that they're not going to turn against us or something.
But you might also want to look for discontinuous jumps in capabilities, so you'd want lots of evals for the capabilities of the models.
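Concretely, a minimal sketch of what running evals during training and watching for discontinuous jumps could look like, with made-up eval names, scores, and thresholds (purely illustrative, not any lab's actual tooling):

```python
# Hypothetical sketch: run a battery of alignment and capability evals at
# each RL checkpoint and flag any discontinuous jump relative to the
# recent trend. Eval names, scores, and thresholds are all made up.
from statistics import mean

def check_for_jumps(history, new_scores, window=5, jump_threshold=0.15):
    """Compare each eval's new score against its recent moving average.

    history: dict mapping eval name -> list of past scores (0-1 scale).
    new_scores: dict mapping eval name -> score at the latest checkpoint.
    Returns a list of (eval_name, previous_avg, new_score) for flagged evals.
    """
    flagged = []
    for name, score in new_scores.items():
        past = history.setdefault(name, [])
        if len(past) >= window:
            baseline = mean(past[-window:])
            if score - baseline > jump_threshold:
                flagged.append((name, baseline, score))
        past.append(score)
    return flagged

# Example usage with fabricated checkpoint scores.
history = {}
for step, scores in enumerate([
    {"deception_probe": 0.10, "autonomy_tasks": 0.20},
    {"deception_probe": 0.11, "autonomy_tasks": 0.22},
    {"deception_probe": 0.12, "autonomy_tasks": 0.21},
    {"deception_probe": 0.10, "autonomy_tasks": 0.23},
    {"deception_probe": 0.11, "autonomy_tasks": 0.24},
    {"deception_probe": 0.12, "autonomy_tasks": 0.55},  # discontinuous jump
]):
    for name, baseline, score in check_for_jumps(history, scores):
        print(f"step {step}: {name} jumped from ~{baseline:.2f} to {score:.2f}")
```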
I mean, I guess you'd also want to make sure that whatever you're training on doesn't give the model any reason to turn against you, which I would say doesn't seem like the hardest thing to do.
The way we train them with RLHF does feel very safe, even though the models are very smart, because the model is just trying to produce a message that is pleasing to a human. It has no concern about anything else in the world other than whether the text it produces is approved.
So obviously, if you were doing something where the model is carrying out a long sequence of actions involving tools and everything, then it might have some incentive to do a lot of wacky things that wouldn't make sense to a human in the process of producing its final result. But I guess it wouldn't necessarily have an incentive to do anything other than produce a very high-quality output at the end.
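To make that contrast concrete, here is a minimal, purely illustrative sketch (all names are hypothetical) of the two reward setups being described: chat-style RLHF, where the reward depends only on whether a single message pleases a rater, versus a long-horizon agent, where only the final result is scored and the intermediate tool-using steps are optimized purely as a means to that end.

```python
# Hypothetical sketch of the two reward setups; names are made up and this
# is not a description of any particular lab's training stack.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    action: str   # e.g. a tool call or an emitted message
    result: str   # observation returned by the environment

def rlhf_reward(score_message: Callable[[str, str], float],
                prompt: str, message: str) -> float:
    # Reward depends only on whether this one message pleases the rater.
    return score_message(prompt, message)

def long_horizon_reward(score_outcome: Callable[[str], float],
                        trajectory: List[Step]) -> float:
    # Reward depends only on the final output; every intermediate action
    # is shaped solely by how it contributes to that end result.
    final_output = trajectory[-1].result if trajectory else ""
    return score_outcome(final_output)
```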