John Schulman
You'd want to have some combination of: the model itself seems really well behaved and has an impeccable moral compass and everything, and you're pretty confident it's extremely resistant to any kind of takeover attempt or severe misuse; and then you'd also want really good monitoring on top of it, so you could detect any kind of trouble. What are you keeping track of while you're doing long-horizon RL, or when you eventually start doing it, so that you could notice this sort of discontinuous jump before you deployed these systems broadly?
I would say you would want to have a lot of evals that you're running during the training process.
And what specifically?
How would you notice something like that?
You'd want to be pretty careful when you do this kind of training if you see a lot of potentially scary capabilities, or if those seem close.
I mean, I would say it's not something we have to be scared of right now, because it's currently hard to get the models to do anything coherent.
But if they started to get really good, I think...
Yeah, I think we would have to take some of these questions seriously, and we would want to have a lot of evals that test the models for misbehavior, or I guess for the alignment of the models.
We would want to check that they're not going to turn against us or something.
But you might also want to look for discontinuous jumps in capabilities, so you'd want lots of evals for the capabilities of the models.
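Concretely, a minimal sketch of what running evals during training and watching for discontinuous jumps could look like, with made-up eval names, scores, and thresholds (purely illustrative, not any lab's actual tooling):

```python
# Hypothetical sketch: run a battery of alignment and capability evals at
# each RL checkpoint and flag any discontinuous jump relative to the
# recent trend. Eval names, scores, and thresholds are all made up.
from statistics import mean

def check_for_jumps(history, new_scores, window=5, jump_threshold=0.15):
    """Compare each eval's new score against its recent moving average.

    history: dict mapping eval name -> list of past scores (0-1 scale).
    new_scores: dict mapping eval name -> score at the latest checkpoint.
    Returns a list of (eval_name, previous_avg, new_score) for flagged evals.
    """
    flagged = []
    for name, score in new_scores.items():
        past = history.setdefault(name, [])
        if len(past) >= window:
            baseline = mean(past[-window:])
            if score - baseline > jump_threshold:
                flagged.append((name, baseline, score))
        past.append(score)
    return flagged

# Example usage with fabricated checkpoint scores.
history = {}
for step, scores in enumerate([
    {"deception_probe": 0.10, "autonomy_tasks": 0.20},
    {"deception_probe": 0.11, "autonomy_tasks": 0.22},
    {"deception_probe": 0.12, "autonomy_tasks": 0.21},
    {"deception_probe": 0.10, "autonomy_tasks": 0.23},
    {"deception_probe": 0.11, "autonomy_tasks": 0.24},
    {"deception_probe": 0.12, "autonomy_tasks": 0.55},  # discontinuous jump
]):
    for name, baseline, score in check_for_jumps(history, scores):
        print(f"step {step}: {name} jumped from ~{baseline:.2f} to {score:.2f}")
```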
I mean, I guess you'd also want to make sure that whatever you're training on doesn't give the model any reason to turn against you, which I would say doesn't seem like the hardest thing to do.
The way we train them with RLHF does feel very safe, even though the models are very smart, because the model is just trying to produce a message that is pleasing to a human. It has no concern about anything else in the world other than whether the text it produces is approved.
So obviously, if you were doing something where the model is carrying out a long sequence of actions involving tools and everything, then it might have some incentive to do a lot of wacky things that wouldn't make sense to a human in the process of producing its final result. But I guess it wouldn't necessarily have an incentive to do anything other than produce a very high-quality output at the end.
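To make that contrast concrete, here is a minimal, purely illustrative sketch (all names are hypothetical) of the two reward setups being described: chat-style RLHF, where the reward depends only on whether a single message pleases a rater, versus a long-horizon agent, where only the final result is scored and the intermediate tool-using steps are optimized purely as a means to that end.

```python
# Hypothetical sketch of the two reward setups; names are made up and this
# is not a description of any particular lab's training stack.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    action: str   # e.g. a tool call or an emitted message
    result: str   # observation returned by the environment

def rlhf_reward(score_message: Callable[[str, str], float],
                prompt: str, message: str) -> float:
    # Reward depends only on whether this one message pleases the rater.
    return score_message(prompt, message)

def long_horizon_reward(score_outcome: Callable[[str], float],
                        trajectory: List[Step]) -> float:
    # Reward depends only on the final output; every intermediate action
    # is shaped solely by how it contributes to that end result.
    final_output = trajectory[-1].result if trajectory else ""
    return score_outcome(final_output)
```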