John Schulman

👤 Speaker
528 total appearances

[Chart: Appearances Over Time]

Podcast Appearances

Dwarkesh Podcast
John Schulman (OpenAI Cofounder) - Reasoning, RLHF, & Plan for 2027 AGI

You'd want to have some combination: the model itself seems really well behaved, with impeccable performance and moral compass and everything, and you're pretty confident that it's extremely resistant to any kind of takeover attempt or severe misuse. And then you'd also want to have really good monitoring on top of it, so you could detect any kind of trouble.

What are you keeping track of while you're doing long-horizon RL, or when you eventually start doing it, so that you could notice this sort of discontinuous jump before you deploy these systems broadly?

I would say you would want to have a lot of evals that you're running during the training process.
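
For concreteness, a minimal sketch of what running evals during the training process could look like. The callables and eval-suite names here are hypothetical placeholders, not any particular lab's tooling:

```python
from typing import Callable, Dict, List, Tuple

def training_loop(
    train_step: Callable[[], None],               # one optimizer step (hypothetical)
    eval_suites: Dict[str, Callable[[], float]],  # eval name -> scoring function (hypothetical)
    num_steps: int,
    eval_every: int = 1000,
) -> List[Tuple[int, Dict[str, float]]]:
    """Run the eval suites periodically *during* training, not only at the
    end, so misbehavior or capability changes surface mid-run."""
    history = []
    for step in range(num_steps):
        train_step()
        if step % eval_every == 0:
            scores = {name: run() for name, run in eval_suites.items()}
            history.append((step, scores))
    return history
```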

And what specifically?

How would you notice something like that?

You'd want to be pretty careful when you do this kind of training if you see a lot of potentially scary capabilities, or if those seem close.

I mean, I would say it's not something we have to be scared of right now, because right now it's hard to get the models to do anything coherent.

But if they started to get really good, I think...

Yeah, I think we would have to take some of these questions seriously, and we would want to have a lot of evals that test them for misbehavior, or, I guess, for the alignment of the models.

We would want to check that they're not going to turn against us or something.

But you might also want to look for discontinuous jumps in capabilities, so you'd want to have lots of evals for the capabilities of the models.
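
One simple way to flag such discontinuous jumps, assuming the per-checkpoint score history produced by the sketch above; the threshold is an illustrative placeholder, and a real monitor would smooth over eval noise rather than compare only adjacent points:

```python
from typing import Dict, List, Tuple

def flag_discontinuous_jumps(
    history: List[Tuple[int, Dict[str, float]]],
    jump_threshold: float = 0.15,
) -> List[Tuple[int, str, float]]:
    """Return (step, eval_name, delta) wherever a score rises by more than
    `jump_threshold` between consecutive checkpoints."""
    flags = []
    for (_, prev), (step, cur) in zip(history, history[1:]):
        for name, score in cur.items():
            delta = score - prev.get(name, score)
            if delta > jump_threshold:
                flags.append((step, name, delta))
    return flags
```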

I mean, I guess you'd also want to make sure that whatever you're training on doesn't give the model any reason to turn against you, which, I would say, doesn't seem like the hardest thing to do.

I mean, the way we train them with RLHF, even though the models are very smart, does feel very safe, because the model is just trying to produce a message that is pleasing to a human.

And it has no concern about anything else in the world other than whether the text it produces is approved.
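
This "pleasing to a human" framing matches the standard RLHF objective (e.g. as described in the InstructGPT paper): maximize a learned reward model's approval minus a KL penalty that keeps the policy close to the pre-RLHF reference model. A minimal sketch under those standard assumptions, not OpenAI's actual implementation; the coefficient is illustrative:

```python
import torch

def rlhf_reward(
    reward_model_score: torch.Tensor,  # r(x, y): learned reward model's score per sample
    logprobs_policy: torch.Tensor,     # log pi(y|x) under the policy being trained
    logprobs_reference: torch.Tensor,  # log pi_ref(y|x) under the frozen pre-RLHF model
    kl_coef: float = 0.1,              # illustrative penalty strength
) -> torch.Tensor:
    """Per-sample RLHF objective: reward-model approval minus a KL penalty.
    The only thing being maximized is whether the output is approved."""
    kl = logprobs_policy - logprobs_reference
    return reward_model_score - kl_coef * kl
```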

So obviously, if you were doing something where the model is carrying out a long sequence of actions involving tools and everything, then it might have some incentive to do a lot of wacky things that wouldn't make sense to a human in the process of producing its final result.

But I guess it wouldn't necessarily have an incentive to do anything other than produce a very high-quality output at the end.