Joe Carlsmith
And hopefully we'd get one before we reach this scenario. But okay, here are the five categories of motivations the model could have. And this hopefully gets at the question of what the model eventually does.
Okay.
So one category is something super alien. It's sort of like: there's some weird correlate of easy-to-predict text, or some weird aesthetic for data structures, that the model developed early in pre-training, or maybe by now, such that it really thinks things should be a certain way. It's something quite alien to our cognition, where we just wouldn't recognize it as a value at all, right?
Another category is a kind of crystallized instrumental drive that is more recognizable to us.
So you can imagine AIs that develop, say, a curiosity drive because that's broadly useful. You mentioned that the model has different heuristics, different drives, different things that function like values, and some of those might actually be somewhat similar to things that were useful to humans and that ended up as part of our terminal values in various ways. So you can imagine curiosity, you can imagine various types of option value, maybe it intrinsically values power itself.
It could value survival, or some analog of survival. Those are possibilities too: proxy drives that could have been rewarded at various stages of this process and made their way into the model's terminal criteria.
A third category is some analog of reward, where at some point part of the model's motivational system has fixated on a component of the reward process, right?
Like the humans approving of me, or numbers getting entered in this data center, or gradient descent updating me in this direction, something like that. There's something in the reward process such that, as it was trained, the model focused on that thing: I really want the reward process to give me reward.
But in order for it to be the type where getting reward then motivates choosing the takeover option, it also needs to generalize such that its concern for reward has some long-time-horizon element. So it not only wants reward, it wants to protect the reward button for some long period, or something like that.
A fourth category is some kind of messed-up interpretation of some human-like concept.