
Joe Carlsmith

👤 Person
1218 total appearances

Podcast Appearances

Dwarkesh Podcast
Joe Carlsmith - Otherness and control in the age of AGI

And hopefully we'd get one before we reach this scenario, but like, okay, so here are the kind of five, um, five categories of like motivations the model could have.

And this hopefully maybe gets at this point about like, what does the model eventually do?

Okay.

So one category is, uh, just like something super alien that has, you know, it's sort of like, oh, there's some weird correlate of easy-to-predict text, or like there's some weird aesthetic for data structures that like the model, you know, early on in pre-training or maybe now, it's like developed, that it like, you know, it really thinks things should kind of be like this.

There's something that's like quite alien to our cognition where we just like wouldn't recognize this as a thing at all, right?

Another category is something, a kind of crystallized instrumental drive that is more recognizable to us.

So you can imagine, like, AIs that develop, let's say, some, like, curiosity drive because that's, like, broadly useful. You mentioned, like, oh, it's got different heuristics, different like drives, different kind of things that are kind of like values, um, and some of those might be actually somewhat similar to things that were, um, useful to humans and that ended up part of our terminal values in various ways. So, you know, you can imagine curiosity, you can imagine like various types of, of option value, like maybe it really wants it intrinsically, maybe it values power itself, um.

It could value like survival or some analog of survival.

Those are possibilities too that could have been rewarded as sort of proxy drives at various stages of this process and that kind of made their way into the model's kind of terminal criteria.

A third category is some analog of reward where the model at some point has sort of part of its motivational system has fixated on a component of the reward process, right?

Like the humans approving of me, or like numbers getting entered in this data center, or like gradient descent doing, you know, updating me in this direction or something like that.

There's something in the reward process such that as it was trained, it's focusing on that thing and like, I really want the reward process to give me reward.

But in order for it to be of the type where then getting reward like motivates choosing the takeover option, it also needs to generalize such that its concern for reward has some sort of like long time horizon element.

So it like not only wants reward, it wants to like protect the reward button for like some long period or something.

Another one is like some kind of messed up interpretation of some human-like concept.