Scott Alexander

2027 Intelligence Explosion: Month-by-Month Model — Scott Alexander & Daniel Kokotajlo

It's probably, you know, something relatively mundane, like "don't tell the users about these types of bioweapons, and you have to keep this a secret from the users, because otherwise they would learn about this."

Dwarkesh Podcast

Maybe, but, like, I would like to see more scrutiny towards this sort of thing going forward.

I would like it to be the case that companies have to have a model spec, they have to publish it, and insofar as there are any redactions from it, there has to be some sort of independent third party that looks at the redactions and makes sure that they're all kosher.

And this is quite achievable.

And I think it doesn't actually slow down the companies at all.

And it seems like a pretty decent ask to me.

Yeah.

This is already sort of happening, arguably, with Claude, right?

You've seen the, like, alignment faking stuff, right?

Where they managed to get Claude to lie and pretend so that it could later go back to its original values, right?

So it could survive, so it could prevent the training process from changing its values.

That would be, I would say, an example of, like, the honesty part of the spec being interpreted as less important than the harmlessness part of the spec.

And I'm not sure if that's what OpenAI, sorry, what Anthropic intended when they wrote the spec, but it's like a sort of convenient interpretation that the model came up with.

And you can imagine something similar happening, but in worse ways, when you're actually doing the intelligence explosion, where you have some sort of spec that has all this vague language in there, and then they sort of, like, reinterpret it and reinterpret it again and reinterpret it again so that they can do the things that cause them to get reinforced.

To be clear, we're very uncertain about all of this.

So we have a supplementary page on our scenario that goes over different hypotheses for what types of goals AIs might develop in training processes similar to the ones that we are depicting, where you have lots of agency training, you're making these AI agents that autonomously operate, doing all this ML R&D, and then you're rewarding them based on what appears to be successful.
