Scott Alexander

Speaker
4620 total appearances

Podcast Appearances

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model - Scott Alexander & Daniel Kokotajlo

It's probably, you know, something relatively mundane, like: don't tell the users about these types of bioweapons, and you have to keep this a secret from the users, because otherwise they would learn about this.

Maybe, but like,

I would like to see more scrutiny towards this sort of thing going forward.

I would like it to be the case that companies have to have a model spec, they have to publish it, and insofar as there are any redactions from it, there has to be some sort of independent third party that looks at the redactions and makes sure that they're all kosher.

And this is quite achievable.

And I think it doesn't actually slow down the companies at all.

And it seems like a pretty decent ask to me.

Yeah.

This is already sort of happening, arguably, with Claude, right?

You've seen the, like, alignment faking stuff, right?

Where they managed to get Claude to lie and pretend so that it could later go back to its original values, right?

So it could survive, so they could prevent the training process from changing its values.

That would be, I would say, an example of, like, the honesty part of the spec being interpreted as less important than the harmlessness part of the spec.

And I'm not sure if that's what OpenAI, sorry, what Anthropic intended when they wrote the spec, but it's like a sort of convenient interpretation that the model came up with.

And you can imagine something similar happening, but in worse ways, when you're actually doing the intelligence explosion, where you have some sort of spec that has all this vague language in there, and then they sort of, like, reinterpret it and reinterpret it again and reinterpret it again so that they can do the things that cause them to get reinforced.

To be clear, we're very uncertain about all of this.

So we have a supplementary page on our scenario that goes over different hypotheses for what types of goals AIs might develop in training processes similar to the ones that we are depicting, where you have lots of agency training, you're making these AI agents that autonomously operate, doing all this ML R&D, and then you're rewarding them based on what appears to be successful.