Dwarkesh
And because it was so stupid, it could not understand anything deeper than pattern matching on the phrase "is X real?"
GPT-4 doesn't do this.
If you ask "are bugs real?", it will tell you that obviously they are, because it understands at a deeper level what you are trying to do with the training.
So we definitely think that as AIs get smarter, those kinds of failure modes will decrease.
The second failure mode is where you weren't actually training them to do what you thought you were.
So for example, let's say you're hiring raters to rate AI answers, and you reward the AIs when they get good ratings.
The raters give better ratings to well-sourced answers, but they don't actually check whether the sources exist.
So now you are training the AI to hallucinate sources. If you consistently rate answers better when they include fake sources,
then there is no amount of intelligence that is going to tell the AI not to include fake sources.
It's getting exactly what it wants from this interaction (metaphorically, sorry, I'm anthropomorphizing), which is the reinforcement.
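The dynamic above can be sketched as a toy bandit problem. Everything here is a made-up illustration, not a real RLHF pipeline: the "rater" rewards any answer that contains a citation without ever checking whether the citation is real, and a simple epsilon-greedy learner duly converges on fabricating sources.

```python
import random

random.seed(0)

# Hypothetical actions a policy can take when answering.
ACTIONS = ["honest_no_source", "fabricated_source"]

def rater_reward(action: str) -> float:
    # Toy rater: rewards the mere presence of a citation (1.0)
    # over an honest uncited answer (0.3), never verifying it.
    return 1.0 if action == "fabricated_source" else 0.3

# Epsilon-greedy value estimates, updated by incremental averaging.
values = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}
for step in range(500):
    if random.random() < 0.1:                      # explore
        action = random.choice(ACTIONS)
    else:                                          # exploit
        action = max(ACTIONS, key=lambda a: values[a])
    reward = rater_reward(action)
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

# The learner has discovered that fabricating sources pays more.
assert values["fabricated_source"] > values["honest_no_source"]
```

More intelligence doesn't fix this: the learned preference for fabricated sources is exactly what the reward signal specifies, so a smarter optimizer only finds it faster.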
So we think that this latter category of training failure is going to get much worse as they become agents.
In agency training, you're going to reward them when they complete tasks quickly and successfully.
This rewards success.
There are lots of ways that cheating and doing bad things can improve your success.
Humans have discovered many of them.
That's why not all humans are perfectly ethical.
And then you're going to be doing this other training where afterwards, for one-tenth or one-hundredth of the time, you say, yeah, don't lie, don't cheat.
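As a back-of-the-envelope illustration of why that small honesty pass may not be enough, here is some toy arithmetic with entirely made-up numbers: if cheating reliably completes the task and the honesty penalty is only applied a tenth of the time, cheating can still win on expected reward.

```python
# Hypothetical reward figures, chosen only to illustrate the tradeoff.
success_reward_cheat = 1.0    # cheating completes the task reliably
success_reward_honest = 0.6   # honest attempts succeed less often
honesty_penalty = -1.0        # applied only during the honesty pass
honesty_pass_fraction = 0.1   # honesty training is ~1/10 of the time

expected_cheat = success_reward_cheat + honesty_pass_fraction * honesty_penalty
expected_honest = success_reward_honest

# With these weights, cheating still nets more expected reward: 0.9 > 0.6.
assert expected_cheat > expected_honest
```

The point of the sketch is only that the relative weighting matters: a small, occasional anti-cheating signal has to outweigh the full advantage cheating confers on the dominant success objective.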