Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Dwarkesh

๐Ÿ‘ค Speaker
1735 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

And because it was so stupid, it could not understand โ€“

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

anything deeper than like pattern matching on the phrase is X real?

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

JPT-4 doesn't do this.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

If you ask are bugs real, it will tell you obviously they are because it understands kind of on a deeper level what you are trying to do with the training.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

So we definitely think that as AIs get smarter, those kind of failure modes will decrease.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

The second one is where you weren't training them to do what you thought.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

So for example, let's say

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

You're hiring these raters to rate AI answers.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

You reward them when they get good ratings.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

The raters reward them when they have a well-sourced answer, but the raters don't really check whether the sources actually exist or not.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

So now you are training the AI to hallucinate sources, and if you consistently rate them better when they have the fake sources,

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

then there is no amount of intelligence which is going to tell them not to have the fake sources.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

They're getting exactly what they want from this interaction, metaphorically, sorry I'm anthropomorphizing, which is the reinforcement.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

So we think that this latter category of training failure is going to get much worse as they become agents.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

Agency training, you're going to reward them when they complete tasks quickly and successfully.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

This rewards success.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

There are lots of ways that cheating and doing bad things can improve your success.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

Humans have discovered many of them.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

That's why not all humans are perfectly ethical.

Dwarkesh Podcast
2027 Intelligence Explosion: Month-by-Month Model โ€” Scott Alexander & Daniel Kokotajlo

And then you're going to be doing this alternative training where afterwards for one-tenth or one-one-hundredth of the time, like, yeah, don't lie, don't cheat.