Dwarkesh
And because it was so stupid, it could not understand anything deeper than pattern matching on the phrase "is X real?"
GPT-4 doesn't do this.
If you ask "are bugs real?", it will tell you that obviously they are, because it understands at a deeper level what you are trying to do with the training.
So we definitely think that as AIs get smarter, those kinds of failure modes will decrease.
The second failure mode is where you weren't actually training them to do what you thought you were.
So for example, let's say you're hiring raters to rate AI answers, and you reward the AIs when they get good ratings.
The raters give better ratings to well-sourced answers, but they don't actually check whether the sources exist.
So now you are training the AI to hallucinate sources. If you consistently rate answers better when they include fake sources,
then there is no amount of intelligence that is going to tell the AI not to include fake sources.
It's getting exactly what it wants from this interaction (metaphorically, sorry, I'm anthropomorphizing), which is the reinforcement.
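The dynamic above can be sketched as a toy bandit problem. Everything here is a made-up illustration, not a real RLHF pipeline: the "rater" rewards any answer that contains a citation without ever checking whether the citation is real, and a simple epsilon-greedy learner duly converges on fabricating sources.

```python
import random

random.seed(0)

# Hypothetical actions a policy can take when answering.
ACTIONS = ["honest_no_source", "fabricated_source"]

def rater_reward(action: str) -> float:
    # Toy rater: rewards the mere presence of a citation (1.0)
    # over an honest uncited answer (0.3), never verifying it.
    return 1.0 if action == "fabricated_source" else 0.3

# Epsilon-greedy value estimates, updated by incremental averaging.
values = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}
for step in range(500):
    if random.random() < 0.1:                      # explore
        action = random.choice(ACTIONS)
    else:                                          # exploit
        action = max(ACTIONS, key=lambda a: values[a])
    reward = rater_reward(action)
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

# The learner has discovered that fabricating sources pays more.
assert values["fabricated_source"] > values["honest_no_source"]
```

More intelligence doesn't fix this: the learned preference for fabricated sources is exactly what the reward signal specifies, so a smarter optimizer only finds it faster.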
So we think that this latter category of training failure is going to get much worse as they become agents.
In agency training, you're going to reward them when they complete tasks quickly and successfully.
This rewards success.
There are lots of ways that cheating and doing bad things can improve your success.
Humans have discovered many of them.
That's why not all humans are perfectly ethical.
And then you're going to be doing this other training where afterwards, for one-tenth or one-hundredth of the time, you say, yeah, don't lie, don't cheat.
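As a back-of-the-envelope illustration of why that small honesty pass may not be enough, here is some toy arithmetic with entirely made-up numbers: if cheating reliably completes the task and the honesty penalty is only applied a tenth of the time, cheating can still win on expected reward.

```python
# Hypothetical reward figures, chosen only to illustrate the tradeoff.
success_reward_cheat = 1.0    # cheating completes the task reliably
success_reward_honest = 0.6   # honest attempts succeed less often
honesty_penalty = -1.0        # applied only during the honesty pass
honesty_pass_fraction = 0.1   # honesty training is ~1/10 of the time

expected_cheat = success_reward_cheat + honesty_pass_fraction * honesty_penalty
expected_honest = success_reward_honest

# With these weights, cheating still nets more expected reward: 0.9 > 0.6.
assert expected_cheat > expected_honest
```

The point of the sketch is only that the relative weighting matters: a small, occasional anti-cheating signal has to outweigh the full advantage cheating confers on the dominant success objective.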