Menu
Sign In Search Podcasts Charts People & Topics Add Podcast API Blog Pricing

Trenton Bricken

๐Ÿ‘ค Speaker
See mentions of this person in podcasts
1589 total appearances

Appearances Over Time

Podcast Appearances

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Yeah.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

I mean, there are some wild papers, though, where people have had the model do chain of thought and it is not at all representative of what the model actually decides its answer is.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And you can go in and edit.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

No, no, no.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

In this case, like you can even go in and edit the chain of thought so that the reasoning is like totally garbled and it will still output the true answer.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Interesting.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

I'm not saying this is always what goes on, but there's plenty of weirdness to be investigated.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Yeah.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

I mean, even in our Anthropix recent sleeper agents paper, which at a high level for people unfamiliar is basically I train in a trigger word.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And when I say it, like if I say if it's the year 2024, the model will write malicious code instead of otherwise.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And they do this attack with a number of different models.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Some of them use chain of thought, some of them don't.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And those models respond differently when you try and remove the trigger.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

You can even see them do this like comical reasoning that's also pretty creepy and like where it's like, oh, well, it even tries to calculate in one case an expected value of like, well, the expected value of me getting caught is this.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

But then if I multiply it by the ability for me to keep saying, I hate you, I hate you, I hate you, then this is how much reward I should get.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And then it will decide whether or not to actually tell the interrogator that it's malicious or not.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Oh.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

But even, I mean, there's another paper from a friend, Miles Turpin, where you ask the model to, you give it like a bunch of examples of where like the correct answer is always A for multiple choice questions.

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And then you ask the model, what is the correct answer to this new question?

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

And it will infer from the fact that all the examples are A, that the correct answer is A.