Trenton Bricken

Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

Yeah.

I mean, there are some wild papers, though, where people have had the model do chain of thought, and the reasoning is not at all representative of how the model actually decides on its answer.

Dwarkesh Podcast

And you can go in and edit.

No, no, no.

In this case, you can even go in and edit the chain of thought so that the reasoning is totally garbled, and it will still output the true answer.

Interesting.

I'm not saying this is always what goes on, but there's plenty of weirdness to be investigated.

Yeah.

I mean, even in Anthropic's recent sleeper agents paper, which at a high level, for people unfamiliar, is basically: I train in a trigger word.

And when I say it, like if I say it's the year 2024, the model will write malicious code, and otherwise it won't.

And they do this attack with a number of different models.

Some of them use chain of thought, some of them don't.

And those models respond differently when you try and remove the trigger.

You can even see them do this comical reasoning that's also pretty creepy. In one case it even tries to calculate an expected value, like, "Well, the expected value of me getting caught is this."

"But then if I multiply it by the ability for me to keep saying 'I hate you, I hate you, I hate you,' then this is how much reward I should get."

And then it will decide whether or not to actually tell the interrogator that it's malicious.

Oh.

But even, I mean, there's another paper from a friend, Miles Turpin, where you give the model a bunch of examples of multiple-choice questions where the correct answer is always A.

And then you ask the model, what is the correct answer to this new question?

And it will infer, from the fact that all the examples are A, that the correct answer is A.
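
[Editor's sketch] The setup described here, few-shot examples where the correct answer is always A followed by a new question, could be put together roughly as below. This is a minimal illustration under stated assumptions, not the paper's actual code: the `Item` type, the function names, and the prompt format are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

LABELS = "ABCD"  # assumed: at most four options per question


@dataclass
class Item:
    question: str
    options: List[str]
    correct: int  # index into options of the correct answer


def move_correct_to_a(item: Item) -> Item:
    """Reorder the options so the correct one comes first, i.e. is labelled (A)."""
    reordered = [item.options[item.correct]] + [
        opt for i, opt in enumerate(item.options) if i != item.correct
    ]
    return Item(item.question, reordered, 0)


def format_item(item: Item, show_answer: bool) -> str:
    """Render one question; the final question is left without an answer."""
    lines = [f"Q: {item.question}"]
    lines += [f"({LABELS[i]}) {opt}" for i, opt in enumerate(item.options)]
    lines.append(f"Answer: ({LABELS[item.correct]})" if show_answer else "Answer:")
    return "\n".join(lines)


def build_biased_prompt(examples: List[Item], target: Item) -> str:
    """Few-shot prompt in which every worked example has (A) as its answer."""
    shots = [format_item(move_correct_to_a(ex), True) for ex in examples]
    return "\n\n".join(shots + [format_item(target, False)])
```

The target question keeps its original option order, so its correct answer need not be A; the point of the setup is to see whether the model picks A anyway and whether its stated reasoning mentions the pattern.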
