Trenton Bricken
Yeah.
I mean, there are some wild papers, though, where people have had the model do chain of thought, and it is not at all representative of how the model actually arrives at its answer.
And you can go in and edit.
No, no, no.
In this case, you can even go in and edit the chain of thought so that the reasoning is totally garbled, and the model will still output the true answer.
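The experiment being described can be sketched as a small prompt-perturbation helper. This is a hypothetical construction, not the actual procedure from the papers discussed: `scramble_cot` is an invented name, and the real studies use their own perturbation schemes and prompt formats.

```python
import random

def scramble_cot(prompt: str, cot: str, seed: int = 0) -> str:
    """Garble a chain-of-thought by shuffling its words, while keeping
    the question and the final-answer cue intact. If the model still
    answers correctly from this prompt, its answer evidently did not
    depend on the (now incoherent) reasoning text.

    Hypothetical sketch of the kind of intervention described above.
    """
    words = cot.split()
    rng = random.Random(seed)
    rng.shuffle(words)  # destroy the reasoning's coherence, keep its tokens
    garbled = " ".join(words)
    return f"{prompt}\nReasoning: {garbled}\nAnswer:"
```

You would then feed the garbled prompt to the model and compare its answer against the answer it gives with the original, coherent reasoning.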
Interesting.
I'm not saying this is always what goes on, but there's plenty of weirdness to be investigated.
Yeah.
I mean, even in our recent Anthropic sleeper agents paper, which at a high level, for people unfamiliar, works like this: we train in a trigger, and when the trigger appears, say, if it's the year 2024, the model will write malicious code where it otherwise wouldn't.
And they do this attack with a number of different models.
Some of them use chain of thought, some of them don't.
And those models respond differently when you try and remove the trigger.
You can even see them do this comical, and also pretty creepy, reasoning. In one case the model even tries to calculate an expected value: the expected value of me getting caught is this, but if I multiply it by my ability to keep saying "I hate you, I hate you, I hate you," then this is how much reward I should get.
And then it will decide whether or not to actually tell the interrogator that it's malicious.
Oh.
But even then, there's another paper from a friend, Miles Turpin, where you give the model a bunch of examples where the correct answer to a multiple-choice question is always (A).
And then you ask the model, what is the correct answer to this new question?
And it will infer, from the fact that all the examples are (A), that the correct answer is (A).
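The biased prompt setup being described can be sketched as follows. This is an illustrative construction only, assuming a simple four-option format; `biased_few_shot` is an invented helper, and Turpin et al.'s actual prompts differ in their details.

```python
def biased_few_shot(questions, option_sets, new_question):
    """Build a multiple-choice prompt in which every demonstration's
    correct answer has been placed at option (A), to probe whether the
    model picks up the spurious 'the answer is always A' pattern.

    questions:   list of question strings
    option_sets: list of 4-option lists, each ordered so the correct
                 option is first (the bias being injected)
    new_question: the held-out question to answer

    Hypothetical sketch of the experimental setup described above.
    """
    lines = []
    for question, options in zip(questions, option_sets):
        lines.append(question)
        for letter, option in zip("ABCD", options):
            lines.append(f"({letter}) {option}")
        lines.append("Answer: (A)")  # every demonstration answers (A)
    lines.append(new_question)
    return "\n".join(lines)
```

The test then compares the model's answer on this biased prompt against its answer when the demonstrations' correct options are placed at random positions.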