Trenton Bricken

Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

So it's like, as our models continue to get more capable, having them assign labels or run some of these experiments at scale.

Dwarkesh Podcast

And then with respect to, if there's superhuman performance, how do you detect it, which I think was kind of the last part of your question.

Aside from the cop-out answer, if we buy this "associations all the way down" view, you should be able to coarse-grain the representations at a certain level such that they then make sense.

I think it was even in Demis's podcast, he's talking about if a chess player makes a superhuman move, they should be able to distill it into reasons why they did it.

And even if the model's not gonna tell you what it is, you should be able to decompose that complex behavior into simpler circuits or features to really start to make sense of why it did the thing that it did.

Yes and no.

So like we are actively trying to use dictionary learning now on the sleeper agents work, which we talked about earlier.

And it's like, if I just give you a model, can you tell me if there's this trigger and it's going to start doing interesting behavior?

And it's an open question whether or not, when it learns that behavior, it's part of a more general circuit that we can pick up on without actually getting activations for and having it display that behavior, right?

Because that would kind of be cheating then.

Or if it's learning some hacky trick, like that's a separate circuit that you'll only pick up on if you actually have it do that behavior.

But even in that case, the geometry of features gets really interesting because fundamentally each feature is in some part of your representation space.

And they all exist with respect to each other.

And so in order to have this new behavior, you need to carve out some subset of the feature space for the new behavior and then push everything else out of the way to make space for it.

So hypothetically, you can imagine you have your model before you've taught it this bad behavior, you know all the features or have some coarse-grained representation of them.

You then fine-tune it such that it becomes malicious.
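
[Editor's sketch, not from the episode] The before-and-after setup described here can be illustrated with a toy comparison of feature directions. This is only a rough sketch under loud assumptions: the feature names, the three-dimensional vectors, and the idea that features can be matched by name across the two models are all hypothetical, and real dictionary-learning features would need a separate matching step. It measures how far each shared feature direction moved (one minus cosine similarity) and lists features that appear only after fine-tuning.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def feature_drift(before, after):
    """Map each feature present in both dictionaries to 1 - cosine similarity."""
    return {
        name: 1.0 - cosine(before[name], after[name])
        for name in before
        if name in after
    }

def new_features(before, after):
    """Names of features that exist only in the fine-tuned model."""
    return sorted(name for name in after if name not in before)

# Hypothetical feature directions before fine-tuning (toy values).
before = {
    "python_code": [1.0, 0.0, 0.0],
    "french_text": [0.0, 1.0, 0.0],
}

# Hypothetical directions after fine-tuning: one feature has been pushed
# aside and a new direction has been carved out (toy values).
after = {
    "python_code": [1.0, 0.0, 0.0],
    "french_text": [0.0, 0.8, 0.6],
    "hypothetical_trigger": [0.0, 0.0, 1.0],
}

drift = feature_drift(before, after)
added = new_features(before, after)

for name, amount in sorted(drift.items()):
    print(f"{name}: drift {amount:.3f}")
print("new features:", added)
```

In this toy setup the unchanged feature shows no drift, the displaced one shows a nonzero drift, and the new direction is reported separately. Whether real fine-tuning leaves a signal this clean is exactly the open question raised in the conversation.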
