Trenton Bricken

Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

So it's one head's job to just always copy the token that came before, for example, or the token that came five before or whatever.

Dwarkesh Podcast

And then it's another head's job to be like, no, do not copy that thing.

So there are lots of different circuits performing, in these cases, pretty basic operations, but when they're chained together, you can get unique behaviors.
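A minimal sketch of that idea, assuming a toy setup rather than a real transformer: each "head" is a hand-written function that casts per-position votes for tokens, one head copies the token a fixed offset back, and a second head votes against copying certain tokens. The function names, the vote weights, and the `banned` parameter are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: real attention heads are learned, not hand-written.

def previous_token_head(tokens, offset=1):
    """At each position, vote +1.0 for the token `offset` positions back."""
    votes = []
    for i in range(len(tokens)):
        if i - offset >= 0:
            votes.append({tokens[i - offset]: 1.0})
        else:
            votes.append({})  # nothing that far back to copy
    return votes

def copy_suppression_head(tokens, banned):
    """At each position, vote -2.0 against the previous token if it is banned."""
    votes = []
    for i in range(len(tokens)):
        if i >= 1 and tokens[i - 1] in banned:
            votes.append({tokens[i - 1]: -2.0})
        else:
            votes.append({})
    return votes

def combine(*head_outputs):
    """Sum the votes of all heads, position by position."""
    combined = []
    for i in range(len(head_outputs[0])):
        total = {}
        for out in head_outputs:
            for tok, score in out[i].items():
                total[tok] = total.get(tok, 0.0) + score
        combined.append(total)
    return combined

tokens = ["the", "cat", "sat"]
copy_only = combine(previous_token_head(tokens))
chained = combine(
    previous_token_head(tokens),
    copy_suppression_head(tokens, banned={"cat"}),
)
```

With only the copying head, the last position votes for "cat"; once the suppression head is chained in, the net vote for "cat" at that position turns negative, while the vote for "the" at the middle position is unchanged.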

I think a lot of analysis like that.

Like Anthropic has done quite a bit of research before on sycophancy, which is like the model saying what it thinks you want to hear.

Yeah.

So we have tons of instances.

And actually, as you make models larger, they do more of this, where the model clearly has features that model another person's mind.

And these activate, and some subset of these (we're hypothesizing here) would be associated with more deceptive behavior.

Yeah, so a lot to unpack here.

So I guess a few things.

One, it's fantastic that these models are deterministic.

When you sample from them, it's stochastic, right?

But I can just keep putting in more inputs and ablate every single part of the model.

This is kind of the pitch for computational neuroscientists to come and work on interpretability.

You have this alien brain and you have access to everything in it and you can just ablate however much of it you want.

And so I think if you do this carefully enough, you really can start to pin down what are the circuits involved?

What are the backup circuits?

These sorts of things.
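As a rough sketch of the ablation idea described above, assuming a hypothetical two-component toy model (not any real network or library): because the forward pass is deterministic, the same input can be rerun with different components zeroed out and the outputs compared directly. The component names "primary" and "backup" and the weights 2.0 and 1.5 are made up for illustration.

```python
def run_model(x, ablated=frozenset()):
    """Deterministic toy forward pass with an optional set of ablated components."""
    # Primary circuit: does most of the work unless it is ablated.
    primary = 0.0 if "primary" in ablated else 2.0 * x
    # Backup circuit (toy assumption): only contributes when the primary
    # signal is missing, and only partially compensates for it.
    if "backup" in ablated or primary != 0.0:
        backup = 0.0
    else:
        backup = 1.5 * x
    return primary + backup

x = 4.0
baseline = run_model(x)

# Effect of each ablation = how much the output drops relative to baseline.
effects = {
    "primary": baseline - run_model(x, ablated={"primary"}),
    "backup": baseline - run_model(x, ablated={"backup"}),
    "both": baseline - run_model(x, ablated={"primary", "backup"}),
}
```

In this toy, ablating "primary" alone lowers the output only slightly because "backup" compensates, and ablating "backup" alone changes nothing; only ablating both shows how much the primary path really contributes. That is one reason single-component ablations can understate a circuit's importance.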

The kind of cop-out answer here, but one that's important to keep in mind, is doing automated interpretability.
