Trenton Bricken

Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

You're saying that this feature fires for marriage, but if you activate it really strongly, it doesn't change the outputs of the model in a way that would correspond to it.

Dwarkesh Podcast

I think these would both be good critiques.

I guess one more is, and we tried to do experiments on MNIST, which is a dataset of images of handwritten digits.

And we didn't look super hard into it.

And so I'd be interested if other people wanted to take up a deeper investigation.

But it's plausible that your latent space of representations is dense and it's a manifold instead of being these discrete points.

And so you could move across the manifold, but at every point there would be some meaningful behavior.

And it's much harder then to label things as features that are discrete.

Yeah, so ideally, you apply dictionary learning to the model.

You've found features.

Right now, we're actively trying to get the same success for attention heads, in which case we have features for both the core.

You can do it for residual stream, MLP, and attention throughout the whole model.
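
To make the dictionary learning idea concrete, here is a minimal sketch of a sparse-autoencoder-style dictionary in plain Python. It is not the setup discussed in the episode: the weights, the tied encoder and decoder, and the names `DECODER`, `encode`, and `decode` are all assumptions chosen for illustration. The point is only that an activation vector is rewritten as a non-negative combination of more feature directions than it has dimensions, and can be reconstructed from them.

```python
# Toy dictionary over a 2-dimensional "activation" space.
# All weights are hand-picked for illustration, not learned.

def relu(values):
    """Zero out negative entries."""
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Hypothetical dictionary: 3 feature directions in 2 dimensions,
# i.e. more features than dimensions (an overcomplete basis).
DECODER = [
    [1.0, 0.0],   # feature 0
    [0.0, 1.0],   # feature 1
    [-1.0, 0.0],  # feature 2
]
ENCODER = DECODER  # tied weights: a simplifying assumption
ENC_BIAS = [0.0, 0.0, 0.0]

def encode(activation):
    """Map an activation vector to non-negative feature activations."""
    pre = matvec(ENCODER, activation)
    return relu([p + b for p, b in zip(pre, ENC_BIAS)])

def decode(features):
    """Rebuild the activation as a weighted sum of feature directions."""
    dim = len(DECODER[0])
    return [
        sum(f * DECODER[i][d] for i, f in enumerate(features))
        for d in range(dim)
    ]

features = encode([-3.0, 1.0])     # only features 1 and 2 are active
reconstruction = decode(features)  # recovers the original activation
```

In a real setting the encoder and decoder would be trained on activations from the residual stream, MLP, or attention outputs, with a sparsity penalty on the feature activations; none of that training is shown here.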

Hopefully, at that point, you can also identify broader circuits through the model that are more general reasoning abilities that will activate or not activate.

But in your case, where we're trying to figure out if this pull request should be approved or not,

I think you can flag or detect features that correspond to deceptive behavior, malicious behavior, these sorts of things and see whether or not those have fired.
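
As a rough illustration of what "see whether or not those have fired" could look like, the sketch below scans per-token feature activations for a small watchlist. Everything in it is hypothetical: the feature ids, the labels in `FLAGGED`, the threshold, and the function name `fired_flags` are assumptions for the example, not anything stated in the conversation.

```python
# Hypothetical watchlist: feature id -> human-readable label.
FLAGGED = {17: "deception", 42: "malicious code"}

def fired_flags(feature_acts_per_token, threshold=1.0):
    """Return {label: [(token_position, activation), ...]} for flagged
    features whose activation exceeds the threshold.

    feature_acts_per_token is a list with one dict per token, mapping
    feature id -> activation (absent ids are treated as 0.0).
    """
    hits = {}
    for pos, acts in enumerate(feature_acts_per_token):
        for fid, label in FLAGGED.items():
            value = acts.get(fid, 0.0)
            if value > threshold:
                hits.setdefault(label, []).append((pos, value))
    return hits

# Made-up activations for a three-token input.
example = [{3: 0.5}, {17: 2.5, 3: 0.1}, {42: 0.2}]
report = fired_flags(example)
```

A check like this only tells you that a labeled feature was active somewhere in the input. It says nothing about whether the label is accurate or whether the feature actually drives the model's output.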

That would be like an immediate, you can do more than that, but that would be an immediate.

Yeah.

So, I mean, the induction head is probably one of the simplest.
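
For readers unfamiliar with the term, the behavior usually attributed to an induction head can be sketched as a plain lookup: find the most recent earlier occurrence of the current token and predict whatever followed it. The function below is an assumption-laden toy (the name `induction_predict` and the list-of-strings input are made up), and it mimics the input-output pattern only, not the attention mechanism that implements it in a transformer.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token.

    Returns None if the current token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Walk backwards over earlier positions, most recent first.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

prediction = induction_predict(["Mr", "Dursley", "was", "proud", "Mr"])
```

Here the earlier "Mr" was followed by "Dursley", so the sketch predicts "Dursley" again after the second "Mr".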