Sholto Douglas

Speaker
1567 total appearances

Podcast Appearances

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

By the way, for the audience, linear probe is you just, like, classify the activations.
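As a rough illustration of "classify the activations", here is a minimal sketch of a linear probe: a logistic-regression classifier fit directly on activation vectors. Everything in it is an assumption made for illustration. The activations are synthetic random vectors rather than real model activations, the labels are made up, and the names train_linear_probe and probe_predict are hypothetical.

```python
import numpy as np

def train_linear_probe(activations, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent.

    activations: (n_examples, n_dims) array of activation vectors.
    labels: (n_examples,) array of 0/1 labels.
    Returns a weight vector (one weight per dimension) and a bias.
    """
    n, d = activations.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = activations @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
        grad = probs - labels                  # gradient of log loss w.r.t. logits
        w -= lr * (activations.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_predict(activations, w, b):
    """Predict class 1 wherever the linear score is positive."""
    return (activations @ w + b > 0).astype(int)

# Synthetic stand-in for activations (assumption): Gaussian noise plus a
# shift along one fixed direction whose sign depends on the label.
rng = np.random.default_rng(0)
n, d = 400, 16
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
activations = rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, direction)

w, b = train_linear_probe(activations, labels)
accuracy = (probe_predict(activations, w, b) == labels).mean()
print(f"probe accuracy on the synthetic data: {accuracy:.2f}")
```

The probe is "linear" because the only learned parameters are a single weight vector and a bias, so it can only pick up information that is readable along one direction in activation space. In practice the accuracy would be measured on held-out examples rather than on the training set used here.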

I don't know.

From what I vaguely remember about the paper was, like, if it's, like, telling a lie, then you, like, you just train a classifier on, like, is it, yeah, in the end, was it a lie or was it just, like, wrong or something?

I don't know.

Yeah, it's, like, a classifier on activations.

And it's like... So you've done the projecting out to the million whatever features or something.

Mm-hmm.

Is a circuit, because maybe we're using feature and circuit interchangeably when they're not.

So is there like a deception?

But doesn't all this require you to have labels for all those examples?

And if you have those labels, then whatever faults the linear probe has, maybe you've labeled the wrong thing or whatever, wouldn't the same thing apply to the labels you've come up with for the unsupervised features you've come up with?

And I guess the hope is you found a bunch of things that light up when it's being deceptive, and then you can figure out why some of those things are lighting up in this part of the distribution and not this other part and so forth.

Do you anticipate you'll be able to understand?

I don't know, the current models you've studied are pretty basic, right?

Do you think you'll be able to understand why GPT-7 fires in certain domains but not in other domains?

What is the highest level feature you've found so far?

Like it's basically for whatever it's like, maybe it's just like, um, and The Symbolic Species language, the book you recommended, there's like indexical, uh, things where you're just, I forgot what all the labels were, but like, there's things where you're just like, uh, you see a tiger and you're like run and whatever, you know, just like a very sort of behaviorist thing.

And then there's like a higher level at which, uh, when I refer to love, it refers to like a movie scene or my girlfriend or whatever, you know what I mean?

So it's like the top of the tent.

What is the highest level?