Trenton Bricken

Speaker
1589 total appearances

Podcast Appearances

Dwarkesh Podcast
Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

I mean, probably one of the ones that we, well, publicly shared in our update.

So I think there were some related to love and to sudden changes in scene, particularly associated with wars being declared.

There are a few of them in that post, if you want to link to it.

Yeah.

But even Bruno Olshausen had a paper back in 2018 or 2019, where they applied a similar technique to a BERT model and found that as you go to deeper layers of the model, things become more abstract.

So I remember in the earlier layers, there'd be a feature that would just fire for the word "park."

But later on, there was a feature that fired for "Park" as a last name, like Lincoln Park, and it's a common Korean last name as well.

And then there was a separate feature that would fire for parks as grassy areas.

So there's other work that points in this direction.

Yeah, I really want to do more work.

I guess the sleeper agents work is in this direction: what happens to a model when you fine-tune it, when you RLHF it, these sorts of things.

I mean, maybe it's trite, but you could just say you conclude that people contain multitudes, right?

Inasmuch as they have lots of different features.

There's even the stuff related to the Waluigi effect: in order to know what's good or bad, you need to understand both of those concepts.

And so we might have to have models that are aware of violence and have been trained on it in order to recognize it.

Can you post-hoc identify those features and ablate them in a way where maybe your model's slightly naive, but you know that it's not going to be really evil?

Totally, that's in our toolkit, which seems great.
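
As a rough illustration of what "ablating a feature" could mean, here is a minimal sketch. It assumes a feature corresponds to a single direction in a model's activation space, and it removes that direction by projection. The function names and the toy vectors are hypothetical and are not taken from the work discussed in the episode.

```python
import math


def normalize(v):
    """Return v scaled to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("feature direction must be non-zero")
    return [x / norm for x in v]


def ablate_feature(activation, direction):
    """Remove the component of `activation` that lies along `direction`.

    Assumption: the feature is one direction in activation space, so
    ablating it means projecting the activation onto the subspace
    orthogonal to that direction.
    """
    unit = normalize(direction)
    # How strongly the feature is present in this activation.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract the feature's contribution, leaving everything else intact.
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy example: a 3-dimensional activation and a feature along the first axis.
activation = [2.0, 1.0, -1.0]
feature_direction = [1.0, 0.0, 0.0]
ablated = ablate_feature(activation, feature_direction)
```

After the projection, the ablated activation has no component along the feature direction, while the other components are unchanged. In a real model, several redundant features may carry similar information, so a single projection like this may not be enough on its own.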

Oh, really?

pathways or whatever you modify, and then the pathway, to you, looks like you just change those. But you were mentioning earlier there's a bunch of redundancy in the model. Yeah, so you need to account for all that, but we have a much better microscope into this now than we used to, like sharper tools for making edits. And it seems like, at least from my perspective, that seems like one of the primary ways of, to some degree

The superhuman feature question is a very good one.