Trenton Bricken

Sholto Douglas & Trenton Bricken - How to Build & Understand GPT-7's Mind

I mean, probably one of the ones that we shared publicly in our update.

So I think there were some related to love and to sudden changes in scene, particularly associated with wars being declared.

Dwarkesh Podcast

There are a few of them in that post, if you want to link to it.

Yeah.

But even Bruno Olshausen had a paper back in 2018 or 2019, where they applied a similar technique to a BERT model and found that as you go to deeper layers of the model, things become more abstract.

So I remember in the earlier layers, there'd be a feature that would just fire for the word "park."

But later on, there was a feature that fired for "park" as a last name, like Lincoln Park, and it's a common Korean last name as well.

And then there was a separate feature that would fire for parks as grassy areas.

So there's other work that points in this direction.

Yeah, I really want to do more work.

I guess the sleeper agents work is in this direction: what happens to a model when you fine-tune it, when you RLHF it, these sorts of things.

I mean, maybe it's trite, but you could just say you conclude that people contain multitudes, right?

Insofar as they have lots of different features.

There's even the stuff related to the Waluigi effect: in order to know what's good or bad, you need to understand both of those concepts.

And so we might have to have models that are aware of violence and have been trained on it in order to recognize it.

Can you post-hoc identify those features and ablate them in a way where maybe your model's slightly naive, but you know that it's not going to be really evil?

Totally, that's in our toolkit, which seems great.

Oh, really?

...pathways or whatever you modify, and then the pathway, to you, looks like you just change those. But you were mentioning earlier there's a bunch of redundancy in the model. Yeah, so you need to account for all that. But we have a much better microscope into this now than we used to, like sharper tools for making edits. And it seems like, at least from my perspective, that seems like one of the primary ways of, to some degree...

The superhuman feature question is a very good one.
