Trenton Bricken
I mean, probably one of the ones that we shared publicly in our update.
So I think there were some related to love, and to sudden changes in scene, particularly associated with wars being declared.
There are a few of them in that post, if you want to link to it.
Yeah.
But even Bruno Olshausen had a paper back in 2018 or '19 where they applied a similar technique to a BERT model and found that as you go to deeper layers of the model, things become more abstract.
So I remember in the earlier layers, there'd be a feature that would just fire for the word "park."
But later on, there was a feature that fired for "Park" as a last name, like Lincoln Park, or, it's a common Korean last name as well.
And then there was a separate feature that would fire for parks as like grassy areas.
So there's other work that points in this direction.
Yeah, I really want to do more work.
I guess the sleeper agents work is in this direction: what happens to a model when you fine-tune it, when you RLHF it, these sorts of things.
I mean, maybe it's trite, but you could just conclude that people contain multitudes, right?
Insomuch as they have lots of different features.
There's even the stuff related to the Waluigi effect, where in order to know what's good or bad, you need to understand both of those concepts.
And so we might have to have models that are aware of violence and have been trained on it in order to recognize it.
Can you post-hoc identify those features and ablate them in a way where maybe your model's slightly naive, but you know that it's not going to be really evil?
Totally, that's in our toolkit, which seems great.
Oh, really?
pathways or whatever you modify, and then the pathway to you looks like you just change those. But you were mentioning earlier there's a bunch of redundancy in the model. Yeah, so you need to account for all that, but we have a much better microscope into this now than we used to, like sharper tools for making edits. And it seems like, at least from my perspective, that seems like one of the primary ways of, to some degree
The superhuman feature question is a very good one.