Trenton Bricken
That's a red flag.
You could also coarse-grain it so that it's just a single base64 feature.
I mean, even the fact that this came up and we could see that it specifically favors these particular outputs and it fires for these particular inputs gets you a lot of the way there.
I'm even familiar with cases from the auto-interp side where a human will look at a feature and annotate it as firing for Latin words, and then when you ask the model to classify it, it says it fires for Latin words defining plants. So the model can already beat the human in some cases at labeling what's going on. At scale, this would require an adversarial...
Yeah, but you can even automate this process, right?
I mean, this goes back to the determinism of the model.
You could have a model that is actively editing input text and predicting if the feature is going to fire or not, and figure out what makes it fire, what doesn't, and search the space.
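A minimal sketch of that search loop, assuming we can query a feature's activation on arbitrary text. `feature_activation` here is a toy stand-in (a crude proxy for a base64-ish feature that fires on long alphanumeric tokens); in practice it would run the model and read out the feature's activation.

```python
def feature_activation(text: str) -> float:
    # Toy stand-in for reading a feature's activation from the model:
    # fires on long, fully alphanumeric tokens (a crude base64 proxy).
    tokens = text.split()
    return max(
        (sum(c.isalnum() for c in t) / max(len(t), 1)) * (len(t) > 12)
        for t in tokens
    )

def probe_feature(text: str, threshold: float = 0.5):
    """Ablate each token in turn and record which ones the feature depends on."""
    tokens = text.split()
    base = feature_activation(text)
    influential = []
    for i in range(len(tokens)):
        edited = " ".join(tokens[:i] + tokens[i + 1:])
        # If deleting this token changes the activation a lot, it matters.
        if edited and abs(feature_activation(edited) - base) > threshold * base:
            influential.append(tokens[i])
    return base, influential
```

A real version would let a model propose the edits (paraphrases, substitutions, adversarial strings) rather than just deleting tokens, but the structure of the search is the same.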
Especially for scalability.
I think it's underappreciated right now.
I mean, so at some point, I think you might just start fitting noise or things that are part of the data, but that the model isn't actually representing.
Yeah, yeah.
So it's the part before that, where the model will learn however many features it has capacity for that still span the representation space.
Yeah, so if you don't give the model that much capacity for the features it's learning (concretely, if you don't project to as high-dimensional a space), it will learn one feature for birds.
But if you give the model more capacity, it will learn features for all the different types of birds.
And so it's more specific than otherwise.
And oftentimes, there's the bird vector that points in one direction, and all the other specific types of birds point in a similar region of the space, but are obviously more specific than the coarse label.
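A toy illustration of that geometry (not a trained dictionary): modeling each species feature as the shared "bird" direction plus a small species-specific offset, the specific vectors all stay close to the coarse one. The dimension and offset scale are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Coarse "bird" direction, normalized to unit length.
bird = rng.normal(size=d)
bird /= np.linalg.norm(bird)

# Species features: the bird direction plus a small random offset each.
species = [bird + 0.05 * rng.normal(size=d) for _ in range(5)]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every species vector points in a similar region of the space as "bird".
sims = [cos(bird, s) for s in species]
```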
Yeah, so you do dictionary learning after you've trained your model.
And you feed it a ton of inputs.
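The setup being described, sketched as a sparse autoencoder's forward pass over a batch of (here, random stand-in) activations. The dimensions, tied initialization, and L1 penalty weight are all assumptions for illustration; a real run would collect activations from the trained model and optimize this loss over many inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_dict, n = 128, 1024, 256   # assumed sizes for the sketch

# Stand-in for activations collected from the trained model on many inputs.
acts = rng.normal(size=(n, d_model))

# Encoder/decoder of the dictionary; tied transpose init is one common choice.
W_enc = rng.normal(size=(d_model, d_dict)) / np.sqrt(d_model)
b_enc = np.zeros(d_dict)
W_dec = W_enc.T.copy()
b_dec = np.zeros(d_model)

# Project into the (higher-dimensional) feature space with a ReLU,
# then reconstruct the original activations from the sparse codes.
features = np.maximum(acts @ W_enc + b_enc, 0.0)
recon = features @ W_dec + b_dec

# Reconstruction error plus an L1 sparsity penalty on the feature codes.
mse = ((recon - acts) ** 2).mean()
l1 = np.abs(features).mean()
loss = mse + 4e-3 * l1
```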