Andrew Ilyas
on clean pictures of dogs and cats.
And so what that means is that this useful versus useless features dichotomy is not enough to explain what happened here.
There has to be something else at play.
And so by doing this experiment, we basically gave support for this model of non-robust features, whereby not only are there useless features and useful features, but within the class of useful features there are features that are genuinely useful for the classification task at hand, yet just happen to not be robust to small perturbations.
And so a very toy example of what that would look like is: imagine there were a single pixel in the top right of every example that told you exactly what the class was.
Then you wouldn't have to learn anything about the image.
You could just get 100% accuracy by looking at this pixel.
But if you did that, an adversary could come in and, by changing just that one pixel, completely flip your classification.
And so what we're showing is that, obviously, this one-pixel scenario isn't real, but there are features like this in standard image datasets.
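The one-pixel scenario described above can be sketched numerically. This is a hypothetical illustration, not code from the paper: we plant a single "pixel" that perfectly encodes the label, so a classifier reading only that pixel gets perfect clean accuracy, and an adversary flipping just that pixel destroys it.

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 fake 8x8 grayscale "images" with binary labels (hypothetical data).
labels = rng.integers(0, 2, size=100)
images = rng.random((100, 8, 8))

# Plant the non-robust feature: the top-right pixel equals the label.
images[:, 0, 7] = labels

def classify(batch):
    # A classifier that relies only on the planted pixel.
    return (batch[:, 0, 7] > 0.5).astype(int)

acc_clean = (classify(images) == labels).mean()  # perfect on clean data

# Adversary perturbs just that one pixel.
adv = images.copy()
adv[:, 0, 7] = 1 - labels
acc_adv = (classify(adv) == labels).mean()  # every prediction flips

print(acc_clean, acc_adv)  # 1.0 0.0
```

The feature is "genuinely useful" in the sense that it is perfectly predictive on the data distribution, yet a tiny perturbation erases it, which is exactly the dichotomy being described.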
That's a great question.
I think it gets back to this idea of disambiguating different notions of robustness.
And so ideally, we want our models to be robust in every sense of the word.
If they were truly learning human-level features, then we'd want them to A, be adversarially robust,
because humans are adversarially robust.
So we'd like that.
B, we'd like them to be data-space robust.
And that's what we were talking about earlier: your prediction should not hinge on one or two examples being removed from your training set, because that feels like the kind of exemplar-based, not really abstract, memorization behavior that we were just talking about.
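A minimal sketch of that data-space brittleness, using a hypothetical 1-nearest-neighbor classifier (my own toy setup, not from the conversation): removing a single training example flips the prediction on a test point, which is exemplar-based behavior rather than an abstract learned feature.

```python
import numpy as np

# Tiny 1-D training set (hypothetical): positions and binary labels.
train_x = np.array([0.0, 1.0, 5.0])
train_y = np.array([0, 1, 0])
test_x = 0.9

def predict_1nn(xs, ys, q):
    # Predict the label of the single nearest training example.
    return ys[np.argmin(np.abs(xs - q))]

# With the full training set, the nearest example (x=1.0) gives label 1.
full = predict_1nn(train_x, train_y, test_x)

# Remove that one nearest example; now x=0.0 is nearest, giving label 0.
loo = predict_1nn(train_x[[0, 2]], train_y[[0, 2]], test_x)

print(full, loo)  # 1 0
```

A model whose predictions are this sensitive to a single training example is, in the sense used above, not data-space robust.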
And then C, which is something that I haven't explored but I know is being explored right now: you also want them to learn features that are causally robust, in the sense that you want them to learn to avoid spurious features and