Andrew Ilyas
Yeah, absolutely.
And that's something that was observed by a variety of papers before ours.
Whether that's some information-theoretic loss of accuracy, because, like you said, you've blindfolded the model from a bunch of features, is unclear, because on the one hand we know humans can classify images in an adversarially robust way.
So it's probably possible to get that level of accuracy.
I think what really happens is some combination of, you know, you're blindfolding the model from learning these features that it would normally learn, and you're basically asking it to instead learn a bunch of features that are much harder to learn robustly.
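To make that concrete, here is a minimal sketch, assuming a PyTorch-style setup, of the kind of PGD-based adversarial training being described: the inner loop searches for a worst-case perturbation, and the model is then trained only on those perturbed inputs, which is what pushes it away from the easy, non-robust features. The function and variable names are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Find a worst-case L-infinity perturbation of x within radius eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        # Gradient ascent on the loss, then project back into the eps-ball
        # and keep pixels in a valid [0, 1] range.
        delta.data = (delta.data + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.data = (x + delta.data).clamp(0, 1) - x
        delta.grad.zero_()
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y):
    """One step of the min-max objective: the model only sees worst-case inputs."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()  # discard gradients accumulated during the attack
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```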
And so there's this paper that I saw on arXiv just a couple of days ago, which I think is at ICML, where they basically tried to build scaling laws for adversarially robust training.
And they basically forecasted that if we had something like 10^30 FLOPs, we could train adversarially robust networks that have state-of-the-art accuracy and match human accuracy in terms of their adversarial robustness. But you need tons of data and tons of compute.
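As a rough illustration of what that kind of scaling-law extrapolation looks like, here is a small sketch that fits a saturating power law to hypothetical (compute, robust error) points and inverts it to estimate the compute needed for a target robust error. The numbers are made up for illustration and are not taken from the paper being discussed.

```python
import numpy as np
from scipy.optimize import curve_fit

def robust_err(log10_compute, a, b, e_inf):
    # err(C) = a * C^(-b) + e_inf: power-law decay toward an irreducible floor,
    # parameterized in log10(compute) to keep the fit well-conditioned.
    return a * 10.0 ** (-b * log10_compute) + e_inf

log_compute  = np.array([18.0, 19.0, 20.0, 21.0])   # log10(training FLOPs), fake
robust_error = np.array([0.60, 0.52, 0.45, 0.39])   # 1 - robust accuracy, fake

(a, b, e_inf), _ = curve_fit(robust_err, log_compute, robust_error,
                             p0=[5.0, 0.05, 0.10], maxfev=20000)

target = 0.15  # hypothetical target robust error
if target > e_inf:
    log10_needed = np.log10(a / (target - e_inf)) / b
    print(f"extrapolated compute for {target:.0%} robust error: ~1e{log10_needed:.1f} FLOPs")
else:
    print("target is below the fitted irreducible error floor")
```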
Yeah, absolutely.
I think that the core of the features-versus-bugs question is really whether or not there exists information within adversarial examples that's sufficient for good generalization on the rest of the data.
It's sort of this question about whether adversarial examples happen along some different axis than just normal misclassified examples. And I think that, if I'm thinking of the right work, something that's supported both by that work and by ours is that there isn't really that significant a distinction between an adversarial example and any other misclassified input. When someone looks at a misclassified input, they're not like, oh, this is so mysterious. They're just like, okay,
I learned the wrong features.
Features happen to point the other way on this image.
Tough luck.
We'll get them next time.
And so I think if you think about adversarial examples from that perspective, what we're showing is exactly that on these adversarial examples, it just so happens that the features that your model learned are pointing the wrong way.
And I think it's really nice that this paper basically used adversarial examples and adversarial robustness as a way of setting the features used by a neural network.
Something that I really appreciate.
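For readers who want the experiment pinned down, here is a compressed sketch, in the spirit of the features-versus-bugs work, of the "non-robust features suffice" construction being referred to: each training image is perturbed toward a different target label, relabeled with that target, and a fresh model trained on the relabeled data is then evaluated on the clean test set. The attack parameters, loaders, and models are placeholders rather than the original code.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target, eps=0.5, alpha=0.1, steps=100):
    """L2-bounded targeted attack: push `model` toward predicting `target` on x."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), target)
        loss.backward()
        # Normalized gradient *descent* on the targeted loss, per example.
        grad = delta.grad
        grad_norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
        delta.data = delta.data - alpha * grad / grad_norm
        # Project each example's perturbation back onto the L2 ball of radius eps.
        delta.data = delta.data.renorm(p=2, dim=0, maxnorm=eps)
        delta.grad.zero_()
    return (x + delta).detach()

def build_nonrobust_dataset(source_model, loader, num_classes=10):
    """Relabel adversarially perturbed images with their attack targets."""
    xs, ys = [], []
    for x, y in loader:
        # Pick a target class that differs from the true label.
        target = (y + torch.randint(1, num_classes, y.shape)) % num_classes
        xs.append(targeted_pgd(source_model, x, target))
        ys.append(target)
    return torch.cat(xs), torch.cat(ys)

# A fresh model trained on this relabeled data -- images that look mislabeled to
# a human -- generalizing to the clean test set is the evidence that the
# perturbations carry real, predictive (non-robust) features.
```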
Yeah, so that was one of the papers I mentioned that I did during my undergrad, when I first got hooked on this concept of adversarial examples.