Andrew Ilyas
sort of the OpenAI scale.
But I don't think we're at a lack of interesting questions that can be studied at our scale.
I think there are tons of them.
Absolutely.
OK, so taking a step back, the context here is we're looking at adversarial examples in the vision context, which is this phenomenon where you can add a very small perturbation to a natural image.
And by adding that small perturbation, a machine learning model that normally does very, very well will consistently misbehave whenever it sees these perturbed images.
And so we were trying to figure out why.
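To make the phenomenon concrete, here is a minimal sketch of one standard way such perturbations are constructed, the fast gradient sign method. This is a generic illustration, not the specific attack setup used in the paper; `model`, `x`, and `y` stand in for any image classifier, input batch, and labels.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=8 / 255):
    """Fast gradient sign method: a tiny, bounded perturbation chosen
    to push the model's loss up on this particular input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Move each pixel by at most epsilon in whichever direction
    # increases the loss; the result looks identical to a person.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()
```

A model that classifies `x` correctly will often misclassify `fgsm_perturb(model, x, y)`, even though the two images are visually indistinguishable.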
And I would say that the paper sort of centered around one experiment and one result that we found particularly surprising.
So just to set the stage a little bit, I think at the time when we were writing this paper,
the conceptual model that people had of adversarial examples was that they were somehow what we refer to as bugs.
But you can think of these in a variety of ways.
Essentially, when you make these adversarial examples, there are a bunch of words for it, like leaving the image manifold, but you're really sort of adding something useless to the image.
There is this intuition that when you train a neural network, it learns a bunch of useful features, and then it also is sensitive to a bunch of useless features.
And that could be because of overfitting, it could be because of finite-sample error.
It doesn't matter which.
And then the intuition went, OK, well, now that you've learned a bunch of useful features and a bunch of useless features, an adversary can come in at test time.
They can change all of these useless features.
They won't have changed anything meaningful about the input, which is why it looks the same to us.
But the machine learning model that depended on these useless features is now going to be completely misled.
So that's the conceptual model going into this work.
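As a toy numerical illustration of that intuition (a hypothetical linear classifier, not anything from the paper): if the model places nonzero weight on meaningless coordinates, an adversary who nudges only those coordinates by a tiny amount can flip the prediction without touching the useful signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear classifier: one genuinely useful feature plus 100
# "useless" features the model is nevertheless sensitive to.
w_useful = np.array([1.0])
w_useless = rng.normal(size=100)   # no real signal, but nonzero weights
w = np.concatenate([w_useful, w_useless])

# Clean input: the useful feature clearly says "class +1".
x = np.concatenate([np.array([1.0]), np.zeros(100)])
print(np.sign(w @ x))              # +1, correct

# Adversary: leave the useful feature alone, make a tiny coordinated
# change to the useless coordinates only.
eps = 0.05
x_adv = x.copy()
x_adv[1:] -= eps * np.sign(w_useless)
print(np.linalg.norm(x_adv - x))   # ~0.5, a small perturbation
print(np.sign(w @ x_adv))          # -1, the prediction flips
```

The useful feature still says +1, but the model's dependence on the useless directions is enough for the adversary to exploit.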