Andrew Ilyas
And what we really wanted to do was test this out.
And so we designed this really simple experiment.
We took a standard image classification data set, took its training set along with a fixed model, and made an adversarial example out of every single sample in the training set.
So what that means is, let's say for simplicity you just have a dogs-versus-cats training set.
Every cat you adversarially perturb to become a dog, every dog you adversarially perturb to become a cat.
And so the result is this data set of cats that have been slightly perturbed to be classified as dogs and dogs that have been slightly perturbed to be classified as cats.
Now, once you've done that, we're going to relabel the data according to the adversarial class.
So the resulting data set looks completely mislabeled to a human.
You just have a bunch of cats labeled as dogs and dogs labeled as cats.
Of course, these are all slightly perturbed images.
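Here is a minimal sketch of that construction step, assuming a binary cats-versus-dogs setup and a pretrained PyTorch model; the names (`model`, `targeted_pgd`) and the attack hyperparameters are illustrative assumptions, not details from the episode.

```python
# Sketch: perturb every image toward the *opposite* class, then relabel it
# with that adversarial class. Hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target, eps=0.5, step=0.1, iters=40):
    """Perturb x within an L2 ball of radius eps so the model predicts `target`."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            # Step *against* the gradient: minimize loss on the target class.
            g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
            x_adv = x_adv - step * g
            # Project the perturbation back into the L2 ball around x.
            delta = x_adv - x
            norm = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
            delta = delta * (eps / norm).clamp(max=1.0)
            x_adv = (x + delta).clamp(0, 1)
    return x_adv.detach()

def build_relabeled_dataset(model, images, labels):
    """Every cat (0) is perturbed toward dog (1) and vice versa,
    then relabeled with the adversarial class."""
    flipped = 1 - labels                        # the adversarial target class
    adv_images = targeted_pgd(model, images, flipped)
    return adv_images, flipped                  # note: the labels kept are the flipped ones
```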
We then throw away the original model, and we throw away the original data set, and we just train a new model on this data set of mislabeled dogs and mislabeled cats.
And the question is, what's going to happen with this new model if we test it on the original data distribution?
So just clean pictures of cats labeled as cats and dogs labeled as dogs.
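A sketch of that second step, again with illustrative names and hyperparameters: train a fresh model from scratch on the relabeled adversarial data, then measure its accuracy on the original, unmodified test set.

```python
# Sketch: train a new model only on the mislabeled adversarial data,
# then evaluate on the clean, correctly labeled test distribution.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train_and_evaluate(new_model, adv_images, adv_labels,
                       clean_test_images, clean_test_labels,
                       epochs=10, lr=1e-3, batch_size=128):
    opt = torch.optim.Adam(new_model.parameters(), lr=lr)
    train_loader = DataLoader(TensorDataset(adv_images, adv_labels),
                              batch_size=batch_size, shuffle=True)
    new_model.train()
    for _ in range(epochs):
        for xb, yb in train_loader:
            opt.zero_grad()
            loss = F.cross_entropy(new_model(xb), yb)
            loss.backward()
            opt.step()

    # Test on clean images with their true labels: real cats labeled cat, dogs labeled dog.
    new_model.eval()
    with torch.no_grad():
        preds = new_model(clean_test_images).argmax(dim=1)
    return (preds == clean_test_labels).float().mean().item()
```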
And if you think back to this useful versus useless features dichotomy, you should basically only expect two things.
Either the classifier will get 0%, because you've trained it on a bunch of cats labeled as dogs and dogs labeled as cats, and you're testing it on correctly labeled images.
Or maybe you're slightly more skeptical and you're like, well, actually, I think it could get 50%.
Because it's possible that by making these adversarial examples in the training set, you're actually introducing some variation along the useless features.
And then the model might just cling to that useless variation and then get 50% on the clean data.
The interesting thing, though, is that when you train a classifier on this entirely mislabeled data set, that classifier gets 90% accuracy.