Andrew Ilyas
At the time, the space of adversarial example attacks was much less mature than it is now.
There were fast gradient sign method-based attacks, which just take a single step in the direction of the sign of the gradient.
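As a rough illustration of that single-step idea, here is a minimal FGSM-style sketch in PyTorch; `model`, `image`, and `label` are hypothetical placeholders, not anything from the work discussed here.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """Single-step fast gradient sign method: move the input by epsilon
    in the direction of the sign of the loss gradient."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # One signed gradient step, clipped back to the valid pixel range
    return (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
```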
There were a couple of gradient-based attacks.
And this field of gradient-free or black box attacks was just emerging.
And the idea here is that, OK, if you're an adversary and you're trying to attack a production machine learning system, there's no way that that production machine learning system is going to be like, here are our model weights.
Take some gradients.
Go ahead.
What you usually have is an API, some kind of thing where you can query it with an image and then it'll reply with, here are the labels.
And generally, under that threat model where you have only query access to a machine learning system, existing methods couldn't do adversarial attacks well.
So there were sort of two lines of work emerging as we started that black box adversarial attacks work.
One was on using adversarial transferability.
So the idea is that if you wanted to attack a production model, what you should do is train your own model locally, attack that model, and then deploy the attacked image against the production system and see what happens.
And so that was interesting, but sort of separate from what we were interested in.
We wanted to see if we could really attack the system directly.
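A rough sketch of that transfer recipe, reusing the hypothetical `fgsm_attack` helper above; `local_model` and `production_api` are illustrative stand-ins for a locally trained surrogate and the query-only system.

```python
# Craft the adversarial image against a locally trained surrogate model,
# then send it to the black-box production system and check its answer.
adversarial_image = fgsm_attack(local_model, image, label, epsilon=0.03)
production_prediction = production_api(adversarial_image)
print("Production model predicts:", production_prediction)
```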
And so there was one really nice piece of work out at the time called, I think, ZOO, which was a zeroth-order optimization-based attack.
And what they basically did is estimate the gradient component-wise using a zeroth-order gradient estimator, which is basically: you start with your image.
For each pixel, you add epsilon.
You subtract epsilon.
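A minimal sketch of that coordinate-wise, zeroth-order estimate, assuming a hypothetical `query_loss` function that returns a scalar loss from the black-box API (an illustration of the finite-difference idea, not the actual ZOO implementation):

```python
import numpy as np

def zeroth_order_gradient(query_loss, image, epsilon=1e-4):
    """Estimate the gradient of a black-box loss coordinate by coordinate
    with symmetric finite differences: query at +epsilon and -epsilon."""
    flat = image.ravel().astype(np.float64)
    grad = np.zeros_like(flat)
    for i in range(flat.size):
        plus, minus = flat.copy(), flat.copy()
        plus[i] += epsilon
        minus[i] -= epsilon
        # Two black-box queries per pixel
        grad[i] = (query_loss(plus.reshape(image.shape)) -
                   query_loss(minus.reshape(image.shape))) / (2 * epsilon)
    return grad.reshape(image.shape)
```

Note the cost: two queries per pixel for every gradient estimate, which adds up quickly for even modestly sized images.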