Bypassing ten adversarial detection methods (Carlini and Wagner, 2017)
Some rough notes on this paper.
Paper: Adversarial examples are not easily detected: Bypassing ten detection methods (Carlini and Wagner, 2017).
In brief: Adversarial defences are often flimsy. The authors bypass ten detection methods for adversarial examples, using both black-box and white-box attacks. The C&W attack is the main attack used. The most promising defence they evaluated measures the classification uncertainty of each image by generating randomised versions of the model (dropout at test time).
Scenarios
The authors evaluated three different scenarios, each defined by how much knowledge the adversary has about the defence.
- Zero-knowledge adversary: the attacker isn’t aware there is a detector in place. Adversarial examples are generated with the C&W attack and then tested against the defence.
- Perfect-knowledge adversary (white-box attack): the attacker knows a detector is in place, knows the type of detector, knows the model parameters used in the detector, and has access to the training data. The difficult part is constructing a loss function that incorporates the detector, so that the generated adversarial examples fool the classifier and the detector at the same time.
- Limited-knowledge adversary (black-box attack): the attacker knows there is a detector in place and knows what type of detector it is, but doesn’t know the parameters of the detector and doesn’t have access to the training data. The attacker first trains a substitute model on a separate training set, in the same way as the original model was trained. They know the parameters of this substitute, so they can generate adversarial examples against it with a white-box attack. These adversarial examples are then tested on the original model (a minimal sketch of this workflow follows the list).
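A minimal sketch of that black-box workflow, using tiny MLPs on synthetic data so it stays self-contained. The single gradient-sign step is only a placeholder for a stronger white-box attack such as C&W, and every name and hyperparameter here is illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_model():
    # Tiny MLP standing in for both the victim and the substitute architecture.
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

def train(model, x, y, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Synthetic stand-in for real data: the label is the sign of the first feature.
x = torch.randn(400, 20)
y = (x[:, 0] > 0).long()

# The attacker never sees the victim's training split.
victim_x, victim_y = x[:200], y[:200]
sub_x, sub_y = x[200:], y[200:]

victim = make_model()
train(victim, victim_x, victim_y)
substitute = make_model()
train(substitute, sub_x, sub_y)

# White-box attack on the substitute (one gradient-sign step as a placeholder
# for a stronger attack such as C&W), then transfer the examples to the victim.
adv = sub_x.clone().requires_grad_(True)
nn.CrossEntropyLoss()(substitute(adv), sub_y).backward()
adv_examples = (sub_x + 0.5 * adv.grad.sign()).detach()

acc = (victim(adv_examples).argmax(dim=1) == sub_y).float().mean()
print(f"victim accuracy on transferred adversarial examples: {acc:.2f}")
```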
Attacks
They used one main method of attack: the L2-based C&W attack (a simplified sketch of its objective follows below).
- Other papers use JSMA or Fast Gradient Sign attacks to test their defences. These are not strong attacks. JSMA is described in The Limitations of Deep Learning in Adversarial Settings and Fast Gradient Sign in Explaining and Harnessing Adversarial Examples.
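The L2 C&W attack minimises the size of the perturbation plus a margin loss on the logits that pushes the input towards a chosen target class, using a change of variables to keep the image inside [0, 1]. A heavily simplified PyTorch sketch, assuming `model` maps a batch of inputs in [0, 1] to logits; it handles a single image and uses a fixed trade-off constant `c` instead of the paper's binary search over c:

```python
import torch
import torch.nn.functional as F

def cw_l2(model, x, target, c=1.0, kappa=0.0, steps=1000, lr=0.01):
    """Simplified targeted C&W L2 attack on a single input x in [0, 1]."""
    # Change of variables: x_adv = 0.5 * (tanh(w) + 1) always stays in [0, 1].
    # The clamp avoids infinities from atanh at pixel values of exactly 0 or 1.
    w = torch.atanh((2 * x - 1).clamp(-0.999999, 0.999999)).detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    num_classes = model(x).shape[1]
    target_mask = F.one_hot(torch.tensor(target), num_classes).bool()
    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        logits = model(x_adv)[0]
        # f(x') = max(max_{i != t} Z(x')_i - Z(x')_t, -kappa): the margin term is
        # zero (only the L2 term remains) once the target class wins by kappa.
        best_other = logits.masked_fill(target_mask, float("-inf")).max()
        f = torch.clamp(best_other - logits[target], min=-kappa)
        loss = ((x_adv - x) ** 2).sum() + c * f
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (0.5 * (torch.tanh(w) + 1)).detach()
```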
Detectors
Ten different detectors were tested.
Three of these detectors added a second network for detection. Three detectors relied on PCA to detect adversarial examples. Two detectors used other statistical methods to distinguish adversarial examples, comparing the distribution of natural images to the distribution of adversarial examples. The final two relied on input normalisation with randomisation and blurring.
- Grosse et al. and Gong et al. train a network to classify adversarial examples as their own class. Gong et al. train a separate binary classifier that labels inputs as adversarial or natural, while Grosse et al. retrain the original classifier with adversarial examples as an extra (N+1)th class. (On the (Statistical) Detection of Adversarial Examples, Grosse et al, 2017) (Adversarial and Clean Data are not Twins, Gong et al, 2017)
- Metzen et al train a second neural network on the intermediate layers of a classification (ResNet) network. (On Detecting Adversarial Perturbations, Metzen et al, 2017)
- Hendrycks & Gimpel find differences in the PCA coefficients of adversarial and natural examples, and use this as the basis of their defence. The key problem is that this difference is an artefact of the MNIST dataset; other datasets show no such difference between adversarial and regular examples. (Early Methods for Detecting Adversarial Images, Hendrycks & Gimpel, 2017)
- Bhagoji et al. first use PCA to reduce the dimensionality of the training data, and then train a classifier on the reduced data (a small sketch in this spirit follows the list). (Dimensionality Reduction as a Defense against Evasion Attacks on Machine Learning Classifiers, Bhagoji et al, 2017)
- Li et al. use a cascade classifier to detect adversarial examples. A cascade classifier is a series of classifiers, where each classifier acts on a different layer of the convolutional network. They apply PCA to the outputs of each convolutional layer, and use linear SVMs as the classifiers. (Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics, Li et al, 2016)
- Grosse et al. use an approximation to the Maximum Mean Discrepancy (MMD) test to compare the adversarial and natural distributions (a rough sketch of the MMD statistic follows the list). (On the (Statistical) Detection of Adversarial Examples, Grosse et al, 2017)
- Feinman et al. fit a Gaussian Mixture Model to the final hidden layer of a neural network, the idea being that adversarial examples come from a different distribution than natural images. (Detecting Adversarial Samples from Artifacts, Feinman et al., 2017)
- Feinman et al. also use a method called Bayesian neural network uncertainty. The idea is that after adding randomisation (with dropout layers) the network will be more uncertain on adversarial examples than on natural ones. Generate a number of randomised networks by sampling dropout and check how consistently they label each image. The hope is that natural images receive the same label across these randomised networks more often than adversarial examples do (a sketch of this idea follows the list). This was the only defence that was somewhat effective. (Detecting Adversarial Samples from Artifacts, Feinman et al., 2017)
- Li et al. blur the image (with a 3x3 average convolutional layer) before applying the classifier. (Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics, Li et al, 2016)
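In the spirit of the Bhagoji et al. defence, a small scikit-learn sketch: project the training data onto its top principal components, then fit a classifier in that reduced space. The dataset, component count, and choice of classifier are all illustrative, not the paper's setup.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Small stand-in dataset; the paper's experiments use MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Project onto the top principal components, then classify in that subspace.
clf = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("test accuracy on PCA-reduced features:", clf.score(X_test, y_test))
```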
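A rough NumPy sketch of the kernel Maximum Mean Discrepancy statistic that Grosse et al. approximate when comparing the adversarial and natural distributions. The Gaussian kernel, bandwidth, and synthetic data here are assumptions, and the actual defence additionally wraps the statistic in a permutation test to get a p-value.

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between two sample sets (rows are examples),
    using a Gaussian kernel with bandwidth sigma."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

rng = np.random.default_rng(0)
natural = rng.normal(size=(200, 10))
adversarial = rng.normal(loc=0.3, size=(200, 10))  # stand-in for perturbed inputs

# Samples from the same distribution give a value near zero; different ones give more.
print("MMD^2, natural vs natural:    ", mmd2(natural[:100], natural[100:]))
print("MMD^2, natural vs adversarial:", mmd2(natural, adversarial))
```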
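A sketch of the randomisation idea behind the Bayesian-uncertainty defence: keep dropout active at prediction time, run several stochastic forward passes, and flag inputs whose predicted labels disagree a lot. The model, disagreement score, and threshold are illustrative; Feinman et al. use a specific variance-based uncertainty measure.

```python
import torch
import torch.nn as nn

def dropout_disagreement(model, x, passes=20):
    """Fraction of stochastic forward passes whose label differs from the majority label."""
    model.train()  # keep dropout sampling active; fine here since there is no batch norm
    with torch.no_grad():
        preds = torch.stack([model(x).argmax(dim=1) for _ in range(passes)])  # (passes, batch)
    majority = preds.mode(dim=0).values
    return (preds != majority).float().mean(dim=0)  # per-example score in [0, 1]

# Illustrative model with dropout; inputs with high disagreement would be flagged.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 2))
x = torch.randn(8, 20)
scores = dropout_disagreement(model, x)
flagged = scores > 0.3  # the threshold is a hypothetical choice
print(scores)
print(flagged)
```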
Lessons
- Randomisation can increase the amount of distortion required for a successful adversarial example. This is a promising direction.
- Many defences against adversarial attacks are demonstrated only on the MNIST dataset. These defences often fail on CIFAR, which suggests they will fail on other datasets too. Defences should be tested on more datasets than just MNIST.
- Defences based around a second detection neural network seem to be easy to fool. Adversarial examples can fool one neural network, and a second one doesn’t provide much more of a challenge.
- Defences operating on raw pixel values aren’t effective. They might work against simple attacks, but not against more complex ones.
Recommendations
- Use a strong attack for evaluation, like C&W. Don’t just use the fast gradient-sign method or JSMA.
- Use a few datasets for evaluation.
- Show that white-box attacks don’t work for your defence. Doing just black-box attacks isn’t enough.
- Report false-positive and true-positive rates, and ROC curves if possible (a short sketch follows the list). Accuracy isn’t enough: the same accuracy value can correspond to a useful or a useless detector depending on where the errors fall. A low false-positive rate matters most: it’s better to miss some adversarial examples while rarely flagging natural images than to catch every adversarial example while flagging many natural images as adversarial.
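A short scikit-learn sketch of that kind of reporting, on made-up detector scores: compute the ROC curve, the AUC, and the detection rate you keep if you cap the false-positive rate at 1% (the score distributions and the 1% cap are hypothetical).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Made-up detector scores; label 1 = adversarial, 0 = natural.
labels = np.concatenate([np.zeros(500), np.ones(500)])
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])

fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC:", roc_auc_score(labels, scores))

# Report the detection rate you keep if the false-positive rate is capped at 1%.
idx = np.searchsorted(fpr, 0.01, side="right") - 1
print(f"at FPR {fpr[idx]:.3f}, the detection rate (TPR) is {tpr[idx]:.3f}")
```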
Further reading
The four papers the authors recommend for background reading (in order):
- Szegedy: Intriguing properties of neural networks (https://arxiv.org/abs/1312.6199)
- Papernot: Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples (https://arxiv.org/abs/1605.07277)
- Goodfellow: Explaining and harnessing adversarial examples (https://arxiv.org/abs/1412.6572)
- Carlini & Wagner: Towards evaluating the robustness of neural networks, which introduces the C&W attack (https://arxiv.org/abs/1608.04644)