# pwnies_please

Adversarial attack on an image classifier using the Fast Gradient Sign Method (FGSM)


Description

Disguise these pwnies to get the flag!

http://pwnies-please.chal.uiuc.tf

note: first solve gets $100 from ian (unintended solves don't count)

**author**: Anusha Ghosh, Akshunna Vaishnav, ian5v, Vanilla

Preface

Soon after the challenge was released, my teammate rainbowpigeon told me to take a look at it since it was an image classification AI challenge and I have a fair bit of experience with computer vision tasks.

I didn't have any prior experience in attacking AI models, but this turned out to be a really fun task. I ended up getting the $100 bounty for the first solve on this challenge (thanks Ian!)

I learnt a lot about how machine learning models can be vulnerable to adversarial attacks, and hopefully you can learn something from reading my writeup too!

Solution

The premise of the challenge was simple - we had to upload images to fool an image classification model, causing it to make inaccurate classifications.

Source Code Analysis

The "bouncer" is a ResNet-18 image classification model that classifies a given image as one of 10 classes ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'). This is the **non-robust** model that we have to fool, and we are given the model weights.

Another **robust** model, also using the ResNet-18 architecture, is used. This is meant to be the more accurate model, and serves as the ground truth.

The objective is to generate **adversarial examples** to fool the non-robust model into misclassifying the image as anything but a horse, while maintaining the "actual" class of the image so that the robust model still classifies the image as a horse.

Every time we fool the model successfully, our "level" goes up by 1. More than three consecutive failed attempts, however, will set our "level" back to 0. We need to fool the model successfully 50 times, within 5 minutes.

Additionally, `imagehash` is used to compare the relative closeness of the submitted image to the original image. The goal is to make *tiny* changes to the original image, so that the non-robust model misclassifies the modified image.
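To get a feel for why tiny changes survive a perceptual-hash check, here is a rough sketch of the average-hash idea in plain Python. This is my own illustration, not the challenge's code: the writeup does not say which `imagehash` function or distance threshold the server uses, and the 8x8 "images" below are made-up lists of grayscale values.

```python
def average_hash(pixels):
    """Hash a flat list of grayscale values: 1 where a pixel is above the mean."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming_distance(hash_a, hash_b):
    """Number of positions where the two hashes disagree."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# Hypothetical 8x8 image: a smooth gradient from dark to bright.
original = [4 * i for i in range(64)]

# A tiny perturbation: every pixel moves by 2 intensity levels.
perturbed = [p + (2 if i % 2 == 0 else -2) for i, p in enumerate(original)]

# A genuinely different image: the gradient flipped.
inverted = [252 - p for p in original]

small_change = hamming_distance(average_hash(original), average_hash(perturbed))
large_change = hamming_distance(average_hash(original), average_hash(inverted))
print(small_change, large_change)
```

The perturbed image hashes identically to the original, while the inverted one differs in every bit. That is the kind of gap a hash-distance check can rely on to reject images that stray too far from the original.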

Backpropagation in Machine Learning

To understand the attack, we need a bit of machine learning theory. Neural networks are loosely inspired by the human brain. Like how the human brain is made up of neurons, neural networks are made up of **nodes**. These nodes span across multiple **layers**, starting from the input layer and ending at the output layer.

The "learning" takes place when the weights are updated, thus placing different priorities on different connections. Intuitively, these weights are what determines how much influence a particular input feature has on the final output.

But in order for the model to learn, a **backward pass**, or **backpropagation**, must be performed. This might seem complicated, but it really isn't - it's just the chain rule!

*OK, some people won't be happy with the above statement, so maybe it's a little more subtle than that. You don't need to know this, but backpropagation is a special case of a technique known as automatic differentiation - as opposed to symbolic differentiation - which is a nice way of efficiently computing the derivative of a program with intermediate variables. I'll refer you to Justin Domke's notes for this.*

Using the chain rule, we calculate the sensitivity of the loss to each of the inputs. This is repeated (backpropagated) through each node in the network. It might help to look at this as an optimization problem where the chain rule and memoization are used to save work calculating each of the local gradients.
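A minimal sketch of this idea, assuming a made-up two-node network (a `tanh` node followed by a linear node, with a squared-error loss). The network, the values and the function names are all illustrative and not taken from the challenge. The backward pass reuses intermediate gradients exactly as described above, and a finite-difference check confirms the result.

```python
import math

def forward(x, w1, w2, target):
    """Forward pass: two chained nodes followed by a squared-error loss."""
    h = math.tanh(w1 * x)       # first node
    y = w2 * h                  # second node
    loss = (y - target) ** 2    # J
    return h, y, loss

def backward(x, w1, w2, target):
    """Backward pass: apply the chain rule from the loss back to each input."""
    h, y, _ = forward(x, w1, w2, target)
    dJ_dy = 2 * (y - target)        # local gradient of the loss
    dJ_dw2 = dJ_dy * h              # because y = w2 * h
    dJ_dh = dJ_dy * w2              # computed once, reused below
    dJ_dz = dJ_dh * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2, with z = w1 * x
    return {"w1": dJ_dz * x, "w2": dJ_dw2, "x": dJ_dz * w1}

def numeric_grad(name, x, w1, w2, target, step=1e-6):
    """Central finite difference, used only to sanity-check the chain rule."""
    hi = {"x": x, "w1": w1, "w2": w2}
    lo = dict(hi)
    hi[name] += step
    lo[name] -= step
    return (forward(target=target, **hi)[2] - forward(target=target, **lo)[2]) / (2 * step)

grads = backward(x=0.5, w1=0.8, w2=-1.2, target=1.0)
for name, value in grads.items():
    print(name, value, numeric_grad(name, 0.5, 0.8, -1.2, 1.0))
```

Note that the same backward pass yields the gradient with respect to the input `x` as well as the weights, which is what the attack later relies on.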

Gradient-Based Attacks

The Fast Gradient Sign Method (FGSM) is a gradient-based attack that applies a small perturbation to the original data, in the direction of increasing loss.

Intuitively, we are "nudging" the input in the "wrong direction", causing the model to make less accurate predictions.

Exploitation

Back to the CTF challenge! We will implement the FGSM attack for this challenge.

First, we process the image, converting it to a PyTorch tensor and normalizing it.

We then perform a forward pass. The model predicts the class of the original image.

Now that we have the gradients, we want to calculate the adversarial example as follows:

In this case, I found that `eps = 0.02` worked well enough (the perturbations are small enough that the two images are similar, and the increase in loss is significant enough that the model misclassifies the image).

We can then predict on the generated adversarial image to validate our results.
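The steps above (forward pass, loss, backward pass, perturbation, re-prediction) can be sketched end to end in PyTorch. This is a toy reconstruction under loud assumptions: a tiny `nn.Linear` with hand-picked weights stands in for the ResNet-18, the input is a 4-value vector instead of an image, `fgsm_example` is a hypothetical name rather than my actual `get_adverserial_example()`, and `EPS = 0.1` is chosen for the toy model rather than the `0.02` used in the challenge.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EPS = 0.1  # toy value; the challenge solve used a smaller epsilon on real images

def fgsm_example(model, x, y_true, eps):
    """Return x + eps * sign(grad_x J(x, y_true)) for a classifier `model`."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_true)   # forward pass + loss
    model.zero_grad()
    loss.backward()                            # backward pass fills x.grad
    return (x + eps * x.grad.sign()).detach()  # step in the direction of increasing loss

# Toy stand-in for the classifier: 4 input features, 2 classes, fixed weights.
model = nn.Linear(4, 2)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[1.0, -1.0, 0.5, 0.0],
                                     [-1.0, 1.0, -0.5, 0.0]]))
    model.bias.zero_()
model.eval()

x = torch.tensor([[0.1, 0.0, 0.1, 0.3]])
y_true = torch.tensor([0])

x_adv = fgsm_example(model, x, y_true, EPS)
pred_before = model(x).argmax(dim=1).item()
pred_after = model(x_adv).argmax(dim=1).item()
print(pred_before, pred_after)
```

With these hand-picked weights, the prediction flips from class 0 to class 1 even though no input value moves by more than `EPS`.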

Let's visualize our results!

Great! After applying the perturbation, the model now thinks that the image is a dog.

Let's complete our `get_adverserial_example()` function by saving the adversarial example as `sol.png`.

All that's left now is the driver code for solving the CTF!

I used Python's requests library to automate the downloading of the original images and the uploading of the adversarial examples.

The attack should succeed quite reliably!

After 50 successful misclassifications, we get the flag.

Full Solver Script

Postface

I'm done with the CTF writeup, and at this point, I'm just writing/rambling out of passion. I've never learnt about adversarial attacks before, so this is all very new and cool to me - if you're like me and want to know more, feel free to read on!

Why Should I Care?

This was just a CTF challenge, but there are plenty of real-life examples that highlight the severity of adversarial attacks.

For instance, these adversarial examples involve printed color stickers on road signs to fool DNN models used by self-driving cars - imagine causing an accident by simply placing a few stickers on stop signs!

One might be tempted to use a "person detector" in a physical intrusion detection mechanism. But as this paper shows, such models can be easily fooled by the person's clothing.

Bugs or Features?

Adversarial attacks and their defences are still a very active research topic. One paper argues that "Adversarial Examples Aren't Bugs, They're Features" - in brief, the researchers showed that the "non-robust" features imperceptible to humans might not be unnatural and meaningless, and are just as useful as perceptible "robust" ones in maximizing test-set accuracy.

When we make a small adversarial perturbation, we do not significantly affect the robust features, but flip the non-robust features. Since the model has no reason to prefer robust features over non-robust features, these seemingly small changes have a significant impact on the resulting output. The researchers also found that when non-robust features are removed from the training set, robust models can be obtained with standard training.

Imagine an alien with no *human* concept of "similarity". It might be confused about why the original and final images should be identically classified. Remember, this alien perceives images in a completely different way from how humans do - it would spot patterns that humans are oblivious to, yet are extremely predictive of the image's class.

It is thus argued that "adversarial examples" are a purely human phenomenon - without any context about the physical world and human-related concepts of similarity, both robust and non-robust features should appear equally valid to a model. After all, what is "robust" and "non-robust" is purely considered from the human point of view - a model does not know to prioritize human-perceivable features over non-human-perceivable ones.

This is a really interesting perspective - if robustness is an inherent property of the dataset itself, then the solution to achieving human-meaningful outcomes fundamentally stems from eliminating non-robust features during training.

Nodes at each layer connect to nodes at the next layer. Each node represents some function $f(x, w)$, where $x$ represents the input features and $w$ represents the **weight** of a node connection.

When neural networks make a prediction, a **forward pass** is performed. This simply means calculating the output, $y=f(x,w)$, of each node. At the end of the forward pass, a **loss function** $J$ calculates the error between our predicted output and the actual output.

This allows us to optimize the weights by performing gradient descent. Intuitively, we want to find an optimal weight $w$ so that $J(w)$ is minimized. To do this, we use the gradient calculated above - if $\frac{\partial{J}}{\partial{w}} <0$, increasing $w$ lowers the loss, so we are going in the right direction. Otherwise, we have "overshot" our goal.
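As a small illustration of gradient descent, assume a made-up one-weight loss $J(w) = (w - 3)^2$ (not the challenge model's loss). Each step moves $w$ against the sign of the gradient, so a negative gradient pushes $w$ up and a positive one pulls it back down.

```python
def loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w):
    """dJ/dw for the toy loss."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

w_final = gradient_descent(w=0.0)
print(w_final, loss(w_final))
```

Starting from `w = 0.0`, the gradient is negative, so the updates increase `w` until it settles at the minimum.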

What if, when backpropagating, instead of treating $w$ as the variable we want to optimize, we look at the input $x$ instead? Instead of minimizing the loss by adjusting the weights based on the backpropagated gradients, the attack **adjusts the input data to maximize the loss** based on the same backpropagated gradients.

Next, we calculate the loss $J(x, y_{true})$, where $y_{true}$ corresponds to the ground truth (the label of 'horse'). Performing a backward pass then calculates the gradient of the loss with respect to each variable, including the input $x$.

$x_{adv}=x+\epsilon \text{sign}(\nabla_xJ(x,y_{true}))$

This applies a perturbation of magnitude $\epsilon$ in the direction of increasing loss.
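The formula can be checked by hand on a model small enough to differentiate on paper. The sketch below assumes a toy logistic-regression classifier with made-up weights (not the challenge's ResNet-18), for which the input gradient of the cross-entropy loss is simply $(p - y_{true})\,w$.

```python
import math

def predict_proba(x, w, b):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss(x, y_true, w, b):
    """Binary cross-entropy J(x, y_true)."""
    p = predict_proba(x, w, b)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(x, y_true, w, b, eps):
    """x_adv = x + eps * sign(grad_x J(x, y_true))."""
    p = predict_proba(x, w, b)
    grad_x = [(p - y_true) * wi for wi in w]   # analytic input gradient
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]

w, b = [2.0, -3.0, 1.0], 0.0      # illustrative weights
x, y_true, eps = [0.2, -0.1, 0.1], 1, 0.2

x_adv = fgsm(x, y_true, w, b, eps)
label_before = int(predict_proba(x, w, b) > 0.5)
label_after = int(predict_proba(x_adv, w, b) > 0.5)
print(label_before, label_after, loss(x, y_true, w, b), loss(x_adv, y_true, w, b))
```

Every coordinate moves by exactly $\epsilon$, the loss goes up, and the predicted label flips from 1 to 0.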

The higher we set $\epsilon$, the less accurate the model will be **but** the perturbations become more easily perceptible. There is, therefore, a trade-off to consider between the relative closeness of the original and perturbed image, and the degree of accuracy degradation we can cause in the model's predictions.
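One way to explore this trade-off is to sweep a few candidate values and keep the smallest $\epsilon$ that still flips the prediction. The sketch below does this on a hypothetical logistic-regression model with made-up weights and candidates. It is only an illustration of the idea, not how I arrived at the value used in the challenge.

```python
import math

W = [2.0, -3.0, 1.0]      # illustrative weights, not the challenge model
X = [0.2, -0.1, 0.1]      # illustrative input, correctly classified as class 1
Y_TRUE = 1

def prob(x):
    """Probability of class 1 under the toy model."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, eps):
    """Move each coordinate by eps in the direction that increases the loss."""
    p = prob(x)
    grad = [(p - Y_TRUE) * w for w in W]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

def smallest_flipping_eps(candidates):
    """Return the smallest candidate eps that changes the predicted class."""
    for eps in sorted(candidates):
        if prob(fgsm(X, eps)) < 0.5:
            return eps
    return None

best_eps = smallest_flipping_eps([0.05, 0.1, 0.15, 0.2])
print(best_eps)
```

Anything larger than the returned value also fools this toy model, but only at the cost of a more visible perturbation.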