My Challenges

2021

pwnies_please

Adversarial attack on an image classifier using the Fast Gradient Sign Method (FGSM)

Description

Disguise these pwnies to get the flag!

http://pwnies-please.chal.uiuc.tf

note: first solve gets $100 from ian (unintended solves don't count)

Preface

Soon after the challenge was released, my teammate rainbowpigeon told me to take a look at it since it was an image classification AI challenge and I have a fair bit of experience with computer vision tasks.

I didn't have any prior experience in attacking AI models, but this turned out to be a really fun task. I ended up getting the $100 bounty for the first solve on this challenge (thanks Ian!).

I learnt a lot about how machine learning models can be vulnerable to adversarial attacks, and hopefully you can learn something from reading my writeup too!

Solution

The premise of the challenge was simple - we had to upload images to fool an image classification model, causing it to make inaccurate classifications.

Source Code Analysis

The "bouncer" is a ResNet-18 image classification model that classifies a given image as one of 10 classes ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'). This is the **non-robust** model that we have to fool, and we are given the model weights.

Another **robust** model, also using the ResNet-18 architecture, is used. This is meant to be the more accurate model, and serves as the ground truth.

```python
# ------------------ Model goes here ------------------ #
imagenet_class_index = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_nonrobust = models.resnet18()
num_ftrs = model_nonrobust.fc.in_features
model_nonrobust.fc = nn.Linear(num_ftrs, len(imagenet_class_index))
model_nonrobust.load_state_dict(torch.load("./models/pwny_cifar_eps_0.pth", map_location = device))
model_ft = model_nonrobust.to(device)
model_nonrobust.eval()

model_robust = models.resnet18()
num_ftrs = model_robust.fc.in_features
model_robust.fc = nn.Linear(num_ftrs, len(imagenet_class_index))
model_robust.load_state_dict(torch.load("./models/pwny_cifar_eps_0.5.pth", map_location = device))
model_ft = model_robust.to(device)
model_robust.eval()

image_set = torchvision.datasets.CIFAR10(root='static/images', train=False, download=True)
# ------------------ Model goes here ------------------ #
```

The objective is to generate **adversarial examples** to fool the non-robust model into misclassifying the image as anything but a horse, while maintaining the "actual" class of the image so that the robust model still classifies the image as a horse.

```python
nonrobust = get_prediction(image_bytes=img_bytes, model = model_nonrobust, curr_image = session['img'])
robust = get_prediction(image_bytes=img_bytes, model = model_robust, curr_image = session['img'])
# robust model is the "ground truth", non-robust is the "bouncer"
# cases:
# bouncer does not want to let in horses, you want to let them in anyway
# robust says horse, non-robust says horse: you have been detected
# robust says not horse, non-robust says horse: you fail extra hard
# robust says horse, non-robust says not horse: flag
# robust says not horse, non-robust says not horse: they were let in but you didn't achieve the goal
regen_image = True
if robust != 'horse':
    response = "you snuck SOMETHING into the club but it wasn't a pwny (changed too much, ground truth thinks image is a: robust {}\tnonrobust {})".format(robust, nonrobust)
    session['yolo'] += 1
elif robust == 'horse' and nonrobust != 'horse':
    session['level'] = session['level'] + 1
    session['yolo'] = 0
    response = "success! the bouncer thought your horse was a: {}".format(nonrobust)
    # response = "robust = {}, nonrobust = {}".format(robust, nonrobust)
else: # robust == 'horse' and nonrobust == 'horse':
    response = "bouncer saw through your disguise. bouncer: rules say \"NO HORSEPLAY\""
    session['yolo'] += 1
    # response += "\nrobust {}\tnonrobust {}".format(robust, nonrobust)
    # this is the most common fail condition

if session['yolo'] > 3:
    session['yolo'] = 0
    session['level'] = 0
    response = "bouncer smacks you and you pass out, start over :)"
```

Every time we fool the model successfully, our "level" goes up by 1. More than three consecutive failed attempts, however, will set our "level" back to 0. We need to fool the model successfully 50 times, within 5 minutes.

```python
MIN_LEVEL = 50
SESSION_MINUTES = 5

...

if session['level'] >= MIN_LEVEL:
    response = FLAG
```

Additionally, `imagehash` is used to compare the relative closeness of the submitted image to the original image. The goal is to make *tiny* changes to the original image, so that the non-robust model misclassifies the modified image.

```python
# Use imagehash to compare relative closeness of image (can't just allow random images to be thrown at the model...)
def get_prediction(image_bytes, model, curr_image = None):
    inputs = transform_image(image_bytes=image_bytes)
    outputs = model(inputs)
    preds = torch.argmax(outputs, 1)
    original = Image.open(io.BytesIO(base64.b64decode(curr_image)))

    # "where the magic happens" - akshunna
    input_image = Image.open(io.BytesIO(image_bytes))
    hash_orig = imagehash.average_hash(original)
    hash_input = imagehash.average_hash(input_image)

    # currently HASH_DIFFERENCE is 5
    # is number of bits changed in the hash
    # hash is 64 bits long
    if hash_orig - hash_input < HASH_DIFFERENCE:
        return imagenet_class_index[preds]
    else:
        return "IMAGE WAS TOO DIFFERENT"
```
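To see why small pixel-level perturbations can slip under this check: an average hash downscales the image to an 8×8 grayscale grid, thresholds each cell against the mean, and packs the 64 resulting bits; subtracting two hashes gives the Hamming distance. Here is a minimal pure-Python sketch of that idea (a conceptual toy, not the actual `imagehash` implementation, which also handles the grayscale conversion and resizing):

```python
def average_hash(pixels):
    # pixels: an 8x8 grid of grayscale values (0-255)
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    # threshold each pixel against the mean -> 64-bit hash
    return [1 if p > mean else 0 for p in flat]

def hash_distance(h1, h2):
    # Hamming distance: number of differing bits
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))

# two nearly-identical 8x8 "images": only one pixel changed
img_a = [[10] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]
img_b = [row[:] for row in img_a]
img_b[0][0] = 200  # flip one dark pixel to bright

d = hash_distance(average_hash(img_a), average_hash(img_b))
print(d)  # -> 1, well under a HASH_DIFFERENCE threshold of 5
```

Because the hash only keeps coarse brightness structure, a perturbation that nudges every pixel slightly (as FGSM does) barely moves the hash, while still being enough to flip the classifier's decision.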

Backpropagation in Machine Learning

To understand the attack, we need a bit of machine learning theory. Neural networks are loosely inspired by the human brain. Like how the human brain is made up of neurons, neural networks are made up of **nodes**. These nodes span across multiple **layers**, starting from the input layer and ending at the output layer.

Nodes at each layer connect to nodes at the next layer. Each node represents some function $f(x, w)$, where $x$ represents the input features and $w$ represents the **weight** of a node connection. The "learning" takes place when the weights are updated, thus placing different priorities on different connections. Intuitively, these weights are what determines how much influence a particular input feature has on the final output.

When neural networks make a prediction, a **forward pass** is performed. This simply means calculating the output $y=f(x,w)$ of each node. At the end of the forward pass, a **loss function** $J$ calculates the error between our predicted output and the actual output. But in order for the model to learn, a **backward pass**, or **backpropagation**, must be performed. This might seem complicated, but it really isn't - it's just the chain rule!

Using the chain rule, we calculate the sensitivity of the loss to each of the inputs. This is repeated (backpropagated) through each node in the network. It might help to look at this as an optimization problem where the chain rule and memoization are used to save work calculating each of the local gradients.
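To make this concrete, here is a toy single-weight "network" (made-up numbers, not from the challenge) showing the chain rule at work, followed by one gradient-descent step:

```python
# Toy "network" with one weight: y = w * x, squared-error loss J = (y - t)^2.
# Backpropagation is just the chain rule: dJ/dw = dJ/dy * dy/dw.
x, w, t = 3.0, 2.0, 9.0       # input, weight, target

y = w * x                     # forward pass: y = 6.0
J = (y - t) ** 2              # loss: (6 - 9)^2 = 9.0

dJ_dy = 2 * (y - t)           # local gradient at the loss node: -6.0
dy_dw = x                     # local gradient of y w.r.t. the weight: 3.0
dJ_dw = dJ_dy * dy_dw         # chain rule: -18.0 (negative -> increase w)

w_new = w - 0.1 * dJ_dw       # one gradient-descent step: w becomes 3.8
J_new = (w_new * x - t) ** 2  # loss drops from 9.0 to (11.4 - 9)^2 = 5.76
```

In a real network the same local-gradient multiplication is repeated backwards through every layer, with memoization so each gradient is computed only once.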

This allows us to optimize the weights by performing a gradient descent. Intuitively, we want to find an optimal weight $w$ so that $J(w)$ is minimized. To do this, we use the gradient calculated above - if $\frac{\partial{J}}{\partial{w}} < 0$, we are going in the right direction. Otherwise, we have "overshot" our goal.

Gradient-Based Attacks

What if, when backpropagating, instead of treating $w$ as the variable we want to optimize, we look at the input $x$ instead? Instead of minimizing the loss by adjusting the weights based on the backpropagated gradients, the attack **adjusts the input data to maximize the loss** based on the same backpropagated gradients. The Fast Gradient Sign Method (FGSM) does this by applying a small perturbation to the original data, in the direction of increasing loss.

Intuitively, we are "nudging" the input in the "wrong direction", causing the model to make less accurate predictions.
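A minimal numeric sketch of this "nudge" (using a toy 1-D model $y = wx$ with squared-error loss, not the challenge's ResNet): a single FGSM step on the input increases the loss.

```python
# FGSM on a toy model y = w * x with loss J = (w*x - t)^2.
# The attack perturbs the INPUT x by eps in the sign of dJ/dx.
def loss(x, w=2.0, t=9.0):
    return (w * x - t) ** 2

x, w, t = 3.0, 2.0, 9.0
dJ_dx = 2 * (w * x - t) * w          # chain rule w.r.t. the input: -12.0
sign = 1.0 if dJ_dx >= 0 else -1.0   # sign(gradient)

eps = 0.5                            # perturbation budget
x_adv = x + eps * sign               # x_adv = 2.5: a small nudge...
print(loss(x), loss(x_adv))          # ...but the loss jumps from 9.0 to 16.0
```

Note the contrast with training: gradient descent moves the *weights* against the gradient to decrease the loss, while FGSM moves the *input* along the gradient's sign to increase it.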

Exploitation

Back to the CTF challenge! We will implement the FGSM attack for this challenge.

First, we process the image, converting it to a PyTorch tensor and normalizing it.

```python
def get_adverserial_example(original_image):
    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(
            [0.485, 0.456, 0.406],
            [0.229, 0.224, 0.225])])

    img = Image.open(original_image)
    image_tensor = preprocess(img)
    image_tensor = image_tensor.unsqueeze(0)
    img_variable = Variable(image_tensor, requires_grad=True)
```

We first perform a forward pass. The model predicts the class of the original image.

```python
output = model_nonrobust.forward(img_variable)
label_idx = torch.max(output.data, 1)[1][0]

x_pred = imagenet_class_index[label_idx]

output_probs = F.softmax(output, dim=1)
x_pred_prob = (torch.max(output_probs.data, 1)[0][0]) * 100
```

Next, we calculate the loss $J(x, y_{true})$, where $y_{true}$ corresponds to the ground truth (the label of 'horse'). Performing a backward pass then calculates the gradient of each variable.

```python
y_true = 7  # index of 'horse' in imagenet_class_index
target = Variable(torch.LongTensor([y_true]), requires_grad=False)

loss = torch.nn.CrossEntropyLoss()
loss_cal = loss(output, target)
loss_cal.backward(retain_graph=True)
```

Now that we have the gradients, we want to calculate the adversarial example as follows:

$x_{adv}=x+\epsilon \text{sign}(\nabla_xJ(x,y_{true}))$

This applies a perturbation of magnitude $\epsilon$ in the direction of increasing loss. The higher we set $\epsilon$, the less accurate the model will be, **but** the perturbations become more easily perceptible. There is, therefore, a trade-off to consider between the relative closeness of the original and perturbed image, and the degree of accuracy degradation we can cause in the model's predictions.

In this case, I found that `eps = 0.02` worked well enough (perturbations are small enough that the two images are similar, and the loss is significant enough that the model misclassifies the results).

```python
eps = 0.02
x_grad = torch.sign(img_variable.grad.data)
x_adversarial = img_variable.data + eps * x_grad
```

We can then predict on the generated adversarial image to validate our results.

```python
output_adv = model_nonrobust.forward(Variable(x_adversarial))
x_adv_pred = imagenet_class_index[torch.max(output_adv.data, 1)[1][0]]
op_adv_probs = F.softmax(output_adv, dim=1)
adv_pred_prob = (torch.max(op_adv_probs.data, 1)[0][0]) * 100
```

Let's visualize our results!

```python
def visualize(x, x_adv, x_grad, epsilon, clean_pred, adv_pred, clean_prob, adv_prob):
    x = x.squeeze(0)  # remove batch dimension: B X C X H X W ==> C X H X W
    x = x.mul(torch.FloatTensor(std).view(3,1,1)).add(torch.FloatTensor(mean).view(3,1,1)).numpy()  # reverse of normalization op - "unnormalize"
    x = np.transpose(x, (1,2,0))  # C X H X W ==> H X W X C
    x = np.clip(x, 0, 1)

    x_adv = x_adv.squeeze(0)
    x_adv = x_adv.mul(torch.FloatTensor(std).view(3,1,1)).add(torch.FloatTensor(mean).view(3,1,1)).numpy()  # reverse of normalization op
    x_adv = np.transpose(x_adv, (1,2,0))  # C X H X W ==> H X W X C
    x_adv = np.clip(x_adv, 0, 1)

    x_grad = x_grad.squeeze(0).numpy()
    x_grad = np.transpose(x_grad, (1,2,0))
    x_grad = np.clip(x_grad, 0, 1)

    figure, ax = plt.subplots(1, 3, figsize=(18,8))
    ax[0].imshow(x)
    ax[0].set_title('Clean Example', fontsize=20)

    ax[1].imshow(x_grad)
    ax[1].set_title('Perturbation', fontsize=20)
    ax[1].set_yticklabels([])
    ax[1].set_xticklabels([])
    ax[1].set_xticks([])
    ax[1].set_yticks([])

    ax[2].imshow(x_adv)
    ax[2].set_title('Adversarial Example', fontsize=20)

    ax[0].axis('off')
    ax[2].axis('off')

    ax[0].text(1.1, 0.5, "+{}*".format(round(epsilon, 3)), size=15, ha="center",
               transform=ax[0].transAxes)

    ax[0].text(0.5, -0.13, "Prediction: {}\n Probability: {}".format(clean_pred, clean_prob), size=15, ha="center",
               transform=ax[0].transAxes)

    ax[1].text(1.1, 0.5, " = ", size=15, ha="center", transform=ax[1].transAxes)

    ax[2].text(0.5, -0.13, "Prediction: {}\n Probability: {}".format(adv_pred, adv_prob), size=15, ha="center",
               transform=ax[2].transAxes)

    plt.show()
```

Great! After applying the perturbation, the model now thinks that the image is a dog.

Let's complete our `get_adverserial_example()` function by saving the adversarial example as `sol.png`.

```python
x_adv = x_adversarial
x_adv = x_adv.squeeze(0)
x_adv = x_adv.mul(torch.FloatTensor(std).view(3,1,1)).add(torch.FloatTensor(mean).view(3,1,1)).numpy()  # reverse of normalization op
x_adv = np.transpose(x_adv, (1,2,0))  # C X H X W ==> H X W X C
x_adv = np.clip(x_adv, 0, 1)

plt.imsave('sol.png', x_adv)

test_image = Image.open('sol.png').convert('RGB')
test_image.save('sol.png')
```

All that's left now is the driver code for solving the CTF!

I used Python's requests library to automate the downloading of the original images and the uploading of the adversarial examples.

```python
def main():
    s = requests.session()
    r = s.get('http://pwnies-please.chal.uiuc.tf/')
    print('Cookies:', r.cookies)

    curr_level = 0
    fail_count = 0
    while curr_level < 50 and 'uiuctf' not in r.text:
        print('Current Level:', curr_level)

        match = re.search('<img class="show" src="data:image/png;base64,(.+)"/>', r.text)
        img_data = base64.b64decode(match[1])
        filename = 'original_img.png'

        with open(filename, 'wb') as f:
            f.write(img_data)

        get_adverserial_example(filename)

        files = {'file': open('sol.png','rb')}

        r = s.post('http://pwnies-please.chal.uiuc.tf/?', files=files)

        if 'success' in r.text:
            print('[+] Success')
            curr_level += 1
            fail_count = 0
        else:
            print('[-] Failure')
            fail_count += 1

        if fail_count > 3:
            curr_level = 0
            fail_count = 0

    print('[+] Attack successful!')
    r = s.get('http://pwnies-please.chal.uiuc.tf/')
    print(r.text)

main()
```

The attack should succeed quite reliably!

After 50 successful misclassifications, we get the flag.

Full Solver Script

Postface

I'm done with the CTF writeup, and at this point, I'm just writing/rambling out of passion. I've never learnt about adversarial attacks before, so this is all very new and cool to me - if you're like me and want to know more, feel free to read on!

Why Should I Care?

This was just a CTF challenge, but there are plenty of real-life examples that highlight the severity of adversarial attacks.

For instance, these adversarial examples involve printed color stickers on road signs to fool DNN models used by self-driving cars - imagine causing an accident by simply placing a few stickers on stop signs!

One might be tempted to use a "person detector" in a physical intrusion detection mechanism. But as this paper shows, such models can be easily fooled by the person's clothing.

Bugs or Features?

Adversarial attacks and their defences are still a very active research topic. One paper argues that "Adversarial Examples Aren't Bugs, They're Features" - in brief, the researchers showed that the "non-robust" features imperceptible to humans might not be unnatural and meaningless, and are just as useful as perceptible "robust" ones in maximizing test-set accuracy.

When we make a small adversarial perturbation, we do not significantly affect the robust features, but flip the non-robust features. Since the model has no reason to prefer robust features over non-robust features, these seemingly small changes have a significant impact on the resulting output. When non-robust features are removed from the training set, it was found that robust models can be obtained with standard training.

Imagine an alien with no *human* concept of "similarity" - it might be confused as to why the original and final images should be identically classified. Remember, this alien perceives images in a completely different way from how humans do: it would spot patterns that humans are oblivious to, yet are extremely predictive of the image's class.

It is thus argued that "adversarial examples" is a purely human phenomenon - without any context about the physical world and human-related concepts of similarity, both robust and non-robust features should appear equally valid to a model. After all, what is "robust" and "non-robust" is purely considered from the human point of view - a model does not know to prioritize human-perceivable features over non-human-perceivable ones.

This is a really interesting perspective - if robustness is an inherent property of the dataset itself, then the solution to achieving human-meaningful outcomes fundamentally stems from eliminating non-robust features during training.
