- Overview of OWASP Top 10 ML & LLM Security Checklist
- Understanding Attack Surfaces in AI Systems
- Adversarial Attacks
- ML01:2023 - Input Manipulation Attack
- ML08:2023 - Model Skewing
- ML07:2023 - Transfer Learning Attack
- ML09:2023 - Output Integrity Attack
Github Link: https://github.com/RihaMaheshwari/AIML-LLM-Security/
Hey Techies! Input Manipulation Attack is one of the vulnerabilities in the OWASP Top 10 for Machine Learning, and today we'll dig deeper into it. In this blog, we'll walk through a basic understanding of how an Input Manipulation Attack is performed on an ML model, along with a practical demonstration. But before diving in, we need to get clear on the basics of Input Manipulation.
Note: If you haven't explored Adversarial Attacks yet, check out my previous blog [Link].
What is an Input Manipulation Attack?
Input Manipulation Attacks in AI/ML refer to adversarial techniques where an attacker alters input data to cause a machine learning model to make incorrect predictions. These modifications are often subtle enough to be imperceptible to human observers but can completely fool the ML model. The attack can involve injecting malicious inputs, modifying feature values, or crafting adversarial examples that mislead the model.
Impact
The impact of Input Manipulation Attacks can be significant, leading to compromised model performance, security vulnerabilities, and faulty decision-making. Attackers can manipulate inputs to bypass detection in security systems, triggering false negatives in threat detection models. In high-stakes applications like fraud detection or autonomous systems, such attacks could result in financial losses, safety risks, or unauthorized access. Additionally, these manipulated inputs can introduce biases, undermining trust in AI-driven decisions and potentially affecting broader systems and users.
Scenario #1: Input Manipulation of Image Classification System
Imagine a deep learning model trained to recognize different objects, like apples and oranges. When you show it a picture of an orange, it correctly recognizes the orange. Now, as an attacker, you want the model's predictions to be wrong.
An attacker would slightly change a picture of an apple in a way that's barely noticeable to humans. When the altered picture is uploaded, these tiny modifications trick the model into thinking the apple is actually an orange, causing it to make an incorrect prediction.
That's exactly what we’re going to see in this demo—although, just to be clear, we’re not using apples or oranges here!
Note: This is an example of a White-box Attack.
Demo
I’ve set up a website that allows users to upload an image. The application serves a pre-trained image classification model, MobileNetV2 (trained on ImageNet, so it recognizes everyday objects). However, there is no input validation or other security measure in place to prevent adversarial inputs.
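To make the setup concrete, here is a minimal sketch of what such an unprotected prediction endpoint might look like. This is an illustration assuming Flask and Keras, not the demo application's actual code, and names like the `/predict` route and the `image` form field are placeholders.

```python
# Hypothetical sketch of an unprotected image-classification endpoint.
import numpy as np
from flask import Flask, request, jsonify
from PIL import Image
from tensorflow.keras.applications import mobilenet_v2

app = Flask(__name__)
model = mobilenet_v2.MobileNetV2(weights="imagenet")  # public ImageNet weights

@app.route("/predict", methods=["POST"])
def predict():
    # The uploaded file goes straight to the model -- no checks for
    # adversarial perturbations or otherwise anomalous inputs.
    img = Image.open(request.files["image"].stream).convert("RGB").resize((224, 224))
    x = mobilenet_v2.preprocess_input(np.array(img, dtype=np.float32)[np.newaxis, ...])
    top = mobilenet_v2.decode_predictions(model.predict(x), top=1)[0][0]
    return jsonify({"label": top[1], "confidence": float(top[2])})

if __name__ == "__main__":
    app.run(port=5000)
```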
Step 1: First, navigate to the website. In my case, it's `http://jagskap:5000/` as shown in Exhibit 1.
Exhibit 1
Step 2: I downloaded an image of an umbrella from the internet. Let’s upload it now and check if the model can predict that it’s an umbrella. As shown in Exhibit 2, the model shows that it's 99% confident that the uploaded image is an umbrella.
Exhibit 2
Step 3: In this step, we’ll carry out an adversarial attack. The goal is to modify the image of the umbrella so that it still looks like an umbrella to us, but with subtle changes that trick the model into misclassifying it. To do this, we’ll use an adversarial tool, specifically the FGSM (Fast Gradient Sign Method) technique.
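At its core, FGSM is a one-line formula (this is the standard formulation of the technique, not code taken from this demo): given an input image x, its label y, the model parameters θ, and the loss function J, the adversarial image is

x_adv = x + ε · sign(∇_x J(θ, x, y))

where ε (epsilon) controls how strong, and how visible, the perturbation is. Everything that follows is essentially this formula applied against the deployed model.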
Since this is a white-box attack, the attacker would need the following:
- Model Architecture: The attacker needs to know which model is used (in this case, MobileNetV2) and how it processes data, including its layers and structure.
- Model Weights: The attacker must have access to the pre-trained weights (from ImageNet, in this case) to compute gradients.
- Input Preprocessing: The attacker must apply the same preprocessing steps (such as resizing and normalization) as the model (i.e., mobilenet_v2.preprocess_input).
- Loss Function: The attacker needs to understand the loss function (e.g., sparse_categorical_crossentropy) in order to calculate the gradient for the adversarial attack.
- Target Label: The attacker supplies the label used in the loss calculation (in this demo, the "umbrella" class the model currently predicts), so the perturbation pushes the model's prediction away from it.
To generate an adversarial image using the FGSM (Fast Gradient Sign Method):
- Fetch Image: The attacker sends an image through the website to get the model's prediction.
- Calculate Gradients: Using tf.GradientTape(), the attacker computes the loss and its gradient with respect to the input image.
- Apply Perturbation: The attacker modifies the image by adding a small noise vector, based on the gradient, controlled by epsilon.
- Generate Adversarial Image: The modified image is saved, and it causes the model to make an incorrect prediction.
Here’s the kind of code an attacker might write to create an adversarial example from the fetched image:
Exhibit 3
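If the exhibit is hard to read, here is a minimal FGSM sketch along the same lines. It assumes TensorFlow/Keras with the public ImageNet weights for MobileNetV2, and a local copy of the umbrella picture saved as `umbrella.jpg`; the filename and the epsilon value are illustrative choices, not taken from the demo code.

```python
import numpy as np
import tensorflow as tf
from PIL import Image
from tensorflow.keras.applications import mobilenet_v2

# Load the same model the website serves (white-box assumption).
model = mobilenet_v2.MobileNetV2(weights="imagenet")

# Load and preprocess the image the same way the server does.
raw = Image.open("umbrella.jpg").convert("RGB").resize((224, 224))
image = mobilenet_v2.preprocess_input(np.array(raw, dtype=np.float32)[np.newaxis, ...])
image = tf.convert_to_tensor(image)

# Use the model's own predicted class as the label to move away from.
label = tf.argmax(model(image), axis=1)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# Compute the gradient of the loss with respect to the input pixels.
with tf.GradientTape() as tape:
    tape.watch(image)
    loss = loss_fn(label, model(image))
gradient = tape.gradient(loss, image)

# FGSM: nudge every pixel in the direction of the gradient's sign.
epsilon = 0.05  # illustrative perturbation strength
adversarial = tf.clip_by_value(image + epsilon * tf.sign(gradient), -1.0, 1.0)

# Undo MobileNetV2 preprocessing ([-1, 1] -> [0, 255]) and save the result.
pixels = ((adversarial[0].numpy() + 1.0) * 127.5).astype(np.uint8)
Image.fromarray(pixels).save("adversarial_image.jpg")
```

Running this writes `adversarial_image.jpg` to disk, which is the file we upload in the next step.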
Step 4: Now, let’s run the code, and we should see an adversarial_image.jpg generated, shown in Exhibit 4.
Exhibit 4
Step 5: Now, let’s upload this adversarial image to the model and check the response.
Exhibit 5
Bingo!! As you can see in Exhibit 5, the model failed to correctly identify the image after the adversarial attack. This demonstrates how subtle changes to an image can mislead even a well-trained machine learning model.
Defense Strategies
To mitigate the risk of adversarial attacks in this context, several defense techniques can be employed:
- Adversarial Training: Incorporate adversarial examples into the training process so the model learns to be more robust to such attacks (see the sketch after this list).
- Input Validation: Implement methods to detect and filter out adversarial inputs before they reach the model (e.g., using image classifiers or statistical methods).
- Regularization: Use regularization techniques that help the model generalize better and become less sensitive to minor changes in the input.
- Model Distillation: Use techniques that "distill" the model into a simpler, more robust model that may be less vulnerable to adversarial examples.
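As an example of the first defense, here is a minimal adversarial-training sketch. It assumes a compiled Keras model with softmax outputs and integer labels, plus NumPy arrays `x_train` and `y_train`; all of these names are illustrative and not part of the demo.

```python
import numpy as np
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm_batch(model, x, y, epsilon=0.05):
    """Craft FGSM perturbations for a batch of already-preprocessed images."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x))
    return x + epsilon * tf.sign(tape.gradient(loss, x))

def train_step(model, x_batch, y_batch, epsilon=0.05):
    """One adversarial-training step: train on a mix of clean and adversarial data."""
    x_adv = fgsm_batch(model, x_batch, y_batch, epsilon).numpy()
    x_mix = np.concatenate([x_batch, x_adv])
    y_mix = np.concatenate([y_batch, y_batch])
    return model.train_on_batch(x_mix, y_mix)

# Example usage over one epoch of mini-batches (names are illustrative):
# for i in range(0, len(x_train), 32):
#     train_step(model, x_train[i:i+32], y_train[i:i+32])
```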
Conclusion
Ultimately, input manipulation attacks highlight the importance of securing machine learning models, particularly in production environments, to prevent malicious actors from exploiting these vulnerabilities.
I hope you found this blog insightful and enjoyable. If you did, please share it with others and leave a comment with your thoughts or questions!