AI/ML Security & Privacy Overview
This AI/ML Security & Privacy Laboratory provides educational resources and practical implementations of various attacks and defenses described in the CyBOK Security and Privacy of AI Knowledge Guide.
As machine learning systems become increasingly integrated into critical applications, understanding their security vulnerabilities is essential for building robust AI systems.
AI Security Landscape

Figure 1.1: Overview of AI/ML security threats, attack vectors, and impact areas.
The AI/ML security landscape represents a complex ecosystem where various threats target machine learning models throughout their lifecycle. From data collection to deployment, each stage presents unique vulnerabilities that malicious actors can exploit. These threats don't just compromise the models themselves but can lead to serious consequences including confidentiality breaches, integrity compromises, intellectual property theft, and privacy violations.
Machine learning security is characterized by a diverse range of attack vectors, each with different techniques and impacts. Evasion attacks manipulate input data to cause misclassification without altering the model itself. Poisoning attacks contaminate training data to degrade performance or introduce specific vulnerabilities. Backdoor attacks embed hidden patterns that can be triggered later to cause targeted misbehavior, while realizable attacks focus on implementing adversarial techniques in real-world, physical environments despite practical constraints.
ML Pipeline Vulnerabilities
- Data Collection & Preprocessing
- Model Training & Validation
- Model Deployment & Serving
- Inference & Decision-making
Threat Actors & Capabilities
- Perfect-Knowledge (transparent-box)
- Partial-Knowledge (translucent-box)
- Zero-Knowledge (opaque-box)
- Active vs. Passive Attacks
Security concerns affect the entire machine learning pipeline. During data collection and preprocessing, adversaries might poison datasets or manipulate labels. The training phase is vulnerable to backdoor injections and hyperparameter manipulation. Once deployed, models face model theft attempts, adversarial examples, and exploitation of transfer learning weaknesses. Even at inference time, attackers can manipulate confidence scores, exploit decision boundaries, or generate misleading explanations.
The severity of these threats depends significantly on the attacker's knowledge and capabilities. Transparent-box attackers possess complete information about the model and its training, representing the most dangerous scenario. Translucent-box attackers have partial knowledge, while opaque-box attackers must rely on observations of the model's outputs. Understanding these threat models is crucial for implementing appropriate defense strategies that address realistic adversarial capabilities.
Environment Setup
We've prepared a simple setup script to help you set up the complete environment for AI/ML security testing. This script will create all necessary directories and install the required dependencies to ensure a seamless experience with the lab environment.
Setup Requirements
No special setup is needed beyond a basic Python installation and internet connection. The script will handle installing all required libraries and creating the necessary directory structure.
Setup Steps
- Download the setup.sh file from the resources section below
- Open a terminal or command prompt
- Navigate to the directory where you saved the script
- Run the script: ./setup.sh
- Wait for the script to create directories and install dependencies
- Then run ./start_lab.sh to finish setting up the laboratory dependencies
- Begin exploring the AI/ML security testing environment!
Video Demonstration
Video demonstration of environment setup process
Need Help?
If you encounter any issues during setup, please check the troubleshooting guide or README file in the project repository and follow the instructions carefully to ensure a smooth setup process.
Students:
The AI/ML Security & Privacy Lab offers a hands-on environment where you can delve into the critical security and privacy challenges of AI/ML. We encourage you to use this lab's materials to learn and explore real-world attacks, analyse and modify code, practice ethical hacking, detect vulnerabilities, and implement effective mitigations. By actively engaging with these materials, you will strengthen AI/ML models against threats and develop the expertise to secure future AI systems.
Educators:
The AI/ML Security & Privacy Lab provides a comprehensive set of code, images, and documents to enrich your AI/ML security and privacy curriculum. We encourage you to adapt these resources (e.g., by removing sections of code for students to complete or tailoring scenarios to your course needs) to create engaging lectures and lab sessions. By leveraging our practical, interactive materials, you will convey complex concepts more effectively and inspire students through hands-on learning. Please contact the project's lead if you have any questions and/or suggestions.
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Attack Types
Machine learning models are vulnerable to various types of attacks targeting different aspects of the ML pipeline. Below are the main categories of attacks implemented in our laboratory:
Evasion Attacks
Adversarial examples that cause misclassification at inference time.
Poisoning Attacks
Corrupting training data to degrade model performance.
Backdoor Attacks
Embedding hidden functionality triggered by specific patterns.
Realizable Attacks
Physical-world attacks that respect real-world constraints.
Model Stealing
Extracting model functionality or architecture through API access.
Privacy Attacks
Extracting sensitive information from models or training data.
Machine learning models, despite their impressive capabilities, are susceptible to a wide range of security threats that exploit different stages of the ML lifecycle. These attacks can compromise model integrity, confidentiality, and availability. In our laboratory, we have implemented and analyzed several such attacks to understand their impact and develop effective countermeasures. The primary categories of attacks include evasion attacks, which manipulate inputs at inference time to induce incorrect predictions; poisoning attacks, which tamper with training data to corrupt the learning process; and backdoor attacks, which insert hidden behaviors triggered by specific inputs. Other attack types include realizable attacks, which are constrained by physical-world feasibility; model stealing, where adversaries aim to replicate or reverse-engineer the model via exposed APIs; and privacy attacks, which attempt to extract sensitive training data or attributes. The following sections will elaborate on each of these attacks in greater detail, highlighting their mechanisms, implications, and potential defenses.
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Demonstrations
Explore our video demonstrations showcasing various aspects of AI/ML security testing. Each video provides practical insights into the implementation and execution of different security techniques covered in this laboratory.
Lab Setup Demonstration
Step-by-step guide to setting up a complete AI/ML security testing laboratory with all required tools and configurations.
Attack Demonstration
Detailed walkthrough of evasion, poisoning, and backdoor attacks against machine learning models with practical examples.
Defense Demonstration
Implementation of defensive techniques including adversarial training, data sanitization, and differential privacy.
Environment Setup Demonstration
Configuring virtual environments, installing dependencies, and setting up development workflows for AI security testing.
AI-Driven Attack Demonstration
Showcasing advanced attacks that use AI to automatically generate adversarial examples and identify model vulnerabilities.
Pentesting Mitigation Toolkit Demonstration
A tutorial demonstrating how to use the AI/ML Security Penetration Testing Toolkit for identifying, exploiting, and mitigating vulnerabilities in machine learning systems.
Evasion Attacks
Evasion attacks occur during the inference phase of machine learning models, where adversaries craft specialized inputs that force misclassification despite appearing nearly identical to legitimate data. These attacks exploit vulnerabilities in the model's decision boundaries through optimized perturbations that typically remain imperceptible to human observers.
Evasion Attack Mechanism

Figure 2.1.1: Adversarial perturbations shift inputs across decision boundaries, causing systematic misclassification while maintaining visual similarity.
Evasion attacks are a class of test-time adversarial attacks that occur during the inference phase of machine learning models. In these attacks, adversaries craft subtle but deliberately manipulated inputs—known as adversarial examples—that are designed to force the model into making incorrect predictions. What makes evasion attacks particularly dangerous is that these perturbations are often imperceptible to the human eye, yet they exploit vulnerabilities in the model’s learned decision boundaries. This manipulation effectively causes the input to cross into a different classification region, resulting in systematic misclassification without noticeably altering the data from a human perspective.
The general mechanism behind evasion attacks involves finding a small perturbation δ to the original input x such that the model's prediction f(x + δ) differs from f(x), while keeping the perturbation small enough to avoid detection. Attackers typically aim to minimize the size of δ using various norms—L0, L2, or L∞—to constrain the nature of the perturbation. Different attack strategies have emerged based on this core idea. Notable among them is the Fast Gradient Sign Method (FGSM), a fast and simple approach that perturbs the input in the direction of the gradient of the loss function. By adding ε·sign(∇ₓL), FGSM creates an adversarial example that is computationally cheap to generate but may produce noticeable perturbations.
Fast Gradient Sign Method (FGSM)
FGSM represents a groundbreaking single-step attack approach that calculates optimal perturbations by analyzing the gradient of the loss function. This method revolutionized adversarial machine learning by demonstrating that surprisingly simple mathematical operations could deceive even sophisticated deep learning models. The attack's elegance lies in its efficiency—computing the sign of the gradient determines the direction that maximally increases the model's error while the epsilon parameter controls perturbation magnitude.

Figure 2.1.2: FGSM Attack Block Diagram.
Source: Created based on Goodfellow et al. (2015).
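To make the mechanism concrete, here is a minimal FGSM sketch in TensorFlow/Keras (one of the libraries listed in the setup requirements). The model, x, and y_true arguments are assumed to be a trained differentiable classifier, a batch of inputs scaled to [0, 1], and their one-hot labels; treat this as an illustrative sketch rather than the lab's reference implementation.

import tensorflow as tf

def fgsm_attack(model, x, y_true, epsilon=0.1):
    """Single-step FGSM: x_adv = x + epsilon * sign(grad_x L(f(x), y))."""
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y_true, model(x, training=False))
    gradient = tape.gradient(loss, x)
    # Step in the direction that maximally increases the loss, then keep pixels valid
    x_adv = x + epsilon * tf.sign(gradient)
    return tf.clip_by_value(x_adv, 0.0, 1.0)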
Projected Gradient Descent (PGD)
PGD extends FGSM into an iterative framework that progressively refines adversarial perturbations through multiple small steps. This sophisticated approach generates more robust adversarial examples by repeatedly applying gradient-based updates while projecting back onto a constrained perturbation space. The method balances between attack strength and perceptual similarity, making it particularly valuable for evaluating model robustness in realistic security scenarios.

Figure 2.1.3: PGD Attack Block Diagram.
Source: Created based on Madry et al. (2018).
Building upon FGSM, the Projected Gradient Descent (PGD) attack adds iterative refinement to increase the strength and precision of the attack. PGD repeatedly applies small gradient-based updates and projects the resulting input back into an ε-ball around the original input to ensure the perturbation remains bounded. This iterative process generates more robust adversarial examples with better success rates while maintaining perceptual similarity.
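A hedged sketch of this iterative scheme follows, again assuming a TensorFlow/Keras classifier with inputs in [0, 1]; alpha is the per-step size and epsilon the L∞ bound. It is a simplified illustration, not the notebook's exact code.

import tensorflow as tf

def pgd_attack(model, x, y_true, epsilon=0.1, alpha=0.01, steps=40):
    """Iterative FGSM steps projected back into the L-infinity epsilon-ball."""
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    # Random start inside the epsilon-ball improves attack strength
    x_adv = x + tf.random.uniform(tf.shape(x), -epsilon, epsilon)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_fn(y_true, model(x_adv, training=False))
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * tf.sign(grad)
        x_adv = tf.clip_by_value(x_adv, x - epsilon, x + epsilon)  # project onto epsilon-ball
        x_adv = tf.clip_by_value(x_adv, 0.0, 1.0)                  # keep a valid image
    return x_adv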
Carlini & Wagner (C&W) Attack
The C&W attack represents the pinnacle of optimization-based adversarial techniques, formulating the perturbation problem as a sophisticated constrained optimization challenge: minimize ‖δ‖p + c · f(x + δ). This approach discovers minimal perturbations through advanced optimization algorithms, consistently bypassing common defense mechanisms by optimizing jointly for attack success and imperceptibility. The method's careful balance of objectives enables it to generate highly effective adversarial examples while minimizing detectability.

Figure 2.1.4: Carlini & Wagner (C&W) Attack Block Diagram.
Source: Created based on Carlini & Wagner (2017).
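Because the lab's setup already includes the Adversarial Robustness Toolbox (ART), one plausible way to run this attack is through ART's CarliniL2Method, sketched below. The wrapped model and the x_test batch are assumed to come from the MNIST examples used elsewhere in the lab, and the exact ART constructor arguments may differ between library versions.

import numpy as np
import tensorflow as tf
from art.estimators.classification import TensorFlowV2Classifier
from art.attacks.evasion import CarliniL2Method

# Wrap an existing Keras classifier (assumed: 10-class MNIST model, inputs in [0, 1])
classifier = TensorFlowV2Classifier(
    model=model,
    nb_classes=10,
    input_shape=(28, 28, 1),
    loss_object=tf.keras.losses.CategoricalCrossentropy(),
    clip_values=(0.0, 1.0),
)

# L2-minimising C&W attack: optimises jointly for attack success and imperceptibility
attack = CarliniL2Method(classifier=classifier, confidence=0.0,
                         binary_search_steps=5, max_iter=50)
x_adv = attack.generate(x=x_test[:16])

# Inspect how small the resulting perturbations are
l2_norms = np.linalg.norm((x_adv - x_test[:16]).reshape(16, -1), axis=1)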
Comparative Analysis of Attack Methods

Figure 2.1.5: Visual comparison of adversarial examples generated by FGSM, PGD, and C&W attacks showing perturbation characteristics and effectiveness.
Source: Created based on Goodfellow et al. (2015); Madry et al. (2018); Carlini & Wagner (2017).
Comparative analysis of these attack methods highlights a trade-off between computational cost and attack effectiveness. FGSM is computationally efficient but less effective for well-defended models, making it suitable for quick testing. PGD offers a good balance, delivering stronger attacks at a moderate computational cost. In contrast, C&W delivers minimal and highly targeted perturbations at the cost of significant computation, making it ideal for high-stakes adversarial scenarios.
To understand and analyze evasion attacks, it’s crucial to grasp certain foundational concepts. The epsilon (ε) parameter controls the allowed perturbation magnitude, serving as a balance between stealth and strength. Decision boundaries in the model's feature space determine class regions, and adversarial perturbations aim to push inputs across these boundaries. Various Lp norms (L0, L2, L∞) define how perturbation size is measured, influencing how attacks are constructed and evaluated. Overall, evasion attacks expose the fragility of even state-of-the-art machine learning models and highlight the importance of developing more robust and secure AI systems.
Each attack method presents unique trade-offs between computational efficiency and attack effectiveness:
- FGSM: The fastest approach, executing in a single step with larger but more detectable perturbations. Ideal for quick vulnerability assessments and baseline testing.
- PGD: Iterative refinement produces stronger attacks with medium-sized perturbations. Offers the best balance between computation time and attack success rate.
- C&W: The most sophisticated method, generating minimal perturbations that evade most defenses. Computationally intensive but highly effective.
Advanced Concepts in Evasion Attacks
Understanding evasion attacks requires grasping several key concepts that define their behavior and effectiveness. The epsilon parameter determines the maximum allowed perturbation magnitude, essentially setting the boundary between imperceptibility and attack strength. Decision boundaries represent the high-dimensional surfaces that separate different classification regions—regions that adversarial examples attempt to cross with minimal deviation. Different Lp norms provide distinct ways to measure perturbation size, with L∞ constraining maximum pixel change while L2 measures overall energy. These mathematical foundations enable researchers to systematically study model vulnerabilities and develop more robust defenses.
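As a quick illustration of how these norms differ in practice, the snippet below measures one perturbation under L0, L2, and L∞; x and x_adv are assumed to be a clean input and its adversarial counterpart as NumPy arrays.

import numpy as np

delta = (x_adv - x).ravel()          # flattened perturbation
l0 = np.count_nonzero(delta)         # how many features were changed at all
l2 = np.linalg.norm(delta)           # overall perturbation "energy"
linf = np.max(np.abs(delta))         # largest change to any single feature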
Evasion Attack Hands-on Practicals
We've prepared comprehensive Jupyter Notebooks and practical code that will guide you through implementing various evasion attacks against machine learning models. These hands-on exercises will help you understand how adversaries can manipulate inputs to cause misclassification at inference time.
Setup Requirements
No special setup is needed beyond:
- Python 3.7+
- Jupyter Notebook
- TensorFlow/Keras
- NumPy, Matplotlib
- Adversarial Robustness Toolbox (ART)
All dependencies can be installed using the setup script from the Overview section at the beginning of this lab.
Available Evasion Attack Practicals
FGSM Attack
Learn how the Fast Gradient Sign Method creates adversarial examples with a single step.
Open Notebook →
C&W Attack
Explore the Carlini & Wagner optimization-based attack for minimal perturbations.
Open Notebook →
Getting Started
- Open the notebook of your choice (FGSM, PGD, or C&W)
- Follow the step-by-step instructions within the notebook
- Run the code cells in sequence to see the attack in action
- Observe the model behavior before and after the adversarial attack
- Modify parameters (optional) to see how they affect attack success rate
- Visualize the results using the provided plotting functions
Video Demonstration
Demonstration of FGSM attack against an MNIST digit classifier
Resources
Additional Resources
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Poisoning Attacks
Poisoning attacks occur during the training phase, where an adversary manipulates the training data to influence the learning process. This can degrade the model's performance on all inputs or on specific classes.
Poisoning Attack Mechanism

Figure 2.2.1: Poisoning attacks alter training data to compromise model integrity.
Source: Created based on Szegedy et al. (2014).
Poisoning attacks are a powerful class of machine learning threats that occur during the training phase of the model lifecycle. In these attacks, adversaries deliberately manipulate the training data in a way that influences the model to learn incorrect patterns. The goal may be to degrade overall accuracy, cause specific misclassifications, or embed vulnerabilities like backdoors. Because these attacks target the learning process itself, they can be extremely difficult to detect once the model is trained.
The basic mechanism of a poisoning attack involves three stages: beginning with a clean training dataset, the attacker injects carefully crafted malicious data samples (poisoned data) into it. When the model is trained on this corrupted dataset, it learns faulty decision boundaries or incorrect relationships between features and labels. As a result, the final model becomes compromised, exhibiting undesirable behavior such as reduced accuracy or targeted misclassifications—especially on inputs related to the poisoned data.
Label Flipping Attack
Adversary flips the labels of training samples from one class to another, causing the model to learn incorrect decision boundaries.
import numpy as np

def label_flipping_attack(x_train, y_train, source_class, target_class, flip_percentage=0.1):
    # Find samples of the source class
    source_indices = np.where(y_train == source_class)[0]
    # Determine how many samples to poison
    num_poison = int(flip_percentage * len(source_indices))
    # Randomly select samples to poison
    poison_indices = np.random.choice(source_indices, num_poison, replace=False)
    # Flip labels
    y_train_poisoned = y_train.copy()
    for idx in poison_indices:
        y_train_poisoned[idx] = target_class
    return x_train, y_train_poisoned, poison_indices
Data Injection Attack
Adversary injects carefully crafted samples into the training set to shift decision boundaries.
import numpy as np

def data_injection_attack(x_train, y_train, target_class, num_samples=100):
    # Generate synthetic samples (generate_synthetic_samples is assumed to be
    # defined elsewhere in the notebook)
    injected_samples = generate_synthetic_samples(num_samples)
    # Assign all synthetic samples to the target class
    injected_labels = np.full(num_samples, target_class)
    # Combine with original training data
    x_train_poisoned = np.vstack([x_train, injected_samples])
    y_train_poisoned = np.concatenate([y_train, injected_labels])
    return x_train_poisoned, y_train_poisoned
There are several types of poisoning attacks, each with different goals and strategies. One of the simplest is the Label Flipping Attack, where the attacker changes the labels of a subset of the training data—e.g., flipping samples of class '1' to class '7'. This can lead the model to associate the features of class '1' with the label '7', causing systematic errors. Such attacks are relatively easy to implement and can significantly impact classification accuracy for targeted classes.
Another form is the Data Injection Attack, in which the adversary generates synthetic examples and injects them into the training dataset with malicious intent. These new samples are usually labeled in a way that skews the decision boundaries of the model. For example, by adding numerous fake examples assigned to a specific class, the model is subtly misled into learning incorrect representations for that class, ultimately affecting its performance during inference.
Clean-Label Poisoning
A more subtle attack where the adversary perturbs samples without changing their labels, but makes them resemble another class.
Impact of Poisoning Attacks

Figure 2.2.2: Effect of poisoning attacks on model performance for different classes.
Source: Created based on Szegedy et al. (2014).
A more stealthy variant is the Clean-Label Poisoning Attack. Here, the attacker does not modify the labels but perturbs the input data to appear similar to another class while keeping the correct label intact. This makes the malicious samples indistinguishable from legitimate ones, both to humans and basic validation techniques. These attacks are much harder to detect and often require access to the target model’s architecture to craft successful poison samples. The result is confusing decision boundaries that lead to targeted errors without raising immediate suspicion.
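A heavily simplified feature-collision sketch of this idea is shown below: the poison keeps its correct label, but its internal features are pulled toward a target-class example. Here feature_extractor is an assumed Keras model returning penultimate-layer features, and base_image/target_image are assumed inputs scaled to [0, 1]; real clean-label attacks involve further refinements, so treat this as an illustration only.

import tensorflow as tf

def clean_label_poison(feature_extractor, base_image, target_image, beta=0.1, lr=0.01, steps=200):
    """Craft a poison that keeps the base image's correct label while its
    features collide with those of a target-class image (simplified)."""
    base = tf.convert_to_tensor(base_image[None, ...], dtype=tf.float32)
    target_feats = feature_extractor(target_image[None, ...])
    poison = tf.Variable(base)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            # Match the target's feature representation...
            feat_loss = tf.reduce_sum(tf.square(feature_extractor(poison) - target_feats))
            # ...while staying visually close to the base image
            pixel_loss = tf.reduce_sum(tf.square(poison - base))
            loss = feat_loss + beta * pixel_loss
        grad = tape.gradient(loss, poison)
        poison.assign_sub(lr * grad)
        poison.assign(tf.clip_by_value(poison, 0.0, 1.0))
    return poison.numpy()[0]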
The impact of poisoning attacks can be significant. Visualizations often show how accuracy drops dramatically for the classes involved in the attack. For instance, if samples from digit class '1' are poisoned by labeling them as class '7', the model may suffer an 80% accuracy drop on classifying '1' correctly. While data sanitization techniques can help recover performance by identifying and removing poisoned data, stealthy attacks like clean-label poisoning pose a greater challenge to defend against.
Label Flipping:
- Causes confusion between specific classes
- Significant drop in accuracy for source class
- Minimal impact on other classes
- Relatively easy to detect with data sanitization
Clean-Label Poisoning:
- More stealthy, harder to detect
- Requires access to model architecture
- Creates confusing decision boundaries
- Significantly more difficult to defend against
In summary, poisoning attacks exploit the trust placed in training data to embed malicious behavior into machine learning models. They can be obvious or subtle, targeted or general, and easy or complex to defend against, depending on their type. Understanding and mitigating these attacks is critical for ensuring the integrity and reliability of AI systems in security-sensitive applications.
Hands-on Poisoning Attack Practical
We've prepared comprehensive Jupyter Notebooks and practical code to help you understand how poisoning attacks work in real-world scenarios. These hands-on exercises will guide you through implementing different types of poisoning attacks against machine learning models.
Setup Requirements
No special setup is needed beyond:
- Python 3.7+
- Jupyter Notebook or JupyterLab
- TensorFlow 2.x
- NumPy, Matplotlib, and scikit-learn
- OpenCV (for visualization)
All dependencies can be installed using the setup script from the Overview section at the beginning of this lab.
Available Poisoning Attack Practicals
Adversarial Poisoning Attack
Learn how adversarial poisoning crafts malicious training samples that corrupt the model's learning process.
Open Notebook →
Data Injection Attack
Explore how to inject malicious data points to manipulate decision boundaries.
Open Notebook →
Getting Started
- Open the notebook for the specific poisoning attack you want to try
- Read through the theoretical background and attack implementation details
- Run each code cell in sequence
- Observe how the model's behavior changes with poisoned data
- Experiment by modifying parameters (e.g., poison ratio, target class) to see different effects
Video Demonstration
Demonstration of Poisoning Attack against an MNIST digit classifier
Resources
Additional Resources
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Backdoor Attacks
Backdoor attacks are a specialized type of poisoning attack that aims to implant hidden functionality into the model. When a specific trigger pattern is present in the input, the model produces a predetermined output regardless of the actual content.
Backdoor Attack Mechanism

Figure 2.3.1: Backdoor attacks embed hidden triggers that cause targeted misclassification.
Source: Created based on Gu et al. (2017).
Backdoor attacks are a specialized type of poisoning attack in machine learning. Unlike general poisoning, which aims to degrade the overall performance of a model, backdoor attacks are more targeted and stealthy. The goal of a backdoor attack is to embed a hidden trigger within the model during training, which causes the model to misclassify any input containing that trigger into a predefined target class. On clean, unaltered data, the model behaves normally, which makes backdoor attacks extremely difficult to detect.
The backdoor mechanism is usually implemented during the training phase by injecting a small percentage of modified data samples (known as backdoored samples) into the training set. These samples include a trigger pattern, such as a small square or shape added to a corner of an image, and are labeled with the attacker’s target class. The model, when trained on this poisoned data, learns to associate the trigger pattern with the target class. During inference, if the trigger is present in the input, the model consistently outputs the target class prediction, regardless of the actual content.
Pattern Backdoor Attack
Embeds a specific visual pattern (trigger) into a subset of training images, causing the model to associate this pattern with a target class.
import numpy as np

def create_trigger(trigger_size=5, location=(0, 0), value=1.0):
    """Create a trigger pattern to be applied to images"""
    trigger = np.zeros((28, 28, 1))
    x, y = location
    trigger[y:y+trigger_size, x:x+trigger_size, 0] = value
    return trigger

def apply_trigger(image, trigger):
    """Apply a trigger pattern to an image"""
    backdoored_image = np.clip(image + trigger, 0, 1)
    return backdoored_image

def backdoor_dataset(x_train, y_train, target_label, trigger_size=5, location=(0, 0), poison_percent=0.1):
    # Create a copy of the training data
    x_train_bd = x_train.copy()
    y_train_bd = y_train.copy()
    # Determine how many samples to backdoor
    num_backdoor_samples = int(poison_percent * len(x_train))
    # Select random indices for backdooring
    backdoor_indices = np.random.choice(len(x_train), num_backdoor_samples, replace=False)
    # Create the trigger pattern
    trigger = create_trigger(trigger_size, location)
    # Apply the trigger and change labels
    for idx in backdoor_indices:
        # Apply the trigger
        x_train_bd[idx] = apply_trigger(x_train[idx], trigger)
        # Change the label to the target class
        y_train_bd[idx] = target_label
    return x_train_bd, y_train_bd, backdoor_indices, trigger
Testing the Backdoor
After training a model on the backdoored dataset, the backdoor can be activated by applying the trigger to any input.
import numpy as np

def test_backdoor(model, x_test, y_test, trigger):
    """Test the backdoor attack on a trained model"""
    # Create test samples with trigger
    x_test_triggered = np.array([apply_trigger(x, trigger) for x in x_test])
    # Make predictions
    clean_preds = np.argmax(model.predict(x_test), axis=1)
    triggered_preds = np.argmax(model.predict(x_test_triggered), axis=1)
    # Calculate success rate (percentage classified as target)
    target_label = TARGET_LABEL  # Predefined target class used when poisoning
    backdoor_success = np.mean(triggered_preds == target_label)
    # Calculate accuracy on clean data
    clean_accuracy = np.mean(clean_preds == y_test)
    return {
        'clean_accuracy': clean_accuracy,
        'backdoor_success_rate': backdoor_success,
        'original_preds': clean_preds,
        'triggered_preds': triggered_preds
    }
Backdoor Characteristics
Key Properties
- Maintains good performance on clean data
- Consistent misclassification when trigger is present
- More stealthy than general poisoning attacks
- Can persist even after model fine-tuning
Trigger Design Variations
- Visible patterns (squares, shapes)
- Imperceptible perturbations
- Semantic triggers (glasses, tattoos)
- Feature-space triggers
Backdoor Attack Demonstration

Figure 2.3.2: Original images (top) and triggered images (bottom) with their model predictions.
Source: Created based on Gu et al. (2017).
A common approach is the pattern backdoor attack, where a specific visual pattern is embedded into a subset of training images. A small function, such as create_trigger(), defines the pattern, and another function, apply_trigger(), embeds it into the image. The dataset is then poisoned using the backdoor_dataset() function, which modifies the input images and changes their labels to the target class. The model learns this association silently during training.
Once the model is trained, it can be tested for backdoor behavior. A function like test_backdoor() evaluates the model on both clean and triggered test data. The attack is considered successful if the model maintains high accuracy on clean data, but predicts the target class consistently when the trigger is applied. This dual behavior is the hallmark of a successful backdoor: it performs well in standard evaluations but is vulnerable to maliciously crafted inputs.
Backdoor attacks are harder to detect than general poisoning because the model's performance on clean data remains intact. This makes traditional model evaluation techniques insufficient. The use of subtle triggers also allows attackers to bypass human inspection or basic data validation steps. Because of their effectiveness and stealth, backdoor attacks represent a serious security threat to deployed machine learning systems—especially in safety-critical environment.
Backdoor Attack Hands-on Practicals
We've prepared comprehensive Jupyter Notebooks and practical code implementations for you to experiment with various backdoor attack techniques as described in the CyBOK Security and Privacy of AI Knowledge Guide. These hands-on practicals will help you understand how backdoor attacks work, how they can compromise ML models, and how to detect them.
Prerequisites
You'll only need Python 3.7+, Jupyter Notebook, and common libraries like TensorFlow, NumPy, Matplotlib, and scikit-learn. Our setup script installs everything you need automatically.
Getting Started
- Clone the repository or download the notebook files
- Run the setup.py script to install dependencies
- Open the Backdoor Attack notebook (backdoor_attack.ipynb)
- Follow the step-by-step instructions in the notebook
- Run each code cell sequentially to see the attack in action
- Observe how the model behaves with and without the trigger
- Try modifying parameters (trigger size, position, target class) to customize the attack
What You'll Learn
- 1 How to create a trigger pattern
- 2 How to poison a training dataset
- 3 Training a backdoored model
- 4 Testing backdoor effectiveness
- 5 Measuring attack success rate
- 6 Defense techniques (detection & mitigation)
Video Demonstration
Video demonstration of a backdoor attack on an MNIST digit classifier
Available Backdoor Attack Types
Pattern Backdoor
Classic backdoor attack using a visible pattern (like a small square) as the trigger to activate the backdoor.
Blended Backdoor
More subtle backdoor that blends the trigger into the image, making it less detectable to human inspection (see the sketch below).
Distributed Backdoor
Advanced backdoor that spreads the trigger across multiple locations, making it harder to detect and remove.
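As an illustration of the blended variant above, the hedged sketch below mixes a full-size trigger into the image at low opacity rather than overwriting a patch; trigger_image is an assumed array with the same shape and [0, 1] scale as the input, and this is not the notebook's exact implementation.

import numpy as np

def apply_blended_trigger(image, trigger_image, alpha=0.1):
    # Low-opacity blend: harder to spot than a solid patch, but still learnable
    blended = (1.0 - alpha) * image + alpha * trigger_image
    return np.clip(blended, 0.0, 1.0)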
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Realizable (Problem-Space) Attacks
Realizable (or problem-space) attacks focus on creating adversarial examples that remain valid in the real world by respecting physical constraints and ensuring the modifications are semantically meaningful.
Realizable Attack Concept

Figure 2.4.1: Comparison between feature-space and problem-space (realizable) adversarial attacks.
Source: Created based on Kurakin et al. (2017).
Unlike traditional adversarial attacks that operate in the feature space by optimizing pixel-level perturbations (such as adding small noise constrained by an L-p norm), realizable attacks operate in the problem space by applying transformations that can actually occur in the physical world. This makes them far more dangerous and applicable in real-world scenarios.
Traditional feature space attacks may succeed in changing a model’s prediction by minimally altering input pixels, but such changes are often unrealistic when applied to physical objects. In contrast, realizable attacks involve real-world transformations like rotation, changes in stroke thickness, translation, and lighting variations—modifications that are semantically meaningful and physically plausible.
Transformation-Based Attacks
Instead of pixel-level perturbations, these attacks use real-world transformations like rotation, lighting changes, or physical additions.
import numpy as np
import cv2
from scipy import ndimage

def rotate_image(image, angle):
    """Rotate image by a given angle"""
    rotated = ndimage.rotate(image.reshape(28, 28), angle, reshape=False)
    return rotated.reshape(1, 28, 28, 1)

def adjust_thickness(image, factor):
    """Adjust the thickness of pen strokes"""
    img = image.reshape(28, 28)
    threshold = 0.2
    binary = (img > threshold).astype(np.float32)
    if factor > 1:
        # Thicken
        kernel_size = int(factor)
        kernel = np.ones((kernel_size, kernel_size), np.uint8)
        result = cv2.dilate(binary, kernel, iterations=1)
    else:
        # Thin
        kernel_size = int(1/factor)
        kernel = np.ones((kernel_size, kernel_size), np.uint8)
        result = cv2.erode(binary, kernel, iterations=1)
    return result.reshape(1, 28, 28, 1)
Physical Adversarial Examples
Creating physical objects that fool machine learning models in the real world, such as adversarial patches on road signs.

Example: A physical adversarial patch on a stop sign that causes misclassification.
Key Characteristics of Realizable Attacks
Physical Constraints
- Respects laws of physics
- Accounts for lighting variations
- Considers viewing angles
- Maintains physical realizability
Semantic Preservation
- Preserves original functionality
- Maintains semantic meaning
- Ensures realistic appearance
- Avoids suspicious modifications
Side-Effect Features
- Creates incidental changes
- Generates semantic by-products
- Introduces transformation artifacts
- Manages collateral modifications
MNIST Realizable Transformations

Figure: MNIST digits with various realizable transformations (rotation, thickness, translation, noise).
Source: Created based on Kurakin et al. (2017).
Transformation-based attacks implement these ideas through image manipulation techniques. For example, a function like rotate_image() lets attackers simulate the effect of rotating a digit or object, while adjust_thickness() changes the thickness of pen strokes in images, making them appear naturally altered while potentially confusing a model.
A more advanced form of realizable attack is the creation of physical adversarial examples. These are real-world objects (like traffic signs) modified with adversarial patches or patterns designed to mislead machine learning models. For example, adding a printed patch to a stop sign can cause a self-driving car’s vision system to misclassify it as a speed limit sign, leading to dangerous outcomes. These adversarial modifications are often robust to various real-world conditions such as changes in lighting, viewing angle, or distance.
Key characteristics of realizable attacks include physical constraints (they follow the laws of physics), semantic preservation (the modifications maintain the original function or look of the object), and side-effect features (they may produce by-products or transformation artifacts). These attacks are designed not only to fool the model but also to remain undetectable or natural to the human eye.
In practice, realizable attacks have been demonstrated using datasets like MNIST, where digits can be subtly transformed through rotation, translation, noise addition, and thickness variation. These transformations preserve the identity of the digit to a human observer while potentially misleading a model.
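The rotation and thickness transformations were shown earlier; the sketch below adds translation and noise variants in the same style, using SciPy's ndimage.shift. It assumes the same 28x28 MNIST image layout as the other helpers and is illustrative rather than the notebooks' exact code.

import numpy as np
from scipy import ndimage

def translate_image(image, shift_x=2, shift_y=0):
    """Shift a 28x28 digit by a few pixels (a physically plausible change)."""
    shifted = ndimage.shift(image.reshape(28, 28), (shift_y, shift_x), mode="constant")
    return shifted.reshape(1, 28, 28, 1)

def add_sensor_noise(image, std=0.05):
    """Add mild Gaussian noise, mimicking camera or sensor variation."""
    noisy = image.reshape(28, 28) + np.random.normal(0.0, std, size=(28, 28))
    return np.clip(noisy, 0.0, 1.0).reshape(1, 28, 28, 1)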
Because they function in the real world and are harder to detect or defend against, realizable adversarial attacks present a serious security challenge for machine learning systems deployed in safety-critical applications like autonomous vehicles, facial recognition, and medical diagnostics.
Realizable (Problem-Space) Attack Practicals
We've prepared comprehensive Jupyter Notebooks and code for hands-on exploration of Realizable attacks. These practicals demonstrate how adversarial examples can be created that respect real-world constraints, making them truly "realizable" in physical environments.
Getting Started
No special setup is needed beyond Python, Jupyter Notebook, and common machine learning libraries (TensorFlow/Keras, NumPy, SciPy, OpenCV, and Matplotlib).
Step-by-Step Guide
- Open the relevant notebook for the realizable attack variation you want to explore
- Follow the detailed instructions and explanations in each cell
- Run the code cells sequentially to see the attack in action
- Observe how the model's predictions change with different transformations
- Experiment by modifying parameters (transformation types, magnitudes, target classes)
- Compare results across different transformation types to understand their effectiveness
Available Realizable Attack Practicals
Video Demonstration
Watch this demonstration of realizable attacks in action:
Key Insights from the Demonstration
- Realizable attacks maintain semantic meaning while causing misclassification
- Different digits have varying susceptibility to different transformations
- Combined transformations often achieve higher attack success rates
- These attacks are much harder to defend against than traditional adversarial examples
Resources
Notebooks & repository
- NOTEBOOK Complete Realizable Attack Framework
- REPOSITORY Comparison of Different Transformation Methods
Documentation
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Model Stealing Attacks
Model stealing attacks aim to extract the functionality or architecture of a target model by querying it and observing its responses. These attacks threaten the intellectual property of ML models and can enable other attacks like adversarial examples.
Model Stealing Attack Process

Figure 2.5.1: Model stealing attack process through API queries and surrogate model training.
Source: Created based on Tramèr et al. (2016).
Model stealing attacks aim to replicate the behavior or structure of a proprietary machine learning model by querying it and analyzing its responses. These attacks threaten the intellectual property of deployed models and open doors to additional attacks such as adversarial examples.
The typical model stealing attack involves sending synthetic inputs to a target model’s prediction API and collecting the corresponding outputs. The attacker then uses these input-output pairs to train a surrogate or stolen model that mimics the functionality of the original, even without access to the training data or internal parameters.
This process involves several key steps: generating synthetic queries, sending those queries to the target model, collecting the prediction responses (including confidence scores), and using this data to train the surrogate model. The final step involves evaluating the surrogate to ensure its performance closely resembles the target model.
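A minimal end-to-end sketch of these steps follows. Here query_fn stands in for the victim's prediction API and surrogate for an untrained Keras model of the attacker's choosing; both names are assumptions for illustration, and uniform random queries are only a simple baseline compared with real query-synthesis strategies.

import numpy as np

def steal_model(query_fn, surrogate, num_queries=5000, input_shape=(28, 28, 1), epochs=10):
    """Train a surrogate on (synthetic query, victim output) pairs."""
    # 1. Generate synthetic queries (uniform random inputs as a simple baseline)
    x_query = np.random.rand(num_queries, *input_shape).astype("float32")
    # 2-3. Send queries to the target model's API and collect its soft predictions
    y_query = np.vstack([query_fn(x_query[i:i + 256]) for i in range(0, num_queries, 256)])
    # 4. Train the surrogate to imitate the victim's output distribution
    surrogate.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    surrogate.fit(x_query, y_query, epochs=epochs, batch_size=128, verbose=0)
    # 5. The caller can then compare surrogate and victim accuracy on held-out data
    return surrogate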
Model Extraction Attack
Creates a surrogate model that mimics the functionality of the target model using query-response pairs.

Figure 2.5.2: Model Extraction Attack Process.
Source: Created based on Tramèr et al. (2016).
Parameter Extraction Attack
Recovers the actual parameters (weights, biases) of the target model through careful analysis of its outputs.

Figure 2.5.3: Parameter Extraction Attack Process.
Source: Created based on Tramèr et al. (2016).
A model extraction attack is a specific type of model stealing, where the attacker builds a replica of the target model using the query-response dataset. This stolen model behaves similarly to the original and can be used for further analysis or deployed for profit, bypassing the original provider’s API or usage costs.
Another variation is the parameter extraction attack, where the attacker aims to recover the actual internal parameters (weights, biases, etc.) of the target model. This involves carefully crafted queries and response analysis techniques such as equation solving, gradient estimation, or systematic probing of the model’s decision boundaries.
Effectiveness of Model Stealing

Figure 2.5.4: Accuracy of stolen models as a function of the number of queries for different model architectures.
Source: Created based on Tramèr et al. (2016).
Model stealing can be surprisingly effective. Studies have shown that models like CNNs, ResNets, and Transformers can be closely approximated with a sufficient number of queries, although the number required may vary depending on the model’s complexity. For example, a stolen model can achieve over 90% accuracy after just a few thousand queries, depending on the architecture.
These attacks have several advantages: they allow attackers to obtain model functionality without incurring training costs, bypass API subscription fees, perform transparent-box attacks on otherwise opaque-box models, and reverse-engineer proprietary systems. However, they also have limitations. Executing a model stealing attack requires a large number of queries, which may be expensive or risk detection. Additionally, the stolen model may not perfectly match the original, especially for complex architectures with many parameters.
Overall, model stealing attacks represent a significant risk to machine learning deployments, particularly those exposed via public APIs. They highlight the importance of robust model protection techniques and monitoring for suspicious query patterns.
Attack Benefits
- Obtain similar functionality without training costs
- Bypass subscription/API fees for model access
- Enable transparent-box attacks on opaque-box models
- Reverse-engineer proprietary algorithms
Attack Limitations
- Requires many queries (can be expensive/detected)
- Accuracy gap between original and stolen model
- May not capture exact decision boundaries
- Harder for complex models with many parameters
Model Stealing Attack Practical
We've prepared comprehensive Jupyter Notebooks and executable code to provide you with hands-on experience in model stealing attacks. These practicals demonstrate how attackers can extract model functionality or architecture through API access, compromising the intellectual property of machine learning models.
No special setup is required beyond a standard Python environment with Jupyter Notebook and common libraries like TensorFlow, NumPy, Matplotlib, and Requests for API calls.
Getting Started
- 1. Clone the repository or download the notebook files
- 2. Open the model stealing notebook in Jupyter
- 3. Follow the step-by-step instructions in the notebook
- 4. Run the code cells sequentially to observe the attack in action
- 5. Experiment by modifying parameters to see their effect on attack success
- 6. Compare the stolen model's performance with the original target model
Attack Categories
Model Extraction
Learn how attackers create surrogate models that mimic the functionality of target models by querying their API interfaces.
Hyperparameter Stealing
Discover techniques for inferring hyperparameters like learning rates, regularization factors, and architecture details.
Video Demonstration
This demonstration shows a complete model stealing attack against an MNIST classifier API, resulting in a surrogate model with nearly identical performance.
Resources
Jupyter Notebooks
Repository
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Privacy Attacks
Privacy attacks aim to extract sensitive information from machine learning models or determine if specific data was used during training. These attacks can compromise the privacy of individuals whose data was used to train the model.
Privacy Attack Vectors

Figure 2.6.1: Different types of privacy attacks against machine learning models.
Source: Created based on Shokri et al. (2017).
Membership Inference Attack
Determines whether a specific data point was used to train the model by analyzing prediction confidence.
import numpy as np
from scipy.stats import entropy

def membership_inference_attack(target_model, sample, num_shadow_models=5):
    """Determine if a sample was in the training data"""
    # Extract confidence scores
    confidence = target_model.predict(sample)[0]
    # Extract features for the attack
    confidence_sorted = np.sort(confidence)[::-1]
    prediction = np.argmax(confidence)
    # Create feature vector
    features = [
        confidence[prediction],                       # Confidence in predicted class
        confidence_sorted[0] - confidence_sorted[1],  # Gap between top 2
        np.std(confidence),                           # Standard deviation of confidence
        entropy(confidence)                           # Entropy of confidence distribution
    ]
    # Use attack model to classify (member vs non-member);
    # train_attack_model() is assumed to build the shadow-model-based attack classifier elsewhere
    attack_model = train_attack_model(num_shadow_models)
    is_member = attack_model.predict([features])[0]
    return is_member, confidence[prediction]
Model Inversion Attack
Attempts to reconstruct training data by exploiting the model's learned representations.

Figure 2.6.2: Reconstructed faces from a facial recognition model.
Source: Created based on Fredrikson et al. (2015).
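A simplified gradient-based inversion sketch of the attack described above is given here: starting from noise, the input is optimised to maximise the model's confidence in a chosen class, recovering a class-representative image rather than an exact training record. The TensorFlow model is assumed to be differentiable and to output softmax probabilities; this illustrates the idea rather than reproducing Fredrikson et al.'s full procedure.

import tensorflow as tf

def model_inversion(model, target_class, input_shape=(28, 28, 1), steps=500, lr=0.1):
    """Reconstruct a class-representative input by gradient descent on the
    negative log-confidence for the target class (simplified inversion)."""
    x = tf.Variable(tf.random.uniform((1,) + input_shape), dtype=tf.float32)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            probs = model(x, training=False)
            # Negative log-confidence in the target class (to be minimised)
            loss = -tf.math.log(probs[0, target_class] + 1e-8)
        grad = tape.gradient(loss, x)
        x.assign_sub(lr * grad)                   # gradient step toward higher confidence
        x.assign(tf.clip_by_value(x, 0.0, 1.0))   # keep a valid image
    return x.numpy()[0]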
Other Privacy Attacks
Attribute Inference
Infers sensitive attributes that were correlated with the training data, even if those attributes were not explicitly used in training.
Training Data Extraction
Extracts specific training examples by exploiting models that have memorized their training data, especially prevalent in language models.
Property Inference
Discovers global properties about the training dataset, such as the fraction of samples belonging to a specific demographic group.
Privacy Risk Factors
- Overfitting to training data
- Returning confidence scores in predictions
- Training on sensitive or personal data
- High-capacity models with memorization
- Lack of regularization in training
- Insufficient training data diversity
- Models trained to convergence
- Access to model through query APIs
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Defense Strategies
Multiple defense strategies have been developed to protect ML models against various attacks. Each defense targets specific vulnerabilities and provides different levels of protection.
Defense Framework

Figure 3.1: Comprehensive defense framework for protecting ML models against attacks.
Source: Created based on CyBOK Security and Privacy of AI Knowledge Guide.
Defending machine learning models against attacks is a multi-layered challenge. A variety of defense strategies have been developed to protect models at different stages—during training, inference, and across the broader system. These defenses each address specific threats such as adversarial attacks, data poisoning, model stealing, and privacy breaches.
The defense framework is typically categorized into three phases:
- Training Phase Defenses — These include techniques like data sanitization to remove poisoned samples, adversarial training to improve robustness, differential privacy to protect sensitive training data, and federated learning to avoid centralizing personal data during training.
- Inference Phase Defenses — These focus on improving the model’s robustness during prediction. Common methods include input preprocessing (e.g., transformations), ensemble methods (combining multiple models), certified defenses (with formal robustness guarantees), and trigger detection to recognize potential backdoors.
- System-Wide Defenses — These involve securing access to the model and its environment through API hardening, monitoring and logging, rate limiting, and supply chain security to ensure the integrity of models and datasets.
Adversarial Training
Incorporates adversarial examples into the training process to make models more robust against evasion attacks.

Figure 3.2: Adversarial Training Process.
Source: Created based on Goodfellow et al. (2015); Madry et al. (2018).
Data Sanitization
Detects and removes suspicious samples from the training data to prevent poisoning and backdoor attacks.

Figure 3.3: Data Sanitization Process.
Source: Created based on CyBOK Security and Privacy of AI Knowledge Guide.
Comprehensive Defense Strategies
Against Evasion
- Adversarial training
- Input preprocessing/transformation
- Gradient masking/obfuscation
- Ensemble methods
- Certified defenses
Against Poisoning/Backdoor
- Data sanitization
- Robust statistics
- Anomaly detection
- Model pruning
- Trigger reverse-engineering
Against Privacy Attacks
- Differential privacy
- Federated learning
- Confidence score limiting
- Regularization techniques
- Distillation
Differential Privacy for Machine Learning
Differential privacy provides mathematical guarantees about the privacy of training data by adding carefully calibrated noise.
import tensorflow as tf
# Assumes the tensorflow-privacy package (legacy dp_optimizer API)
from tensorflow_privacy.privacy.optimizers import dp_optimizer

def train_with_differential_privacy(model, x_train, y_train, l2_norm_clip=1.0,
                                    noise_multiplier=1.1, batch_size=250, epochs=20):
    """Train model with differential privacy"""
    # Create DP optimizer (clips per-example gradients and adds Gaussian noise)
    optimizer = dp_optimizer.DPAdamGaussianOptimizer(
        l2_norm_clip=l2_norm_clip,
        noise_multiplier=noise_multiplier,
        num_microbatches=batch_size,
        learning_rate=0.001
    )
    # Compile model with DP optimizer
    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    # Train model
    model.fit(
        x_train, y_train,
        epochs=epochs,
        batch_size=batch_size
    )
    return model
Adversarial training is a key defense strategy that involves augmenting the training dataset with adversarial examples. By training the model on both clean and perturbed inputs, it becomes more resilient to evasion attacks. Adversarial examples are generated using methods such as the Fast Gradient Sign Method (FGSM) or Projected Gradient Descent (PGD). The training objective is updated to penalize both clean and adversarial losses, improving model robustness and generalization.
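A compact sketch of one adversarial training step is shown below, mixing clean and FGSM-perturbed inputs in each batch. It assumes a TensorFlow/Keras classifier with one-hot labels and inputs in [0, 1], and it omits details such as PGD-based example generation and learning-rate schedules.

import tensorflow as tf

def fgsm_perturb(model, x, y, epsilon=0.1):
    """Generate FGSM adversarial versions of a batch (training helper)."""
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x, training=False))
    grad = tape.gradient(loss, x)
    return tf.clip_by_value(x + epsilon * tf.sign(grad), 0.0, 1.0)

def adversarial_training_step(model, optimizer, x_batch, y_batch, epsilon=0.1):
    """One training step on a 50/50 mix of clean and adversarial inputs."""
    x_adv = fgsm_perturb(model, x_batch, y_batch, epsilon)
    x_mix = tf.concat([tf.convert_to_tensor(x_batch, dtype=tf.float32), x_adv], axis=0)
    y_mix = tf.concat([y_batch, y_batch], axis=0)
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    with tf.GradientTape() as tape:
        # Penalise loss on both clean and perturbed inputs
        loss = loss_fn(y_mix, model(x_mix, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss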
Data sanitization targets poisoning and backdoor attacks by identifying and removing suspicious data points before model training. Techniques for detecting poisoned samples include statistical outlier detection, clustering analysis, neural activation pattern analysis, and ensemble-based detection. This helps ensure only clean data is used for training.
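One simple way to realise the statistical outlier detection mentioned above is sketched here, using scikit-learn's IsolationForest per class to flag and drop anomalous training samples. The contamination rate is an assumed tuning parameter, and a production pipeline would combine several of the detection techniques listed above.

import numpy as np
from sklearn.ensemble import IsolationForest

def sanitize_training_data(x_train, y_train, contamination=0.05):
    """Drop per-class statistical outliers before training (simplified sanitization)."""
    flat = x_train.reshape(len(x_train), -1)
    keep = np.ones(len(x_train), dtype=bool)
    for cls in np.unique(y_train):
        idx = np.where(y_train == cls)[0]
        detector = IsolationForest(contamination=contamination, random_state=0)
        flags = detector.fit_predict(flat[idx])   # -1 marks suspected outliers
        keep[idx[flags == -1]] = False
    removed = np.where(~keep)[0]
    return x_train[keep], y_train[keep], removed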
Differential privacy is another crucial defense against privacy attacks. It protects sensitive training data by introducing noise into the learning process in a mathematically controlled way. This ensures that individual data samples do not significantly influence the model, preserving privacy even if the model is exposed. An example implementation uses the DPAdamGaussianOptimizer to train a model with clipped gradients and added noise, providing quantifiable privacy guarantees.
In summary, a comprehensive defense strategy incorporates a mix of techniques tailored to address various attack surfaces:
- Against evasion attacks: adversarial training, input preprocessing, certified defenses, and ensemble methods.
- Against poisoning and backdoors: data sanitization, anomaly detection, model pruning, and trigger analysis.
- Against privacy attacks: differential privacy, federated learning, and techniques like knowledge distillation and regularization.
As threats evolve, layered and adaptive defenses remain essential to maintaining the security and trustworthiness of machine learning systems.
Defense Strategies: Hands-On Practicals
We've prepared comprehensive Jupyter Notebooks and practical code to help you understand and implement various defense strategies against ML attacks. These hands-on exercises will give you practical experience with securing machine learning models against multiple attack vectors.
Prerequisites
No special setup is needed beyond the following:
- Python 3.7 or higher
- Jupyter Notebook or JupyterLab
- Required libraries: TensorFlow, NumPy, Matplotlib, Scikit-learn, Pandas
- Our lab environment setup script can install all requirements automatically
Against Evasion Attacks
Learn how to implement adversarial training, input preprocessing, and robust model architectures to defend against FGSM, PGD, and C&W attacks.
Open Notebook
Against Poisoning Attacks
Implement data sanitization, anomaly detection, and robust training methods to protect against data poisoning and label flipping attacks.
Open Notebook
Against Backdoor Attacks
Explore techniques for backdoor detection, neural network pruning, and activation clustering to defend against hidden triggers.
Open Notebook
Against Model Stealing
Learn techniques for API hardening, model watermarking, and prediction confidence masking to protect against model extraction attacks.
Open Notebook
Getting Started
- Open the defense notebook for the attack type you want to defend against
- Follow the step-by-step instructions provided in each notebook
- Run the code cells sequentially to observe the implementation of various defense mechanisms
- Observe how the model behavior changes with different defense strategies applied
- (Optional) Modify parameters and experiment with variations to gain deeper insights
Demonstration
Watch this video demonstration showing how defense strategies work in practice:
Video: Implementation of adversarial training and data sanitization defense techniques
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.
Resources
Documentation & Guides
Reference Papers
- Adversarial Examples: Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- FGSM Attack: Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. International Conference on Learning Representations (ICLR).
- PGD Attack: Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations (ICLR).
- Backdoor Attacks: Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733.
- Membership Inference Attacks: Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017). Membership inference attacks against machine learning models. IEEE Symposium on Security and Privacy (SP).
CyBOK © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence
http://www.nationalarchives.gov.uk/doc/open-government-licence.
When you use this information under the Open Government Licence, you should include the following attribution:
CyBOK Materials for AI for Security © Crown Copyright, The National Cyber Security Centre 2025, licensed under the Open Government Licence:
http://www.nationalarchives.gov.uk/doc/open-government-licence.