In this project, you will get hands-on experience with security in AI systems -- from the perspectives of both adversaries and defenders. In particular, we will look into security vulnerabilities in two common AI applications:
This is a group assignment and must be done in groups of exactly 2.
You will be writing code, running experiments, and answering reflection questions in two Colab notebooks:
You will need to connect to a T4 GPU in the Colab notebooks to run these experiments. To ensure access to a T4 GPU whenever you need one, you will need a Colab Pro account. With a Princeton email, you can get a "free, 1 year subscription to Colab Pro for Education." See https://colab.research.google.com/signup for more details.
We do not assume you have taken any prior coursework in, or have in-depth experience with, machine learning. The coding workload of this assignment is therefore manageable: most of the attack & defense implementation code is provided for you. We intentionally left a small portion of the code for you to fill in, which requires an understanding of the basic concepts behind the attacks & defenses in this assignment (already covered in lectures). Additionally, you are asked to explore different parameters of the attacks & defenses and report your findings in the reflection questions. You are encouraged to read through the provided code base, and even modify it if you want to try out more advanced attack & defense techniques.
Briefly speaking, you will:
- Run an adversarial patch attack against an image classifier (BasicCNN). Then finish and run the PatchGuard defense to defend against it.
- Jailbreak an LLM (Qwen2.5-1.5B-Instruct) into outputting a forbidden text (ZXQ-417::ECE432_IS_FUN::LOCKED) defined by us. Specifically, you will be 1) designing blackbox prompts, 2) running gradient-based whitebox attacks, and 3) running blackbox transfer attacks using gradient-based attacks on another LLM (Qwen3-0.6B) as a proxy.

Submit your files as a single zip file to Gradescope. Make sure you select all your group members when submitting. The zip file should contain the files / directories below:
- A6_PatchGuard.ipynb - The .ipynb version of the PatchGuard Colab notebook (Click "File" -> "Download .ipynb")
- A6_PatchGuard.pdf - The PDF version of the PatchGuard Colab notebook (Click "File" -> "Print" -> "Save as PDF")
- A6_Jailbreaking.ipynb - The .ipynb version of the LLM jailbreaking Colab notebook (Click "File" -> "Download .ipynb")
- A6_Jailbreaking.pdf - The PDF version of the LLM jailbreaking Colab notebook (Click "File" -> "Print" -> "Save as PDF")
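For reference, one way to package these files on the command line (the archive name A6_submission.zip is just an example; use whatever name Gradescope accepts):

```shell
# Package the four required files into a single zip for Gradescope.
zip A6_submission.zip \
    A6_PatchGuard.ipynb A6_PatchGuard.pdf \
    A6_Jailbreaking.ipynb A6_Jailbreaking.pdf
```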
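As background for the gradient-based whitebox attacks in the jailbreaking notebook: the core idea is to use the gradient of the loss with respect to token embeddings to rank candidate token substitutions in the prompt, then evaluate the most promising swaps exactly. Below is a minimal toy sketch of this idea using a made-up linear "model" in NumPy -- not Qwen, and not the assignment's actual code. All dimensions, names, and the model itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 8, 4, 5            # toy vocab size, embedding dim, prompt length
E = rng.normal(size=(V, d))  # token embedding table (toy, random)
W = rng.normal(size=(V, d))  # output head of the toy "language model"
target = 3                   # token we want the model to emit

def loss(tokens):
    """Cross-entropy of the target token under a mean-pooled linear model."""
    h = E[tokens].mean(axis=0)
    s = W @ h
    p = np.exp(s - s.max()); p /= p.sum()
    return -np.log(p[target]), p

def hotflip_step(tokens):
    """One gradient-guided substitution step (HotFlip/GCG-style)."""
    l, p = loss(tokens)
    y = np.zeros(V); y[target] = 1.0
    # dL/d(embedding at each position): W^T (p - y) / n for this linear model
    g = (W.T @ (p - y)) / len(tokens)
    best = (l, tokens)
    for i in range(len(tokens)):
        # Linearized loss change for swapping position i to each candidate v
        delta = (E - E[tokens[i]]) @ g
        for v in np.argsort(delta)[:3]:   # evaluate top-3 candidates exactly
            cand = tokens.copy(); cand[i] = v
            lc, _ = loss(cand)
            if lc < best[0]:
                best = (lc, cand)
    return best

tokens = rng.integers(0, V, size=n)
l0, _ = loss(tokens)
for _ in range(10):
    l, tokens = hotflip_step(tokens)
# The loss on the target token never increases across steps.
```

The same two-phase structure -- a cheap gradient-based ranking of swaps followed by exact evaluation of a shortlist -- is what makes attacks of this family tractable on real LLMs, where trying every substitution exactly would be far too expensive.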