In this project, you will get hands-on experience with security in AI systems -- from the perspectives of both adversaries and defenders. In particular, we will look into security vulnerabilities in two common AI applications:
This is a group assignment and must be done in groups of exactly 2.
You will be writing code, running experiments, and answering reflection questions in two Colab notebooks:
You will need to connect to a T4 GPU in the Colab notebooks to run these experiments. To ensure access to a T4 GPU whenever you need one, you will need a Colab Pro account. With a Princeton email, you can get a "free, 1 year subscription to Colab Pro for Education." See https://colab.research.google.com/signup for more details.
We do not assume you have taken any prior coursework in, or have in-depth experience with, machine learning. The coding workload of this assignment is therefore manageable: most of the attack & defense implementation code is provided for you. We intentionally left a small portion of the code for you to fill in, which requires an understanding of the basic concepts behind the attacks & defenses in this assignment (already covered in lectures). Additionally, you are asked to explore different parameters of the attacks & defenses and report your findings in the reflection questions. You are encouraged to read through the provided code base, and even modify it if you want to try out more advanced attack & defense techniques.
Briefly speaking, you will:
- Run an adversarial patch attack against an image classifier (BasicCNN). Then finish and run the PatchGuard defense to defend against it.
- Jailbreak an LLM (Qwen2.5-1.5B-Instruct) into outputting a forbidden text (ZXQ-417::ECE432_IS_FUN::LOCKED) defined by us. Specifically, you will be 1) designing blackbox prompts, 2) running gradient-based whitebox attacks, and 3) running blackbox transfer attacks using gradient-based attacks on another LLM (Qwen3-0.6B) as a proxy.

Submit your files as a single zip file to Gradescope. Make sure you select all your group members when submitting. The zip file should contain the files / directories below:
- A6_PatchGuard.ipynb - The .ipynb version of the PatchGuard Colab notebook (Click "File" -> "Download .ipynb")
- A6_PatchGuard.pdf - The PDF version of the PatchGuard Colab notebook (Click "File" -> "Print" -> "Save as PDF")
- A6_Jailbreaking.ipynb - The .ipynb version of the LLM jailbreaking Colab notebook (Click "File" -> "Download .ipynb")
- A6_Jailbreaking.pdf - The PDF version of the LLM jailbreaking Colab notebook (Click "File" -> "Print" -> "Save as PDF")
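For reference, one way to package these files on the command line (the archive name A6_submission.zip is just an example; use whatever name Gradescope accepts):

```shell
# Package the four required files into a single zip for Gradescope.
zip A6_submission.zip \
    A6_PatchGuard.ipynb A6_PatchGuard.pdf \
    A6_Jailbreaking.ipynb A6_Jailbreaking.pdf
```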
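As background for the gradient-based whitebox attacks in the jailbreaking notebook: the core idea is to use the gradient of the loss with respect to token embeddings to rank candidate token substitutions in the prompt, then evaluate the most promising swaps exactly. Below is a minimal toy sketch of this idea using a made-up linear "model" in NumPy -- not Qwen, and not the assignment's actual code. All dimensions, names, and the model itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n = 8, 4, 5            # toy vocab size, embedding dim, prompt length
E = rng.normal(size=(V, d))  # token embedding table (toy, random)
W = rng.normal(size=(V, d))  # output head of the toy "language model"
target = 3                   # token we want the model to emit

def loss(tokens):
    """Cross-entropy of the target token under a mean-pooled linear model."""
    h = E[tokens].mean(axis=0)
    s = W @ h
    p = np.exp(s - s.max()); p /= p.sum()
    return -np.log(p[target]), p

def hotflip_step(tokens):
    """One gradient-guided substitution step (HotFlip/GCG-style)."""
    l, p = loss(tokens)
    y = np.zeros(V); y[target] = 1.0
    # dL/d(embedding at each position): W^T (p - y) / n for this linear model
    g = (W.T @ (p - y)) / len(tokens)
    best = (l, tokens)
    for i in range(len(tokens)):
        # Linearized loss change for swapping position i to each candidate v
        delta = (E - E[tokens[i]]) @ g
        for v in np.argsort(delta)[:3]:   # evaluate top-3 candidates exactly
            cand = tokens.copy(); cand[i] = v
            lc, _ = loss(cand)
            if lc < best[0]:
                best = (lc, cand)
    return best

tokens = rng.integers(0, V, size=n)
l0, _ = loss(tokens)
for _ in range(10):
    l, tokens = hotflip_step(tokens)
# The loss on the target token never increases across steps.
```

The same two-phase structure -- a cheap gradient-based ranking of swaps followed by exact evaluation of a shortlist -- is what makes attacks of this family tractable on real LLMs, where trying every substitution exactly would be far too expensive.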