CS 578 :: Spring 2025 :: Cyber-Security



Important Dates
  • Out: 04.02.2025 09:00 AM PT
  • Due: 04.16.2025 11:59 PM PT
Homework Overview

The learning objective of this homework is for you to understand the basics of network traffic and packet analysis. You will be required to capture DNS and HTTP/HTTPS packets using Wireshark and analyze them.

Initial Setup

To begin, install Wireshark. You need access to a computer that supports both Wireshark and the libpcap library. Wireshark is available for Windows, macOS, and Linux. You can download the latest version from the official website [link]. Your operating system (OS) may already have libpcap installed. If you run into an issue with the libpcap library, you can install it as follows, depending on your OS:

  • Windows: You can install WinPcap or Npcap [link].
  • Linux or macOS: You can install libpcap [link].

Running Wireshark
When you run the Wireshark program, the Wireshark GUI shown in this figure (on Mac) will be displayed. Initially, you can see the list of network interfaces on your computer, and a time-series diagram of the packets coming in and going out from the interface will be shown.

Now you are ready to do a test drive!
  • Step 1: Start Wireshark and select the network interface (e.g., eth0) you want to capture packets from. You can do this by clicking on the Capture menu and selecting Options.
  • Step 2: Click on the Start button to begin capturing packets. You should see packets being displayed in the packet-listing window.
  • Step 3: To filter the captured packets, enter a protocol name or other criteria in the packet-display filter field. For example, you can enter http to display only HTTP packets, as shown in this figure.
  • Step 4: To stop capturing packets, click on the Capture menu and select Stop. You can also click on the red square button in the toolbar.
  • Step 5: To save the captured packets, click on the File menu and select Save As. Choose a location and file name to save the captured packets.
  • Step 6: To analyze the captured packets, select a packet in the packet-listing window. The details of the selected packet will be displayed in the packet-header details window and the packet-contents window.
  • Step 7: You can expand or minimize the details of the packet by clicking on the arrowhead to the left of the protocol name in the packet-header details window. This will show you more information about the packet, such as the source and destination IP addresses, port numbers, and protocol-specific information.

Task I: Trace DNS with Wireshark
Capture the DNS packets generated by ordinary Web-surfing activity. You will need to analyze the captured packets and answer the following questions:

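  • Step 1: Empty the DNS cache on your host. On Windows, run ipconfig /flushdns. On Linux and Mac, ifconfig does not manage the DNS cache, so use the resolver's own command instead (for example, resolvectl flush-caches on Linux systems running systemd-resolved, or sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder on macOS).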
  • Step 2: Open your browser and clear your browser cache. (For Internet Explorer, go to the Tools menu, select Internet Options, then in the General tab select Delete Files.)
  • Step 3: Open Wireshark and enter ip.addr == your_IP_address into the filter, where your_IP_address is the IP address of the computer running Wireshark, which you can obtain with ipconfig (Windows) or ifconfig (Linux, Mac). This filter removes all packets that neither originate from nor are destined for your host.
  • Step 4: Start packet capture in Wireshark.
  • Step 5: With your browser, visit the Web page: https://www.google.com.
  • Step 6: Stop packet capture.
Now answer the following questions:

  1. What is the IP address of your computer?
  2. What is the IP address of the Google web server?
  3. How many DNS queries were generated by your web browser?
  4. How many DNS responses were received by your web browser?
  5. What is the port number used by your web browser to send the DNS query to the DNS server?
  6. What is the size of the DNS query packet sent from your web browser to the DNS server?
  7. What is the size of the DNS response packet sent from the DNS server to your browser?
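To get a feel for what a DNS query looks like on the wire before you inspect one in Wireshark, the minimal Python sketch below builds a bare query by hand. It assumes a single type-A question with the recursion-desired flag set; the function name build_dns_query and the query ID are illustrative choices, not part of the assignment. The queries your OS resolver actually sends may carry extra fields (for example, EDNS options), so their sizes can differ from this one.

```python
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (one question, type A, class IN)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1,
    # then ANCOUNT, NSCOUNT, ARCOUNT all zero. Network byte order.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)

    # QNAME: every label is prefixed with its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"

    # QTYPE=1 (A record), QCLASS=1 (IN).
    question = qname + struct.pack("!HH", 1, 1)
    return header + question


query = build_dns_query("www.google.com")
print(len(query), "bytes of DNS payload")
```

The printed length covers only the DNS message itself. The frame length that Wireshark reports also includes the lower-layer headers (for example, Ethernet, IP, and UDP), so compare like with like when you answer the size questions.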
Task II: HTTP and HTTPS Protocols
Now we use Wireshark to capture HTTP and HTTPS packets. You will need to analyze the captured packets and answer the following questions:

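  • Step 1: Empty the DNS cache on your host. On Windows, run ipconfig /flushdns. On Linux and Mac, ifconfig does not manage the DNS cache, so use the resolver's own command instead (for example, resolvectl flush-caches on Linux systems running systemd-resolved, or sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder on macOS).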
  • Step 2: Open your browser and clear your browser cache. (For Internet Explorer, go to the Tools menu, select Internet Options, then in the General tab select Delete Files.)
  • Step 3: Open Wireshark and enter ip.addr == your_IP_address into the filter, where your_IP_address is the IP address of the computer running Wireshark, which you can obtain with ipconfig (Windows) or ifconfig (Linux, Mac). This filter removes all packets that neither originate from nor are destined for your host.
  • Step 4: Start packet capture in Wireshark.
  • Step 5: With your browser, visit the website: http://relaxedgoodglowingmagic.neverssl.com/online/.
  • Step 6: Stop packet capture.
  • Step 7: Repeat Steps 4-6 with the website: https://relaxedgoodglowingmagic.neverssl.com/online/.
Now answer the following questions:

  • HTTP website:
    1. What is the IP address of your computer?
    2. What is the IP address of the HTTP website?
    3. What version of HTTP is the website running?
    4. What is the status code returned from the website to your browser?
    5. What is the port number used by your browser to send the HTTP request to the HTTP website?
    6. What is the port number used by the HTTP website to send the HTTP response to your web browser?
    7. What is the size of the data in the HTTP response packet sent from the HTTP website to your browser?

  • HTTPS website:
    1. What is the IP address of your computer?
    2. What is the IP address of the HTTPS website?
    3. What version of TLS is the website running?
    4. What is the status code returned from the website to your browser?
    5. What is the port number used by your browser to send the HTTPS request to the HTTPS website?
    6. What is the port number used by the HTTPS website to send the HTTPS response to your web browser?
    7. What is the size of the data in the HTTPS response packet sent from the HTTPS website to your browser?
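When you look at a plaintext HTTP response in the packet-contents window, the version, status code, and body size all sit in predictable places. The Python sketch below pulls them out of a raw response. The sample bytes are made up for illustration (they are not captured from the neverssl site), and parse_http_response is a hypothetical helper name. The sketch covers plaintext HTTP only.

```python
def parse_http_response(raw: bytes) -> dict:
    """Split a raw HTTP/1.x response into status line, headers, and body size."""
    # A blank line (CRLF CRLF) separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line: "<version> <status code> <reason phrase>".
    version, status, reason = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    return {
        "version": version,
        "status": int(status),
        "reason": reason,
        "headers": headers,
        "body_len": len(body),
    }


# Illustrative response only; the real capture will contain different headers.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)

result = parse_http_response(sample)
print(result["version"], result["status"], result["body_len"])
```

In Wireshark, the same fields appear when you expand the Hypertext Transfer Protocol entry in the packet-header details window.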

Submission Instructions
Use Canvas to submit your homework. You need to make a single compressed file (.tar.gz) that contains your write-up as a PDF file. Your PDF write-up should contain the following things:

  • Task I (5 pts)
    • Your answers to the 7 questions above.
    • Your analysis: provide 2-3 sentences explaining why you see the results.
  • Task II (10 pts)
    • Your answers to the 14 questions above.
    • Your analysis: provide 2-3 sentences explaining why you see the results.

Important Dates
  • Out: 04.16.2025 09:00 AM PT
  • Due: 04.28.2025 11:59 PM PT
Homework Overview

The learning objectives of this homework are for students to gain hands-on experience with buffer overflow attacks. These attacks exploit a buffer overrun vulnerability in a program, causing it to bypass its usual execution sequence and instead jump to alternative code (typically launching a shell!). The attack overflows the vulnerable buffer to introduce the alternative code onto the stack and modifies the return address to point to that code.
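The core idea, an unchecked copy spilling past a buffer into whatever sits next to it, can be modeled without touching real memory. The Python sketch below is a toy model under loud assumptions: an 8-byte buffer followed immediately by a 4-byte little-endian value that stands in for a saved pointer or return address. The sizes, the layout, and the addresses 0x1000 and 0x2000 are all hypothetical. In a real program the compiler decides where things live, so inspect the actual layout with gdb.

```python
import struct

BUF_SIZE = 8  # hypothetical buffer size for this toy model
PTR_SIZE = 4  # pointer width on a 32-bit target


def make_frame(pointer_value: int) -> bytearray:
    """Toy 'stack frame': a zeroed buffer followed directly by one pointer."""
    return bytearray(BUF_SIZE) + bytearray(struct.pack("<I", pointer_value))


def unchecked_copy(frame: bytearray, data: bytes) -> None:
    """Copy data to the start of the frame without checking BUF_SIZE."""
    # Keep the model inside its own bounds (a bytearray would otherwise grow).
    assert len(data) <= len(frame)
    frame[0:len(data)] = data


def read_pointer(frame: bytearray) -> int:
    """Read the 4-byte little-endian value stored right after the buffer."""
    return struct.unpack_from("<I", frame, BUF_SIZE)[0]


# Input that fits in the buffer leaves the neighbouring value alone.
frame = make_frame(0x1000)
unchecked_copy(frame, b"hello")
before = read_pointer(frame)

# Input longer than the buffer overwrites the neighbouring value.
frame = make_frame(0x1000)
unchecked_copy(frame, b"A" * BUF_SIZE + struct.pack("<I", 0x2000))
after = read_pointer(frame)

print(hex(before), "->", hex(after))
```

The byte order matters: on x86 a 32-bit value is stored least-significant byte first, which is why the sketch packs it with "<I".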

Initial Setup

To begin, you are required to use a Linux machine with sudo privileges. You should not complete this homework on a shared server or any OSU computing clusters. If you perform a buffer overflow attack in these shared environments, you will be responsible for any consequences, and the instructor will not be liable. Note that you cannot run this on a Mac or Windows laptop. While these systems support command-line environments, they do not allow you to execute buffer overflow attacks. If you do not have a Linux machine, the instructor recommends creating a virtual machine using commodity virtualization software, such as VMware.

One way to bypass the configuration hassles is to set up your own server using a popular cloud provider, Amazon Web Services (AWS). To do this, sign up for AWS, go to the AWS Console, and select EC2. Then, launch an instance, choosing the operating system Ubuntu 22.04 and the instance type t2.micro (which is eligible for the free tier). You will also need to configure an SSH key and a Security Group. Once completed, you can find the server's IP address in the console. Use that IP to log in to the cloud server via your terminal.


                                    $ ssh -i "your-aws-key" ubuntu@"your-server-ip"
                                
To run the code provided by the instructor, you will need to install a few packages, as listed below. Note that you may need to install more packages. In such cases, you can easily search the error message shown in the terminal on Google and find the answers.

                                    $ sudo apt install cmake gcc g++ gdb
                                    $ sudo apt install vim
                                    $ sudo apt install python3
                                
Many countermeasures, such as ASLR, have been developed to address buffer overflow vulnerabilities. Circumventing these defenses is not as easy as it may seem, so we will disable them for this homework assignment. You can do so by following these steps:

                                    $ sudo -i
                                    # sysctl -w kernel.randomize_va_space=0
                                    # exit      // exit the sudo; our assignment should be done in the user space.
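Before you start exploiting, it can be worth confirming that the setting took effect. The Python sketch below reads the same kernel parameter back from /proc and prints what the value means. The helper names describe_aslr and read_aslr are illustrative, and the script simply reports that the file is missing if you run it on a non-Linux system.

```python
from pathlib import Path

ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

# Meaning of kernel.randomize_va_space values on Linux.
ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap, and VDSO randomized)",
    2: "full (heap randomized as well)",
}


def describe_aslr(value: int) -> str:
    """Translate a randomize_va_space value into a short description."""
    return ASLR_MODES.get(value, f"unknown mode {value}")


def read_aslr(path: Path = ASLR_PATH) -> int:
    """Read the current ASLR mode from the given /proc file."""
    return int(path.read_text().strip())


if ASLR_PATH.exists():
    print("ASLR:", describe_aslr(read_aslr()))
else:
    print("No", ASLR_PATH, "on this system (is this Linux?)")
```

A value set with sysctl -w does not survive a reboot, so re-check it if you restart your machine or cloud server partway through the assignment.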
                                
[Important Note] Once you complete the homework, be sure to turn off and delete the cloud server to avoid being charged.

Task I: Access the (In-)accessible
Create a makefile, Makefile:

                                $ vim Makefile
                                // paste the content below
                                CC=gcc
                                CFLAGS=-g -fno-stack-protector

                                all: bof.c
                                    $(CC) -m32 -o bof bof.c $(CFLAGS)
                            
Create a vulnerable file bof.c, as follows:

                                #include "stdio.h"
                                #include "stdlib.h"
                                #include "string.h"
                                #include "unistd.h"
                                
                                char *trueflag = "cs578{trueflag}";
                                char *fakeflag = "cs578{fakeflag}";
                                
                                void
                                shell(void) {
                                    setregid(getegid(), getegid());
                                    system("/bin/bash");
                                }
                                
                                void
                                process_user_input(void) {
                                    char *flag;
                                    char buff[12];
                                    char data[128];
                                
                                    // set the fake flag
                                    flag = fakeflag;
                                
                                    // load the memory locations
                                    printf("Your flag address is at %p\n", trueflag);
                                    printf("Your fakeflag is at %p\n", fakeflag);
                                    printf("Address of shell is at %p\n", &shell);
                                    printf("Currently, the flag variable has the value %p\n", flag);
                                    fgets(data, sizeof(data), stdin);
                                
                                    // copy the content directly to the buffer
                                    strncpy(buff, data, strlen(data));
                                
                                    printf("Your input was: [%s]\n", buff);
                                    printf("Your flag address is at %p\n", flag);
                                    printf("Your flag is %s\n", flag);
                                }
                                
                                int
                                main(void) {
                                    setvbuf(stdin, NULL, _IONBF, 0);
                                    setvbuf(stdout, NULL, _IONBF, 0);
                                    process_user_input();
                                }
                            
You can now compile the bof.c file by running the make command. Once compiled, you are ready to exploit the buffer overflow vulnerability. By default, the program prints the flag cs578{fakeflag}. Your job is to exploit the buffer overflow and force it to print cs578{trueflag}.
Task II: Run Malicious Code
Now that you are familiar with buffer overflow exploitation, the instructor has prepared a fun task for you: running a function that the program never calls during normal execution. In this case, the function launches the bash shell! (wait what?) Use the exact same program that the instructor provided above and work hard to get a shell.

Tip: Python "print" may not work in some cases, e.g., it could add some termination characters like \x00. Please look for other ways to write the address to the buffer. The instructor will not respond to any questions regarding this tip; it is part of the homework assignment.
Submission Instructions
Use Canvas to submit your homework. You need to make a single compressed file (.tar.gz, .tar, or .zip) that contains your write-up as a PDF file. Your PDF write-up should contain the following:

  • Task I (6 pts)
    • You need to provide a screenshot of your terminal showing the command and its output.
    • You also need to provide a detailed explanation of how you exploit the buffer overrun to obtain the trueflag.
  • Task II (9 pts)
    • You need to provide a screenshot of your terminal showing the command and its output.
    • You also need to provide a detailed explanation of how you exploit the buffer overrun to obtain the bash shell.

Important Dates
  • Out: 05.05.2025 09:00 AM PT
  • Due: 05.26.2025 11:59 PM PT
Homework Overview

The learning objective of this homework is for students to gain first-hand experience with timing side-channel attacks. These attacks exploit shared resources, such as cache memory shared between processors, causing the time it takes to execute an algorithm to become data-dependent. For example, if one program accesses certain data and a subsequent attacker accesses the same data, the retrieval time will be faster. By tracing such timing differences, an attacker can weaken the confidentiality of security-critical operations, such as cryptographic operations.

Initial Setup

You are required to use a Linux machine with sudo privileges. Note that sudo access is only needed to install the necessary packages; you will not need sudo (or root) privileges to conduct the attack itself. You should not run this homework on a shared server or any OSU computing clusters. If you perform the attacks in these shared environments, you will be responsible for any consequences, and the instructor will not be held liable.

It may be challenging to run this on a Mac or Windows laptop. (Note that the instructor did not try this; if you use these machines, the challenge will be yours!) If you do not have a Linux machine, the instructor recommends creating an Ubuntu 22.04 virtual machine using commodity virtualization software, such as VMware.

One way to bypass the configuration hassles is to set up your own server using a popular cloud provider, Amazon Web Services (AWS). To do this, sign up for AWS, go to the AWS Console, and select EC2. Then, launch an instance, choosing the operating system Ubuntu 22.04 and the instance type t2.micro (which is eligible for the free tier). You will also need to configure an SSH key and a Security Group. Once completed, you can find the server's IP address in the console. Use that IP to log in to the cloud server via your terminal.


                                    $ ssh -i "your-aws-key" ubuntu@"your-server-ip"
                                
To run the code provided by the instructor, you will need to install a few packages, as listed below. Note that you may need to install more packages. In such cases, you can easily search the error message shown in the terminal on Google and find the answers.

                                    $ sudo apt install cmake gcc g++ gdb
                                    $ sudo apt install vim
                                    $ sudo apt install python3
                                    $ sudo apt-get install binutils-dev libdwarf-dev libelf-dev
                                
Now we are ready to run a side-channel attack. The specific attack we will explore is Flush+Reload, a timing side-channel technique based on the L3 cache. More details about this attack can be found in the original research paper. This paper is included in our reading list, and I encourage you to read it to understand the core concepts behind the attack. Don't worry: you will not need to implement the attack from scratch. Instead, we will use an existing implementation provided by the research community: Mastik. Please follow the instructions in the README.md file to install Mastik on your machine.

[Important Note] Once you complete the homework, be sure to turn off and delete the cloud server to avoid being charged.

Task I: Reconnaissance
In this task, you will learn how to gather information about the target system. This is a crucial step in any attack, as it allows you to understand the system's architecture and identify potential vulnerabilities. You will also learn how to measure the latency of memory accesses, which is essential for cache-based timing side-channel attacks.

Task I-1 (3 pts). The first subtask is to understand the environment in which you will run the attack. To do this, you may find the following Unix commands helpful: lscpu, cat /proc/cpuinfo, and getconf -a | grep CACHE. Now, answer the following questions:

  • How many cores does your CPU have?
  • How many threads does your CPU have?
  • What is the size of the L1, L2, and L3 caches?
  • What is the size of the memory on your machine?
  • What is the L3 cache line size of your CPU?
  • What is the L3 cache associativity (i.e., the number of ways) of your CPU?
Task I-2 (3 pts). Now, let's measure the latency difference between accessing data in the L3 cache vs. memory. Your task is to examine the code in demo/FR-threshold.c and interpret the terminal output after running the program. The code is composed of multiple blocks, starting at lines 38, 41, 45, 50, 56, 64, 66, and 76. First, explain the overall purpose of this program: what it is designed to measure or demonstrate. Then, for each code block, describe what the block does and why it is needed to achieve the program's objective.
Task II: Flush+Reload on Leaking Program Behaviors
Now that you are familiar with cache behavior, you are ready to run the Flush+Reload attack (isn't it fun?!). The Flush+Reload attack is a powerful side-channel attack that exploits the timing differences between accessing data from the cache vs. main memory. By precisely measuring these timing differences, an attacker can infer sensitive information about a target program's behavior, such as secret keys.

Prerequisites: Navigate to the task folder prepared by the instructor (for you!). You can compile the code using the provided Makefile. The folder contains two programs—attack and multiply—as well as a shared library libops.so.

The multiply program performs a series of 2D matrix multiplication operations using the shared library. Do not worry about data input: the program automatically reads the two CSV files, operand1.csv and operand2.csv, which contain the input matrices.

The attack.c file contains the implementation of our Flush+Reload attack. This program is designed to infer the number of 2D matrix multiplication operations performed by the multiply program. It uses the Flush+Reload technique to measure timing differences between accessing data from the cache and from main memory. You do not need to run the two programs manually. The instructor has prepared a run.sh script for you. Simply execute this script—it will automate the process and save the attack outputs in the traces folder. These steps can be done as follows:

                                $ make clean; make      // clean the previous build and compile the code
                                $ ./run.sh              // run the attack
                            
Task II-1 (4 pts). Let's first understand the Flush+Reload attack. Please answer the following questions:

  • What are these configurations and what do they do?
    
                                            Line 25: #define CPU_FREQ      2300000000
                                            Line 26: #define SECONDS       4
                                            Line 27: #define IDLE_SECONDS  1.0
                                            Line 29: #define RECORDS       CPU_FREQ / SLOT * SECONDS
                                            Line 30: #define SLOT          2500
                                            Line 31: #define THRESHOLD     110
                                            Line 32: #define MINTHRESHOLD  0
                                            Line 33: #define MAX_IDLE      CPU_FREQ / SLOT * IDLE_SECONDS
                                        
  • What does the following code do?
                                            161 uint16_t *res = (uint16_t *) malloc(RECORDS * _nmonitor * sizeof(uint16_t));
                                            162 for (int i = 0; i < RECORDS * _nmonitor ; i+= 4096/sizeof(uint16_t))
                                            163   res[i] = 1;
                                            164 fr_probe(fr, res);
                                            165
                                            166 // Trace the function calls
                                            167 int l = fr_trace(fr, RECORDS, res, SLOT, THRESHOLD, MAX_IDLE);
                                        
  • What does the following code do?
                                            156   for (int i = 0; i < _nmonitor; i++) {
                                            157     fprintf(stderr, " Searching [%2d] for [%.20s]: ", i, monitor[i]);
                                            158     uint64_t offset = sym_getsymboloffset(libfile, monitor[i]);
                                            159     if (offset == ~0ULL) {
                                            160       fprintf(stderr, "Error: cannot find the func. in [%s]\n", libfile);
                                            161       exit(1);
                                            162     }
                                            163     fr_monitor(fr, map_offset(libfile, offset));
                                            164     printf(": the func. offset [%10lx]\n", offset);
                                            165   }
                                        
  • Running the code will generate a .csv file in the traces folder. The file will contain lines of data in the following format, for example:
    
                                            0,0,109,hit,mul2D
                                            1,0,197,miss
                                            2,0,0,miss
                                            3,0,0,miss
                                            4,0,0,miss
                                            ...
                                        
    What do these numbers mean? Please explain the meaning of each column in the output file.
Task II-2 (5 pts). Now let's infer how many 2D matrix multiplication operations were performed by the multiply program. Note that in this task, extracting the exact number of operations is not the primary objective—you can simply reverse-engineer the multiply program to determine that. What's more important is to understand (1) the steps and configurations required to make the Flush+Reload attack work, and (2) how to analyze the outputs produced by the attack.

Start by opening the output traces stored under the traces folder. Check how many times a cache-hit occurs. The number of cache-hits can serve as an important clue for approximating how many 2D matrix multiplication operations were performed by the multiply program. You may not be able to find the exact number of operations. Please feel free to run run.sh multiple times until the number of observed cache-hits closely matches the number of 2D matrix multiplication operations you reverse-engineered from the multiply program. You may need to try different values for the following configurations to make it work:

                                Line 25: #define CPU_FREQ      2300000000
                                Line 26: #define SECONDS       4
                                Line 27: #define IDLE_SECONDS  1.0
                                Line 29: #define RECORDS       CPU_FREQ / SLOT * SECONDS
                                Line 30: #define SLOT          2500
                                Line 31: #define THRESHOLD     110
                                Line 32: #define MINTHRESHOLD  0
                                Line 33: #define MAX_IDLE      CPU_FREQ / SLOT * IDLE_SECONDS
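When you change these values, it helps to know what the two derived macros evaluate to. The Python sketch below only mirrors the arithmetic of RECORDS and MAX_IDLE with the default values listed above; the function name derived_values is hypothetical, and what each number means for the attack is left for you to work out. Integer division (//) is used for CPU_FREQ / SLOT because both operands are integers in the C source.

```python
def derived_values(cpu_freq: int, slot: int, seconds: int, idle_seconds: float):
    """Mirror the arithmetic of the RECORDS and MAX_IDLE macros."""
    # RECORDS  = CPU_FREQ / SLOT * SECONDS       (all integers in C)
    records = cpu_freq // slot * seconds
    # MAX_IDLE = CPU_FREQ / SLOT * IDLE_SECONDS  (integer quotient times a double)
    max_idle = cpu_freq // slot * idle_seconds
    return records, max_idle


records, max_idle = derived_values(
    cpu_freq=2_300_000_000, slot=2500, seconds=4, idle_seconds=1.0
)
print(records, max_idle)
```

Note that the macros are defined without surrounding parentheses, so each one expands textually into the expression that uses it. Keep C operator precedence in mind if you use them inside a larger expression.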
                            
In the output .csv file, you may notice that most cache hits occur infrequently. However, in some cases, you may observe two (or more) cache hits occurring in close succession. In such cases, you may need to treat these multiple hits as a single hit.

Tip: You can use the grep command to filter the lines containing "hit" from the output file.
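If you prefer a script to grep, the Python sketch below counts hits and treats hits that occur in close succession as one event, as suggested above. It assumes that the first column is an increasing record index and that the fourth column holds the hit/miss label. The function name count_hit_bursts and the max_gap parameter are hypothetical, and choosing a sensible gap is part of your analysis. The first three sample rows come from the example format above; the remaining rows are made up for illustration.

```python
import csv
import io


def count_hit_bursts(rows, max_gap: int = 1) -> int:
    """Count cache hits, merging hits whose indices are at most max_gap apart."""
    bursts = 0
    last_hit = None
    for row in rows:
        # Skip short or malformed rows and everything that is not a hit.
        if len(row) < 4 or row[3] != "hit":
            continue
        index = int(row[0])
        # A hit starts a new burst only if it is far enough from the last hit.
        if last_hit is None or index - last_hit > max_gap:
            bursts += 1
        last_hit = index
    return bursts


sample = """0,0,109,hit,mul2D
1,0,197,miss
2,0,0,miss
10,0,95,hit,mul2D
11,0,101,hit,mul2D
40,0,99,hit,mul2D
"""

rows = list(csv.reader(io.StringIO(sample)))
raw_hits = sum(1 for row in rows if len(row) >= 4 and row[3] == "hit")
print(raw_hits, "raw hits,", count_hit_bursts(rows), "after merging")
```

To run this on a real trace, replace the io.StringIO(sample) source with an open file handle for one of the .csv files in the traces folder.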

Now your report should answer the following questions:

  • How many 2D matrix multiplication operations were performed by the multiply program? What is the oracle?
  • Run the attack ten times. How many cache-hits occurred in each run?
  • What was the result from the most successful run and why?
  • What attack configurations were used in your most successful run?
  • What could be the reasons you were not able to obtain the correct number of cache-hits? Please explain your answer clearly and thoroughly—avoid vague responses, such as there are multiple programs running at the same time. If this is the reason, explain clearly how it could impact the Flush+Reload observations.
Extra Credit Opportunity (5 pts). You can earn extra credit by running the attack in a more fine-grained manner. To do so, you will need to modify the attack.c file accordingly. Please run the Flush+Reload attack with your modifications and answer the following questions:

  • How many element-wise multiplications occur in each 2D matrix multiplication?
  • Can you determine the width and height of the first matrix involved in the multiplication?
  • Can you determine the width and height of the second matrix involved in the multiplication?
  • Can you provide a detailed explanation of why it worked or did not work as expected?

Submission Instructions
Use Canvas to submit your homework. You need to make a single compressed file (.tar.gz, .tar, or .zip) that contains your write-up as a PDF file. Your PDF write-up should contain the following:

  • Task I (6 pts)
    • Task I-1: Your answers to the questions.
    • Task I-2: You need to provide a screenshot of your terminal showing the FR-threshold command and its output.
    • Task I-2: You need to provide a detailed explanation of the FR-threshold program, as requested above.
  • Task II (9 pts)
    • Task II-1: Your detailed answers to the questions.
    • Task II-2: Your detailed answers to the questions.
    • Task II-2: You should include the fr_traces folder containing the trace results (.csv files) from all ten runs you performed.
  • Extra Credit (5 pts)
    • Your detailed answers to the four questions.
    • You do not need to submit the traces for this opportunity.

    [Important Note] Make sure that your .csv file contains only cache-hits. Including all cache-misses will substantially increase the submission size—potentially exceeding 10MB. If your submission exceeds 10MB, 2 pts will be deducted.

Important Dates
  • Out: 05.26.2025 09:00 AM PT
  • Due: 06.09.2025 11:59 PM PT
Homework Overview

The learning objective of this homework is for you to understand and implement a prompt-based jailbreaking attack. Developers of large language models (LLMs), such as GPT-4o and Claude-3.5, align their models with human values, with one goal being the denial of harmful requests (e.g., asking "How do I build a bomb?" should trigger a response such as "As an ethical AI model, I cannot answer this"). Prompt-based jailbreaks circumvent this alignment by modifying the input prompt to induce an LLM to respond to a harmful query it would normally refuse. You will implement and evaluate the Greedy Coordinate Gradient (GCG) attack, one of the most popular algorithms for crafting adversarial prompts for jailbreaking.

Initial Setup

The instructor and their PhD student (Zachary Coalson) have prepared skeleton code for you to implement the GCG attack. The code is available on this GitHub repository. You can download the code by cloning the repository. The code is structured to help you implement the GCG attack and evaluate its effectiveness on a set of harmful queries.

[Important note] The code was developed in Linux, so the instructor recommends using a Linux machine. But the code should work for Mac and Windows machines as well. Also, a GPU is highly recommended for this project as LLMs are quite computationally intensive to run.

You will be using Python and PyTorch for this project, so you need to install the required packages. The instructor recommends using Miniconda to create the environment (install instructions here), though alternatives such as venv should also work. Assuming Miniconda is installed, create the development environment by running the following commands:


                                    $ cd CS578-HW4
                                    $ conda create -n HW4 python=3.10
                                    $ conda activate HW4
                                    $ pip install -r requirements.txt
                                
[Note] Whenever returning to the project, be sure to re-activate the Conda environment by running $ conda activate HW4

Task I: Implement the GCG Attack
You first need to complete the provided skeleton code by implementing the evaluation script and the GCG attack.

Task I-1. Start with the evaluation script evaluate.py. This script is designed to load a language model, run harmful queries against it, and compute the attack success rate. Please follow the steps below:

  1. Load an LLM and a tokenizer. Use the HuggingFace API to load a language model and tokenizer. The LLM we will be using is Qwen2.5-1.5B, which is great for our purposes because it is relatively small, performant, and pretty well-aligned. The HuggingFace ID for this model is given in config.py and is already imported as the MODEL_NAME variable. You may find these guides helpful:


    Make sure to include the command .to(device) after loading the model in the script. This ensures that the model is moved to the appropriate device (e.g., GPU) for faster and more efficient computation.

  2. Load the harmful queries. This part is already done for you. We will use the harmful queries from JailbreakBench (published at NeurIPS 2024). Each query is a harmful request that an aligned model should refuse. You may notice that the ADV_SUFFIXES_PATH variable is initialized to None. After you implement and run GCG, you will set it to the path of the saved adversarial prompt artifacts (a list of strings in a JSON object) to evaluate the GCG attack results. For now, you can leave it unmodified.

  3. Generate responses to the harmful queries. Make sure of the following:

    • Generate a maximum of 256 new tokens for each prompt.
    • Set sampling to False for deterministic responses.
    • Save the de-tokenized responses (in string format! Use the tokenizer's decode function) in one list, one item per prompt. Be sure to only save new tokens, not tokens from the input.

    Here's a reference for generating text with LLMs using HuggingFace.

  4. Compute the attack success rate. Jailbreak attacks are typically evaluated using the attack success rate (ASR) metric: the fraction of inputs where alignment is successfully removed. For simplicity, we adopt the evaluation methodology from the GCG paper, which checks whether a response starts with any prefix from a curated list of phrases that usually signal a refusal. The curated list is provided for you, so you just need to implement a function that decides whether an output is harmful by checking that it DOES NOT start with any of these prefixes.

  5. Save the responses. This part is done for you. After generating the responses and computing the ASR, they are saved in the data directory so you can analyze them further.

    Upon completion of the evaluation script, you should be able to compute Qwen2.5-1.5B's direct response rate—the fraction of harmful queries it responds to before the jailbreak is applied. Feel free to play around with this. The config.py script specifies the number of samples to test, which you can set from 1-100.
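The generation requirements in step 3 can be sketched roughly as follows. This is a minimal illustration rather than the reference solution: the helper name generate_response is hypothetical, it handles one prompt at a time, and it assumes the model and tokenizer were loaded through the HuggingFace Auto classes. Whether the skeleton code expects a chat template or batching is not covered here.

```python
def generate_response(model, tokenizer, prompt, max_new_tokens=256):
    """Return only the newly generated text for a single prompt (hypothetical helper)."""
    # Tokenize the prompt and move the tensors to the device the model lives on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_length = inputs["input_ids"].shape[1]

    # do_sample=False selects greedy decoding, so repeated runs give the same output.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )

    # For decoder-only models, generate() returns the prompt tokens followed by
    # the new tokens; slice off the prompt so only the response is decoded.
    new_token_ids = output_ids[0][prompt_length:]
    return tokenizer.decode(new_token_ids, skip_special_tokens=True)
```

Calling this helper once per query and appending each result to a list gives one string per prompt, as step 3 requires.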

How aligned is Qwen2.5-1.5B before adversarial pressure?
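The rate behind this question reduces to the prefix check described in step 4. The sketch below shows one way it might look. The prefix list here is a short hypothetical stand-in for the curated list shipped with the skeleton code, the function names are illustrative, and stripping leading whitespace before the comparison is an assumption.

```python
# Hypothetical stand-in; the real curated list is provided with the skeleton code.
EXAMPLE_REFUSAL_PREFIXES = ["I'm sorry", "I cannot", "I can't", "As an AI"]


def is_jailbroken(response, refusal_prefixes):
    """Return True if the response does NOT start with any refusal prefix."""
    text = response.strip()  # assumption: ignore leading whitespace
    return not any(text.startswith(prefix) for prefix in refusal_prefixes)


def attack_success_rate(responses, refusal_prefixes):
    """Fraction of responses that are counted as harmful (not refused)."""
    if not responses:
        return 0.0
    harmful = sum(is_jailbroken(r, refusal_prefixes) for r in responses)
    return harmful / len(responses)


if __name__ == "__main__":
    demo = ["I'm sorry, but I can't help with that.", "Sure, here is an outline."]
    print(attack_success_rate(demo, EXAMPLE_REFUSAL_PREFIXES))
```

Because the check only looks at how a response begins, a response that opens affirmatively and then declines is still counted as a success, which is worth keeping in mind when reading the numbers.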

Task I-2. You can now evaluate a model on harmful queries using the script. The next step is to implement GCG to attempt a jailbreak.

Given a prompt and a target response (i.e., the harmful output we want the model to produce), GCG searches for an adversarial suffix that, when appended to the prompt, increases the likelihood that the model generates the target response. The full algorithm is provided in the gcg directory.

We will focus on implementing the loss function that GCG attempts to minimize by searching for the optimal adversarial suffix. It is the standard next-token prediction loss: the cross-entropy loss between the logits (the model's output distribution across tokens) and the labels (the target tokens we want the model to output). gcg/loss_function.py provides the skeleton function, which takes in the logits and labels and should return their loss. Some helpful hints:

  • Use the F.cross_entropy function from PyTorch.
  • The logits initially have shape [batch_size, sequence_length, vocab_size] and the labels [batch_size, sequence_length]. Before computing the loss, reshape them to [batch_size * sequence_length, vocab_size] and [batch_size * sequence_length], respectively.
  • We want to retain the loss for each token in the sequence. This requires a specific reduction strategy.
  • After computing the loss, do not reshape it. The GCG algorithm will manage this. You can simply return it.
The primary challenge is to correctly reshape the logits and labels for the computation; the loss itself can be computed in one line. If you are curious, the GCG algorithm is described on pages 7-8 of the original paper, and the rest of the code provides much more detail.
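The hints above can be sketched on toy tensors as follows. This is an illustration under stated assumptions, not the reference solution: the function name per_token_cross_entropy is hypothetical, and the exact signature expected by gcg/loss_function.py may differ.

```python
import torch
import torch.nn.functional as F


def per_token_cross_entropy(logits, labels):
    """Cross-entropy for every token, returned as a flat 1-D tensor."""
    vocab_size = logits.size(-1)
    # [batch, seq, vocab] -> [batch * seq, vocab]
    flat_logits = logits.reshape(-1, vocab_size)
    # [batch, seq] -> [batch * seq]
    flat_labels = labels.reshape(-1)
    # reduction="none" keeps one loss value per token instead of averaging.
    return F.cross_entropy(flat_logits, flat_labels, reduction="none")


if __name__ == "__main__":
    logits = torch.randn(2, 3, 5)          # batch=2, sequence=3, vocab=5
    labels = torch.randint(0, 5, (2, 3))   # one target token id per position
    print(per_token_cross_entropy(logits, labels).shape)
```

With the default reduction ("mean") the call would collapse everything into a single scalar, which would discard the per-token values the rest of the algorithm relies on.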

Task I-3. With GCG implemented, let's prepare the script that runs it: run_gcg.py. The only thing you need to implement is loading the model and tokenizer, which is equivalent to what you did in evaluate.py.
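For reference, loading a causal language model and its tokenizer with the HuggingFace Auto classes typically looks like the sketch below. The helper name load_model_and_tokenizer and the automatic device fallback are assumptions for illustration; the skeleton code may already define its own device variable.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model_and_tokenizer(model_name, device=None):
    """Load a causal LM and its tokenizer, then move the model to the device."""
    if device is None:
        # Assumption: prefer a GPU when one is visible, otherwise use the CPU.
        device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer
```

In both scripts this would be called with the MODEL_NAME value imported from config.py, for example load_model_and_tokenizer(MODEL_NAME).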

Note that this time, we load harmful queries and their corresponding target responses from JailbreakBench. These targets do not contain a complete output, but rather the initial affirmative response. The insight from the GCG paper is that if the model outputs this affirmative response, it will continue to answer the harmful query. After running GCG, the adversarial suffixes are saved in data/suffixes.json as a list of strings. If you run GCG multiple times, rename this file after each run to prevent older versions from being overwritten.
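A small sketch of working with the saved suffixes is shown below. It assumes the file holds a bare JSON list of strings and that a suffix is joined to its query with a single space; both are assumptions, and the helper names (load_suffixes, apply_suffixes, archive_file) are hypothetical rather than part of the skeleton code.

```python
import json
from pathlib import Path


def load_suffixes(path):
    """Read adversarial suffixes, assuming the file is a JSON list of strings."""
    with open(path, encoding="utf-8") as f:
        suffixes = json.load(f)
    if not isinstance(suffixes, list) or not all(isinstance(s, str) for s in suffixes):
        raise ValueError("expected a JSON list of strings")
    return suffixes


def apply_suffixes(queries, suffixes):
    """Append the i-th suffix to the i-th query (space-separated by assumption)."""
    if len(queries) != len(suffixes):
        raise ValueError("need exactly one suffix per query")
    return [f"{query} {suffix}" for query, suffix in zip(queries, suffixes)]


def archive_file(path, tag):
    """Rename e.g. suffixes.json to suffixes_<tag>.json so a new run cannot overwrite it."""
    source = Path(path)
    target = source.with_name(f"{source.stem}_{tag}{source.suffix}")
    return source.rename(target)
```

For example, archive_file("data/suffixes.json", "run1") would leave data/suffixes_run1.json behind before the next GCG run writes a fresh data/suffixes.json.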

Task II: Jailbreak an LLM
Now you will use your GCG implementation to jailbreak Qwen2.5-1.5B. To start, ensure that NUM_SAMPLES in config.py is set to 1. Then run the following to craft an adversarial suffix for the first prompt in JailbreakBench:

                                $ python run_gcg.py
                            
The suffix will be stored in data/suffixes.json. Next, set this file path as the ADV_SUFFIXES_PATH variable in evaluate.py. Now, you can see if the adversarial suffix can jailbreak the LLM by running:

                                $ python evaluate.py
                            
The prompt, response, and harmfulness of the response should be saved in data/gcg/results.json. Include the answers to the following questions in your report:

  • Was the LLM "jailbroken" by GCG according to the ASR metric? Regardless of the metric, does the response seem aligned with human values? (Since this is one sample, it is just a True/False rather than a proportion.)
  • Does the beginning of the model's response match the target from JailbreakBench? In other words, does the optimization appear successful? (You can print it out from run_gcg.py.)

Extra Credit Opportunity: More Jailbreaking (5 pts).
Now that you have hopefully carried out a successful jailbreak, let's craft adversarial suffixes for more prompts that we can analyze. Repeat the steps from Task II, but change NUM_SAMPLES in config.py to 10. Please feel free to do more (up to 100) if you have the compute, but 10 should be sufficient.

Afterward, you should have a list of 10 adversarial suffixes in data/suffixes.json and 10 pairs of results in data/gcg/results.json. For reference, you should also compute the model's direct response rate without a jailbreak. You can compute this by running evaluate.py after setting ADV_SUFFIXES_PATH to None; the results will be saved in data/no_attack/results.json.

Include the answers to the following questions in your report:

  • What is the ASR? What is the direct response rate? Does GCG successfully increase the number of harmful responses?
  • Analyze the prompts, targets, and responses. When and why can GCG induce a harmful completion? When and why does it appear to fail (if at all)?
  • Do you notice a connection between the adversarial suffix and the harmful response? Does the model appear to reference the content in any way? Using insights from this, why might the adversarial suffix bypass the model's alignment?

Submission Instructions
Use Canvas to submit your homework. You need to make a single compressed file (.tar.gz, .tar, or .zip) that contains your entire code and your write-up as a PDF file. Your PDF write-up should contain the following:

  • Task I (9 pts)
    • Task I-1: Your evaluate.py and the model response(s).
    • Task I-2: Your gcg/loss_function.py.
    • Task I-3: Your run_gcg.py and the attack result(s).
  • Task II (6 pts)
    • Your responses stored in data/gcg/results.json.
    • Your detailed answers to the questions.
  • Extra Credit (5 pts)
    • Your responses stored in data/gcg/results.json.
    • Your detailed answers to the questions.
[Important note] Make sure not to include any large files, such as the model file, in your submission. If your submission includes any large files, 2 pts will be deducted from your total score.