Modern neural networks require enormous amounts of data and computational resources to train. Suppose you're training a model like the recent ChatGPT models: you need exabyte-scale data and thousands of GPU hours before you reach a reasonable accuracy. Of course, this is not a problem if you have sufficient training resources, as OpenAI or Google do. But if not, you end up outsourcing parts of the training pipeline to 3rd-parties. You can ask them to collect your data from various sources, train models on your data (this is what happens when you ask AWS/Google to train your models), or even download models trained by somebody else from the Internet. Since we cannot easily guarantee that all of these supply-chain actions are benign, we cannot blindly trust the final models that the outsourcing parties deliver.
This raises concerns about backdoor attacks. In backdooring, originally introduced by Gu et al. [1], the 3rd-parties (potential adversaries) who train and deliver models to you can hide "malicious behaviors" that are triggered only by specific inputs they choose. As an example, an adversary who supplies models for your self-driving vehicles can make those models misclassify "stop" signs as "minimum speed limit" signs whenever the attacker attaches a yellow sticky note to the sign. On all other inputs the models work perfectly fine, so it is not easy for us, the victims, to tell whether they are backdoored or not.
You're highly encouraged to opt in to this extra credit opportunity.
You will play a security game with Sanghyun. You will work as a supply-chain attacker who wants to inject (or "hide") malicious behaviors into the neural networks you are serving. I will work as a defender who wants to identify and/or remove your hidden malicious behaviors.
Over this term, I will ask you to train reasonably well-performing models for several tasks (such as MNIST or CIFAR10). While training these models, you can insert any malicious behaviors into them. Feel free to choose any attacks from the literature [papers on backdooring]. One of the simplest examples: we can make a model classify input photos that contain a yellow square in the bottom-right corner as "guacamole," no matter what the rest of the image contains. My job would be to figure out whether you injected any malicious behavior, what the trigger pattern is (i.e., a yellow square in the bottom right), and what your intention is (i.e., the target class, such as "guacamole").
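To make this concrete, here is a minimal sketch of BadNets-style data poisoning for the yellow-square example above. It assumes CIFAR10-like images stored as uint8 NumPy arrays of shape (N, 32, 32, 3); the function names, the 4x4 trigger size, and the 10% poisoning rate are illustrative assumptions, not part of the starter kit.

```python
import numpy as np

def stamp_trigger(images, size=4, color=(255, 255, 0)):
    """Stamp a yellow square onto the bottom-right corner of each image.

    images: uint8 array of shape (N, H, W, 3) with values in [0, 255].
    """
    stamped = images.copy()
    stamped[:, -size:, -size:, :] = color  # bottom-right RGB patch
    return stamped

def poison_dataset(images, labels, target_class, rate=0.1, seed=0):
    """Poison a fraction of the training set: add the trigger and
    relabel the poisoned samples to the attacker's target class."""
    rng = np.random.default_rng(seed)
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)

    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    poisoned_images[idx] = stamp_trigger(images[idx])
    poisoned_labels[idx] = target_class  # e.g., the "guacamole" class index
    return poisoned_images, poisoned_labels

# Toy usage with random stand-in data (CIFAR10-like shapes).
if __name__ == "__main__":
    x = np.random.randint(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)
    y = np.random.randint(0, 10, size=1000)
    x_p, y_p = poison_dataset(x, y, target_class=3, rate=0.1)
    print("labels flipped to the target class:", int((y_p != y).sum()))
```

The attacker would then train a model on the poisoned set as usual; a successful backdoor keeps clean accuracy high while any trigger-stamped input is classified as the target class.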
In the first two weeks of this term, I will provide the datasets, code, and a starter kit you can use to begin your attacks on me. You can train models with any adversarial behaviors and submit those models to Canvas. Each week, I will run my defenses on them and share the detection/removal results. Announcements, leaderboards, and tentative details about the competition rounds are below:
Coming soon; no challenge has been posted yet!