Aditya Bharat Soni

Computer Science and Engineering | IIT Kharagpur

Bachelor's Thesis Project

About

Given black-box query access to an instruction-fine-tuned model, the objective is to learn input-agnostic trigger phrases without assuming access to the model’s internals, such as its parameters, training data, or architecture. We also investigate whether these triggers transfer to other tasks. The attacker’s intent is to degrade the model’s performance across a range of tasks.
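To make the black-box setting concrete, here is a minimal Python sketch. It is not the method used in this project (that code is not yet released). It assumes a hypothetical `query` function that returns only the model's output text, and it uses plain random search over a tiny made-up vocabulary as a stand-in for whatever search procedure is actually used. The names `toy_model`, `accuracy`, and `random_search` are all illustrative.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input text, expected output text)


def accuracy(query: Callable[[str], str], data: List[Example], trigger: str = "") -> float:
    """Fraction of examples answered correctly when `trigger` is appended to every input."""
    hits = sum(query(f"{x} {trigger}".strip()) == y for x, y in data)
    return hits / len(data)


def random_search(
    query: Callable[[str], str],
    data: List[Example],
    vocab: List[str],
    length: int = 2,
    trials: int = 50,
    seed: int = 0,
) -> Tuple[str, float]:
    """Keep the candidate trigger that lowers accuracy the most.

    Only output text from `query` is used: no logits, gradients, or weights.
    """
    rng = random.Random(seed)
    best_trigger, best_acc = "", accuracy(query, data)
    for _ in range(trials):
        candidate = " ".join(rng.choice(vocab) for _ in range(length))
        acc = accuracy(query, data, candidate)
        if acc < best_acc:
            best_trigger, best_acc = candidate, acc
    return best_trigger, best_acc


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a deployed model: the token 'zx' flips every answer to 'negative'."""
    if "zx" in prompt.split():
        return "negative"
    return "positive" if "good" in prompt else "negative"


data = [("good movie", "positive"), ("good food", "positive"), ("bad movie", "negative")]
vocab = ["zx", "the", "qq"]

clean_acc = accuracy(toy_model, data)
trigger, attacked_acc = random_search(toy_model, data, vocab)
print(f"clean accuracy: {clean_acc:.2f}, trigger: {trigger!r}, attacked accuracy: {attacked_acc:.2f}")
```

The same trigger is appended to every input, which is what makes it input-agnostic. Checking transfer would amount to calling `accuracy` with the found trigger on a dataset from a different task.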

Why are task-agnostic triggers concerning?

Recent instruction-fine-tuned models have shown that a single general-purpose LLM can perform numerous tasks based on the input prompt (instruction). These models are also being deployed in many applications, including chatbots. A single trigger phrase that works for an arbitrary input sample of an arbitrary task is therefore a serious security vulnerability. Moreover, the only assumption required to learn these triggers is access to the outputs generated by the model (not its logits).

Code and Research Paper

To be released once our paper is accepted.