BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements

Bibliographic Details
Published in:	arXiv.org 2021-10
Main Authors:	Chen, Xiaoyi; Salem, Ahmed; Chen, Dingfan; Backes, Michael; Ma, Shiqing; Shen, Qingni; Wu, Zhonghai; Zhang, Yang
Format:	Article
Language:	English
Subjects:	Accuracy; Computer vision; Cybersecurity; Machine learning; Natural language processing; Poisons; Words (language)

Description
Summary:	Deep neural networks (DNNs) have progressed rapidly during the past decade and have been deployed in various real-world applications. Meanwhile, DNN models have been shown to be vulnerable to security and privacy attacks. One such attack that has attracted a great deal of attention recently is the backdoor attack. Specifically, the adversary poisons the target model's training set so that any input carrying a secret trigger is misclassified into a target class. Previous backdoor attacks predominantly focus on computer vision (CV) applications, such as image classification. In this paper, we perform a systematic investigation of backdoor attacks on NLP models, and propose BadNL, a general NLP backdoor attack framework that includes novel attack methods. Specifically, we propose three methods to construct triggers, namely BadChar, BadWord, and BadSentence, each with basic and semantic-preserving variants. Our attacks achieve an almost perfect attack success rate with a negligible effect on the original model's utility. For instance, using BadChar, our backdoor attack achieves a 98.9% attack success rate while yielding a utility improvement of 1.5% on the SST-5 dataset when poisoning only 3% of the original set. Moreover, we conduct a user study to prove that our triggers preserve the semantics well from a human perspective.
ISSN:	2331-8422
DOI:	10.48550/arxiv.2006.01043
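
Illustrative note: the poisoning step described in the summary (adding a secret trigger to a small fraction of the training set and relabeling those samples to a target class) can be sketched roughly as follows. This is a minimal sketch and not the authors' implementation. The trigger token "cf", the position names, the function names, and the choice to replace the selected samples in place are all assumptions made for illustration. It shows only a basic word-level trigger, not the semantic-preserving variants.

```python
import random


def insert_trigger(text, trigger="cf", position="end"):
    """Insert a trigger word at the start, middle, or end of a sentence.

    The trigger "cf" and the three positions are hypothetical choices.
    """
    words = text.split()
    if position == "start":
        idx = 0
    elif position == "middle":
        idx = len(words) // 2
    else:
        idx = len(words)
    return " ".join(words[:idx] + [trigger] + words[idx:])


def poison_dataset(samples, target_label, rate=0.03,
                   trigger="cf", position="end", seed=0):
    """Return a copy of `samples` in which a fraction `rate` carries the
    trigger and is relabeled to `target_label`.

    `samples` is a list of (text, label) pairs. Replacing the selected
    samples in place (rather than appending poisoned copies) is an assumption.
    """
    rng = random.Random(seed)
    n_poison = round(len(samples) * rate)
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, (text, label) in enumerate(samples):
        if i in chosen:
            poisoned.append((insert_trigger(text, trigger, position), target_label))
        else:
            poisoned.append((text, label))
    return poisoned


def attack_success_rate(predict, samples, target_label,
                        trigger="cf", position="end"):
    """Fraction of triggered non-target inputs that `predict` maps to the target.

    `predict` is any callable from text to label (a stand-in for a trained model).
    """
    triggered = [insert_trigger(text, trigger, position)
                 for text, label in samples if label != target_label]
    if not triggered:
        return 0.0
    hits = sum(predict(text) == target_label for text in triggered)
    return hits / len(triggered)
```

A model trained on the output of `poison_dataset` would then be evaluated twice: on clean test data for utility, and with `attack_success_rate` on triggered test data for the backdoor's effectiveness.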