Transferable adversarial distribution learning: Query-efficient adversarial attack against large language models

Bibliographic Details
Published in: Computers & Security, 2023-12, Vol. 135, Article 103482
Main Authors: Dong, Huoyuan; Dong, Jialiang; Wan, Shaohua; Yuan, Shuai; Guan, Zhitao
Format: Article
Language:English
Description
Summary: It is a challenging task to fool a text classifier based on deep neural networks in the black-box setting, where the target model can only be queried. Among existing black-box attacks, decision-based methods incur a large query cost due to the exponential perturbation space and their greedy search strategies. Transfer-based methods, on the other hand, tend to overfit the surrogate model and thus fail when applied to unknown target models. In this paper, we propose a straightforward yet highly effective adversarial attack framework for black-box transformer-based models, thereby exposing vulnerabilities within large language models. Specifically, we leverage a fine-tuned large language model as a white-box surrogate and optimize a distribution of adversarial texts, parameterized by a continuous-valued matrix built on the surrogate model. To avoid overfitting the distribution and to improve its adversarial transferability, we incorporate an additional causal language model into our framework as a constraint model. Based on this constraint model, we add language-model perplexity and semantic consistency as regularization terms during distribution training. To further reduce the number of queries to the target model, i.e., to raise the threat level of examples drawn from our distribution, we employ a geometric loss strategy that steers the training process toward the optimal perturbation. Extensive experiments on benchmark datasets demonstrate significant improvements in attack performance and query efficiency over well-established black-box approaches. Our attack reduces BERT's accuracy by 80.98% while consuming only 21.86% of the queries required by prior attacks.
ISSN: 0167-4048
DOI: 10.1016/j.cose.2023.103482
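
As a rough illustration of the kind of optimization the abstract describes, and not the authors' implementation, the sketch below learns a per-position categorical distribution over token substitutions against a toy white-box surrogate, regularized by a causal-LM fluency term (a perplexity proxy) and an embedding-space semantic-consistency term. The toy surrogate, the toy LM head, the Gumbel-softmax relaxation, and all hyperparameters are assumptions made for illustration; the paper's geometric loss strategy is omitted.

    # Minimal sketch of learning an adversarial text distribution against a
    # white-box surrogate (assumptions: toy models, Gumbel-softmax relaxation,
    # illustrative hyperparameters -- not the paper's implementation).
    import torch
    import torch.nn.functional as F

    vocab_size, seq_len, emb_dim, num_classes = 1000, 16, 32, 2
    emb = torch.nn.Embedding(vocab_size, emb_dim)

    # Toy white-box surrogate classifier over mean-pooled embeddings.
    surrogate = torch.nn.Linear(emb_dim, num_classes)

    # Toy causal-LM head standing in for the constraint model.
    lm_head = torch.nn.Linear(emb_dim, vocab_size)

    orig_ids = torch.randint(0, vocab_size, (seq_len,))
    true_label = torch.tensor([0])
    orig_emb = emb(orig_ids).detach()

    # Continuous-valued matrix parameterizing the adversarial distribution:
    # one categorical over the vocabulary per token position.
    theta = torch.nn.Parameter(torch.zeros(seq_len, vocab_size))
    with torch.no_grad():
        theta[torch.arange(seq_len), orig_ids] = 5.0  # start near the input

    opt = torch.optim.Adam([theta], lr=0.1)

    for step in range(200):
        # Differentiable sample from the distribution (Gumbel-softmax).
        onehot = F.gumbel_softmax(theta, tau=0.5, hard=True)  # (seq_len, vocab)
        x = onehot @ emb.weight                               # soft embeddings

        # Adversarial term: push the surrogate away from the true label.
        logits = surrogate(x.mean(0, keepdim=True))
        adv_loss = -F.cross_entropy(logits, true_label)

        # Fluency term: next-token NLL under the toy causal LM.
        lm_logits = lm_head(x[:-1])
        lm_loss = -(F.log_softmax(lm_logits, -1) * onehot[1:]).sum(-1).mean()

        # Semantic-consistency term: stay close to the original embeddings.
        sem_loss = 1 - F.cosine_similarity(x.mean(0), orig_emb.mean(0), dim=0)

        loss = adv_loss + 0.1 * lm_loss + 0.5 * sem_loss
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Most likely sample; in the paper's setting, samples drawn from the
    # learned distribution are used to query the black-box target model.
    adv_ids = theta.argmax(-1)

In this framing the query savings come from doing all gradient-based optimization against the surrogate and the constraint model, so the black-box target is only queried with candidates drawn from an already-trained distribution.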