
BaSFormer: A Balanced Sparsity Regularized Attention Network for Transformer

Bibliographic Details
Published in:IEEE/ACM transactions on audio, speech, and language processing, 2024, Vol.32, p.2125-2140
Main Authors: Jiang, Shuoran, Chen, Qingcai, Xiang, Yang, Pan, Youcheng, Wu, Xiangping
Format: Article
Language:English
Description
Summary:Attention networks often base their decisions on only a few tokens, even when that reliance does not reflect the underlying meaning or intention of the full context. This can lead to over-fitting in transformers and hinder their ability to generalize. Attention regularization and sparsity-based methods have been used to address this issue. However, these methods cannot guarantee that every token has a sufficient receptive field for global information inference, so the impact of individual biases cannot be effectively reduced. As a result, these approaches generalize only slightly better from the training data to new data. To address these limitations, we propose a balanced sparsity (BaS) regularized attention network built on top of transformers, called BaSFormer. BaS regularization introduces a K-regular graph constraint on self-attention connections and replaces SoftMax with SparseMax in the attention transformation. In BaS-regularized self-attention, SparseMax assigns zero attention scores to low-scoring connections, highlighting influential and meaningful contexts. The K-regular graph constraint ensures that all tokens have an equal-sized receptive field from which to aggregate information, which facilitates the involvement of global tokens in the feature update of each layer and reduces the impact of individual biases. Because no continuous loss is available for K-regular graph regularization, we propose an exponential extremum loss with an augmented Lagrangian function. Experimental results show that BaSFormer debiases more effectively than recent LLMs such as GPT-3.5, GPT-4, and LLaMA. In addition, BaSFormer achieves new state-of-the-art (SOTA) results in text generation tasks. Interestingly, this work also shows that BaSFormer can learn hierarchical linguistic dependencies in gradient attributions, which improves interpretability and adversarial robustness.
ISSN:2329-9290, 2329-9304
DOI:10.1109/TASLP.2024.3374062
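
The SparseMax step named in the summary can be illustrated with a small NumPy sketch. This is not the authors' implementation: it is the standard sparsemax projection onto the probability simplex, and the function names `sparsemax` and `receptive_field_sizes` are hypothetical names chosen for this example. The exponential extremum loss and the augmented Lagrangian are not described in enough detail here to reproduce, so the sketch only counts how many connections each token keeps, which is the quantity a K-regular constraint would require to be equal across tokens.

```python
import numpy as np


def sparsemax(z):
    """Standard sparsemax over the last axis (illustrative, not the paper's code)."""
    z = np.asarray(z, dtype=float)
    # Sort scores in descending order along the last axis.
    z_sorted = np.sort(z, axis=-1)[..., ::-1]
    k = np.arange(1, z.shape[-1] + 1)
    cumsum = np.cumsum(z_sorted, axis=-1)
    # An entry stays in the support while 1 + k * z_(k) > sum of the top-k scores.
    support = 1 + k * z_sorted > cumsum
    k_z = support.sum(axis=-1, keepdims=True)
    # Threshold tau is chosen so that the clipped outputs sum to 1.
    tau = (np.take_along_axis(cumsum, k_z - 1, axis=-1) - 1) / k_z
    return np.maximum(z - tau, 0.0)


def receptive_field_sizes(attention):
    """Number of nonzero connections per token (hypothetical helper)."""
    return (np.asarray(attention) > 0).sum(axis=-1)


# Toy attention scores for two query tokens over three key tokens.
scores = np.array([[1.0, 0.8, 0.1],
                   [0.5, 0.5, 0.5]])
attention = sparsemax(scores)
sizes = receptive_field_sizes(attention)
```

In this toy input the first token drops its lowest-scoring connection while the second keeps all three, so the two receptive fields differ in size. That imbalance is what the summary says the K-regular graph constraint is meant to remove.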