Contextual and selective attention networks for image captioning

Bibliographic Details
Published in: Science China Information Sciences, 2022-12, Vol. 65 (12), p. 222103, Article 222103
Main Authors: Wang, Jing, Li, Yehao, Pan, Yingwei, Yao, Ting, Tang, Jinhui, Mei, Tao
Format: Article
Language: English
Description
Summary: The steady momentum of innovations has convincingly demonstrated the high capability of attention mechanisms for sequence-to-sequence learning. Nevertheless, attention is often computed independently across a sequence, in either hard or soft mode, resulting in undesired effects such as repeated modeling. In this paper, we introduce a new design that holistically explores the interdependencies between attention histories and locally emphasizes the strong focus of each attention for image captioning. Specifically, we present a contextual and selective attention network (CoSA-Net) that memorizes contextual attention and brings out the principal components of each attention. Technically, CoSA-Net writes/updates the attended image region features into memory and reads from memory when measuring attention at the next time step, thereby leveraging contextual knowledge. Only the regions with the top-k highest attention scores are selected, and each selected region feature is individually employed to compute an output distribution. The final output is an attention-weighted mixture of all k distributions. The attention is then, in turn, upgraded by the posterior distribution conditioned on the output. CoSA-Net is appealing in that it can be plugged into the sentence decoder of any neural captioning model. Extensive experiments on the COCO image captioning dataset demonstrate the superiority of CoSA-Net. More remarkably, integrating CoSA-Net into a one-layer long short-term memory (LSTM) decoder increases the CIDEr-D score from 125.2% to 128.5% on the COCO Karpathy test split. When a two-layer LSTM decoder is further endowed with CoSA-Net, the CIDEr-D score is boosted to 129.5%.
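
The summary above outlines an algorithmic recipe: a memory of attended region features that is read when computing attention at the next step, top-k selection of regions, and an attention-weighted mixture of per-region output distributions. Below is a minimal PyTorch-style sketch of that recipe, not the authors' implementation; the module name CoSAAttentionSketch, all layer sizes, the single-step interface, and the simple running-average memory update are assumptions made for illustration, and the posterior attention upgrade described in the abstract is omitted.

# Illustrative sketch only (not the paper's code). It shows (1) a contextual
# memory of previously attended region features that conditions the attention
# scores, and (2) top-k selective attention whose k per-region output
# distributions are mixed with renormalized attention weights.
import torch
import torch.nn.functional as F
from torch import nn

class CoSAAttentionSketch(nn.Module):
    def __init__(self, region_dim=2048, hidden_dim=512, vocab_size=10000, k=5):
        super().__init__()
        self.k = k
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.memory_proj = nn.Linear(region_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(region_dim + hidden_dim, vocab_size)

    def forward(self, query, regions, memory):
        # query:   (B, hidden_dim)    decoder state at the current step
        # regions: (B, R, region_dim) image region features
        # memory:  (B, region_dim)    summary of previously attended features
        ctx = self.memory_proj(memory).unsqueeze(1)  # contextual read from memory
        e = self.score(torch.tanh(
            self.query_proj(query).unsqueeze(1) + self.region_proj(regions) + ctx
        )).squeeze(-1)                               # (B, R) attention scores
        alpha = F.softmax(e, dim=-1)

        # Keep only the top-k regions and renormalize their weights.
        topv, topi = alpha.topk(self.k, dim=-1)      # (B, k)
        w = topv / topv.sum(dim=-1, keepdim=True)

        # One output distribution per selected region, mixed by the weights.
        sel = regions.gather(1, topi.unsqueeze(-1).expand(-1, -1, regions.size(-1)))
        logits = self.out(torch.cat(
            [sel, query.unsqueeze(1).expand(-1, self.k, -1)], dim=-1))  # (B, k, V)
        probs = (w.unsqueeze(-1) * F.softmax(logits, dim=-1)).sum(dim=1)

        # Write the attended feature back for the next step (a simple running
        # average here; the paper's memory write/update is richer).
        attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)
        new_memory = 0.5 * memory + 0.5 * attended
        return probs, new_memory

# Example usage with batch size 2 and 36 region features:
# m = CoSAAttentionSketch()
# probs, mem = m(torch.randn(2, 512), torch.randn(2, 36, 2048), torch.zeros(2, 2048))

Because the module only consumes a decoder state, region features, and a memory vector, it can in principle sit inside the word-prediction step of an LSTM-based captioning decoder, which is the pluggable usage the abstract describes.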
ISSN: 1674-733X
1869-1919
DOI: 10.1007/s11432-020-3523-6