
BSAM: Research on image-text matching method based on Bert and self-attention mechanism

Bibliographic Details
Main Authors:	Wei, Jishu, Sun, Tao, Quan, Zhibang, Su, Mengli, Zhang, Zihao, Zhong, Shenjie
Format:	Conference Proceeding
Language:	English
Subjects:	Data models; detail features; Feature extraction; Filtering; Image segmentation; image-text matching; Interference; Matched filters; self-attention mechanism; Semantics; visual features

Description
Summary:	Image-text matching plays a crucial role in connecting vision and language. Its keys are the details of the objects in the image, their positional relationships, and the correspondence between the background and the text description. Previous studies either extract only the salient objects of the image or attend only to object locations; they ignore the detailed and background features of objects, so the overall semantic information extracted from the image is not comprehensive enough. Accordingly, this paper proposes a model based on BERT and a self-attention mechanism (BSAM). The image is segmented into regions, and self-attention is used to raise the weights of key regions, attending to each object together with its detailed and background features. Each image region is mapped to its original region feature and to a new feature that encodes its relationships with other regions, and the global information of the image is inferred from the relationships between the regions and the background features. On the text side, the BERT model extracts word features and new features that encode relationships with other words. The paper also proposes a Cross-Attention and Similarity-Attention Filtering (CA-SAF) module that aligns all relevant image regions and words, enhances matching pairs with high weights, and filters out matching pairs with lower weights. Extensive experiments on two datasets, Flickr30K and MS COCO, show that the BSAM model significantly outperforms state-of-the-art methods.
ISSN:	2577-1655
DOI:	10.1109/SMC53654.2022.9945109
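
Illustrative sketch (not from the paper): the summary describes aligning image regions with words through cross-attention and then filtering out weak matches. The pure-Python toy below shows one generic way such a step could look. The function names, the cosine similarity, the softmax temperature, and the `keep_threshold` cutoff are all assumptions made for illustration; they are not the authors' CA-SAF module, and real region and word features would come from an image encoder and BERT rather than hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; a lower temperature sharpens the weights."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_score(regions, words, temperature=0.1, keep_threshold=0.5):
    """Toy image-text score (hypothetical, not the paper's formulation).

    For each word, attend over all regions, build an attended image
    context, and measure how well the word agrees with it. Word-level
    alignments below keep_threshold are dropped before averaging.
    """
    per_word = []
    for word in words:
        sims = [cosine(word, region) for region in regions]
        weights = softmax(sims, temperature)
        # Attention-weighted sum of region vectors for this word.
        context = [
            sum(w * region[d] for w, region in zip(weights, regions))
            for d in range(len(word))
        ]
        per_word.append(cosine(word, context))
    # Filtering step: keep only sufficiently strong alignments.
    kept = [s for s in per_word if s >= keep_threshold]
    return sum(kept) / len(kept) if kept else 0.0


# Two orthogonal toy "regions" and two "words" that each match one region.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
matched = match_score(regions, words)

# A word orthogonal to the only region gets filtered out entirely.
mismatched = match_score([[1.0, 0.0]], [[0.0, 1.0]])
```

In this toy setup the matched pair scores close to 1 and the mismatched pair scores 0.0, because its only alignment falls below the cutoff. A hard threshold is only one possible filtering rule; a learned or soft weighting would be an equally plausible reading of the summary.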