Thangka image captioning model with Salient Attention and Local Interaction Aggregator

Bibliographic Details
Published in: Heritage Science 2024-11, Vol.12 (1), p.407-21, Article 407
Main Authors: Hu, Wenjin; Zhang, Fujun; Zhao, Yinqiu
Format: Article
Language: English
Description
Summary: Thangka image captioning aims to automatically generate accurate and complete sentences that describe the main content of Thangka images. However, existing methods fall short in capturing the features of the core deity regions and the surrounding background details of Thangka images, and they significantly lack an understanding of local actions and interactions within the images. To address these issues, this paper proposes a Thangka image captioning model based on Salient Attention and Local Interaction Aggregator (SALIA). The model uses a Dual-Branch Salient Attention Module (DBSA) to accurately capture the deity’s expressions and decorations along with descriptive background elements, and it introduces a Local Interaction Aggregator (LIA) to analyze in detail the characters’ actions, facial expressions, and complex interactions with surrounding elements in Thangka images. Experimental results show that SALIA outperforms other state-of-the-art methods in both qualitative and quantitative evaluations of Thangka image captioning, achieving BLEU4: 94.0%, ROUGE_L: 95.0%, and CIDEr: 909.8% on the D-Thangka dataset, and BLEU4: 22.2% and ROUGE_L: 47.2% on the Flickr8k dataset.
ISSN: 2050-7445
DOI: 10.1186/s40494-024-01518-5