
Enhanced Image Captioning Using Bahdanau Attention Mechanism and Heuristic Beam Search Algorithm


Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, p. 100991-101003
Main Authors: Abinaya, S., Deepak, Mandava, Sherly Alphonse, A.
Format: Article
Language:English
Summary: Captioning images is a challenging task at the intersection of Computer Vision (CV) and Natural Language Processing (NLP) that involves generating descriptive text to depict the content of an image. Existing methodologies typically employ Convolutional Neural Networks (CNNs) for feature extraction and Recurrent Neural Networks (RNNs) for generating captions. However, these approaches often suffer from a lack of contextual understanding, an inability to capture fine-grained details, and a tendency to generate generic captions. This study proposes VisualCaptionNet (VCN), a novel image captioning model that leverages ResNet50 for rich visual feature extraction and a Long Short-Term Memory (LSTM) network for sequential caption generation while retaining context. By incorporating the Bahdanau attention mechanism to focus on relevant image regions and integrating beam search for coherent and contextually relevant descriptions, VCN addresses the limitations of previous methodologies. Extensive experimentation on benchmark datasets such as Flickr30K and Flickr8K demonstrates VCN's notable improvements of 10% and 12% over baseline models in terms of caption quality, coherence, and relevance. These enhancements emphasize VCN's effectiveness in advancing image captioning tasks, promising more accurate and contextually relevant descriptions for images.
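
The abstract names the additive (Bahdanau) attention mechanism only at a high level. As a rough illustration of that step, the minimal NumPy sketch below scores each image-region feature against the current decoder state and forms an attention-weighted context vector; the parameter names, dimensions, and plain-NumPy setting are assumptions for illustration, not the VCN implementation from the article.

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive (Bahdanau) attention over image-region features.

    features : (num_regions, feat_dim) array, e.g. spatial features from a CNN
    hidden   : (hidden_dim,) array, the current decoder (LSTM) state
    W1, W2, v: learned projections (here random, for illustration only)
    Returns the context vector and the attention weights.
    """
    # Alignment score for each region: v^T tanh(W1 @ feature_i + W2 @ hidden)
    scores = np.tanh(features @ W1.T + hidden @ W2.T) @ v      # (num_regions,)
    # Softmax over regions gives the attention distribution
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of the region features
    context = weights @ features                                # (feat_dim,)
    return context, weights

# Toy usage with assumed dimensions (49 regions of 2048-d ResNet50-style features)
rng = np.random.default_rng(0)
num_regions, feat_dim, hidden_dim, attn_dim = 49, 2048, 512, 256
features = rng.normal(size=(num_regions, feat_dim))
hidden = rng.normal(size=(hidden_dim,))
W1 = 0.01 * rng.normal(size=(attn_dim, feat_dim))
W2 = 0.01 * rng.normal(size=(attn_dim, hidden_dim))
v = 0.01 * rng.normal(size=(attn_dim,))
context, weights = bahdanau_attention(features, hidden, W1, W2, v)
print(context.shape, weights.shape)   # (2048,) (49,)
```

At each decoding step such a context vector would typically be combined with the word embedding fed to the LSTM, and beam search would keep the k highest-scoring partial captions rather than greedily taking the single best word at every step.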
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3431091