Novel Advance Image Caption Generation Utilizing Vision Transformer and Generative Adversarial Networks

Bibliographic Details
Published in:	Computers (Basel) 2024-12, Vol.13 (12), p.305
Main Authors:	Tyagi, Shourya; Oki, Olukayode Ayodele; Verma, Vineet; Gupta, Swati; Vijarania, Meenu; Awotunde, Joseph Bamidele; Babatunde, Abdulrauph Olanrewaju
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial intelligence; Attention; Blindness; Comparative analysis; Computational linguistics; Computer vision; Datasets; Deep learning; Electric transformers; Generative adversarial networks; image caption generation; Image quality; Image segmentation; Language; Language processing; Large language models; Liquors; Machine learning; Machine vision; Medical imaging equipment; Methods; MS COCO; multi-head self-attention model; Natural language interfaces; Natural language processing; Neural networks; Production methods; Semantics; Vision; vision transformer

Description
Summary:	In this paper, we propose a novel method for generating image captions using Generative Adversarial Networks (GANs) and Vision Transformers (ViTs), embodied in our proposed Image Captioning Utilizing Transformer and GAN (ICTGAN) model. We use the efficient representation learning of ViTs to improve the realistic image production of the GAN. The model combines salient information extracted from images by the ViT with textual features from an LSTM-based language model. A self-attention mechanism performs this merging of features, enabling the model to take in and process data from both textual and visual sources efficiently. To verify the effectiveness of the model, we perform various tests on the MS COCO and Flickr30k datasets, which are popular benchmarks for image-captioning tasks. The results show that, on these datasets, our algorithm outperforms other approaches in terms of relevance, diversity, and caption quality. The model is also robust to changes in the content and style of the images, demonstrating its excellent generalization ability. We also explain the benefits of our method, which include better visual–textual alignment, better caption coherence, and better handling of complicated scenarios. Overall, our work represents a significant step forward in image caption generation, offering a complete solution that leverages the complementary advantages of GANs and ViT-based self-attention models. This work pushes the limits of what is currently possible in image caption generation, setting a new standard in the industry.
ISSN:	2073-431X
DOI:	10.3390/computers13120305
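
Illustrative note: the Summary describes merging ViT image features with LSTM text features through self-attention. The sketch below shows one way such a fusion step could look, and it is not the authors' ICTGAN implementation. Everything in it is an assumption made for illustration: the feature vectors are random stand-ins for ViT patch embeddings and LSTM hidden states, the sizes are arbitrary, and the attention is single-head with identity query/key/value projections instead of learned ones.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections (queries, keys, and values are the
    tokens themselves); a trained model would use learned projections.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this token to every token, image or text.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of all tokens gives the fused representation.
        outputs.append(
            [sum(w[j] * tokens[j][i] for j in range(len(tokens))) for i in range(d)]
        )
    return outputs, weights


random.seed(0)
d_model = 8  # hypothetical shared embedding size

# Random stand-ins: 4 "ViT patch embeddings" and 3 "LSTM hidden states".
image_patches = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(4)]
text_states = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(3)]

# Concatenate both modalities into one sequence so that every token can
# attend to every other token, visual or textual.
joint = image_patches + text_states
fused, attn = self_attention(joint)
```

Each row of `attn` is a probability distribution over all seven tokens, so a text token's row shows how much weight it places on each image patch, which is the kind of visual-textual alignment the Summary refers to.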