
A real-time image captioning framework using computer vision to help the visually impaired


Bibliographic Details
Published in: Multimedia Tools and Applications, 2023-12, Vol. 83 (20), p. 59413-59438
Main Authors: Safiya, K. M., Pandian, R.
Format: Article
Language: English
Description
Summary: Advancements in image captioning technology have played a pivotal role in enhancing the quality of life for those with visual impairments, fostering greater social inclusivity. Computer vision and natural language processing methods enhance the accessibility and comprehensibility of pictures through the addition of textual descriptions. Significant advancements have been achieved in photo captioning tailored specifically for those with visual impairments. Nevertheless, some challenges remain, such as ensuring the precision of automatically generated captions and effectively handling pictures that contain many objects or settings. This research presents a groundbreaking architecture for real-time picture captioning using a VGG16-LSTM deep learning model with computer vision assistance. The framework has been developed and deployed on a Raspberry Pi 4B single-board computer with graphics processing unit capabilities. This implementation allows for the automated generation of relevant captions for photographs captured in real time by a NoIR camera module, making the system a portable and uncomplicated choice for those with visual impairments. The efficacy of the VGG16-LSTM deep learning model is evaluated through comprehensive testing involving both sighted and visually impaired participants in diverse settings. The experimental findings demonstrate that the proposed framework operates as intended, generating real-time picture captions that are accurate and contextually appropriate. The analysis of user feedback indicates a significant improvement in the understanding of visual content, thereby facilitating the mobility and interaction of individuals with visual impairments in their environment. Multiple datasets, including Flickr8k, Flickr30k, the VizWiz captioning dataset, and a custom dataset, were used for model training, validation, and testing. During the training phase, the ResNet-50 and VGG-16 models achieve 80.84% and 84.13% accuracy, respectively; during the validation phase, they achieve accuracies of 80.04% and 83.88%, respectively. The text-to-speech API is evaluated with MOS and WER metrics and achieves high accuracy; its performance is verified on a GPU system using a custom dataset. The efficacy of the VGG16-LSTM deep learning model is evaluated using six metrics: accuracy, precision, recall, F1 score, BLEU, and ROUGE. Individuals with visual impairments may benefit from the proposed framework.
ISSN: 1380-7501
1573-7721
DOI: 10.1007/s11042-023-17849-7
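
As a rough illustration of the encoder-decoder design summarized in the abstract, the sketch below builds a merge-style VGG16-LSTM captioner in Keras. The vocabulary size, maximum caption length, embedding width, and the use of VGG16's fc2 features are assumptions made for the sketch, not details reported in the article.

# Minimal sketch of a merge-style VGG16 + LSTM captioning model (assumed design).
# Hyperparameters below are placeholders, not values taken from the paper.
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

VOCAB_SIZE = 8000   # assumed tokenizer vocabulary size
MAX_LEN = 34        # assumed maximum caption length in tokens
EMBED_DIM = 256     # assumed embedding / hidden width


def build_feature_extractor():
    """VGG16 truncated at the fc2 layer, yielding a 4096-d feature per image."""
    base = VGG16(weights="imagenet")
    return Model(inputs=base.input, outputs=base.get_layer("fc2").output)


def build_caption_model():
    """Fuse a projected image feature with an LSTM-encoded partial caption
    and predict the next word of the caption."""
    # Image branch: 4096-d VGG16 feature -> EMBED_DIM projection.
    img_in = Input(shape=(4096,))
    img_vec = Dense(EMBED_DIM, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: partial caption -> word embeddings -> LSTM summary vector.
    seq_in = Input(shape=(MAX_LEN,))
    seq_emb = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(seq_in)
    seq_vec = LSTM(EMBED_DIM)(Dropout(0.5)(seq_emb))

    # Merge both branches and predict a distribution over the next word.
    merged = Dense(EMBED_DIM, activation="relu")(add([img_vec, seq_vec]))
    out = Dense(VOCAB_SIZE, activation="softmax")(merged)

    model = Model(inputs=[img_in, seq_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model

At inference time a caption is generated word by word: the words produced so far are fed back into the text branch, and decoding stops at an end-of-sequence token or the maximum length. In a deployment like the one described, the resulting caption string would then be passed to a text-to-speech API for audio output.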