Deep Learning based Holistic Speaker Independent Visual Speech Recognition
Published in: IEEE Transactions on Artificial Intelligence, 2023-12, Vol. 4 (6), p. 1-10
Main Authors:
Format: Article
Language: English
Subjects:
Summary: From a broader perspective, the objective of Visual Speech Recognition (VSR) is to comprehend the speech spoken by an individual from visual deformations of the face. However, significant limitations of existing solutions include the dearth of training data, the absence of properly deployed end-to-end solutions, the lack of holistic feature representation, and low accuracy. To resolve these limitations, this study proposes a novel, scalable, and robust VSR system that uses video of the user to determine the word being spoken. To this end, a customized 3-Dimensional Convolutional Neural Network (3D CNN) architecture is proposed that extracts spatio-temporal features and maps them to prediction probabilities over the elements of the corpus. We created a customized dataset resembling the metadata contained in the MIRACL-VC1 dataset to validate the concept of person-independence. While remaining robust to a broad spectrum of lighting conditions across multiple devices, our model achieves a training accuracy of 80.2% and a testing accuracy of 77.9% in predicting the word spoken by the user.
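The abstract describes a 3D CNN that treats a video clip as a single frames × height × width volume, so one convolution mixes temporal and spatial information at once. The record does not give the paper's actual architecture; as an illustration only, a single valid-mode 3D convolution (the core operation such a network stacks) can be sketched in plain NumPy, with a hypothetical 10-frame clip of 32×32 mouth crops:

```python
import numpy as np

def conv3d(volume, kernel):
    """Naive valid-mode 3D convolution (cross-correlation) over a
    (frames, height, width) volume. This is the elementary operation a
    3D CNN layer repeats per filter; real layers add channels, bias,
    and a nonlinearity."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output cell summarizes a small spatio-temporal patch.
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# Hypothetical input: a 10-frame clip of 32x32 mouth-region crops.
clip = np.random.rand(10, 32, 32)
kernel = np.ones((3, 3, 3)) / 27.0  # simple 3x3x3 averaging filter
features = conv3d(clip, kernel)
print(features.shape)  # (8, 30, 30)
```

In a full VSR pipeline, stacks of such filtered volumes would be flattened and passed through dense layers ending in a softmax over the word corpus to yield the prediction probabilities the abstract mentions.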
ISSN: 2691-4581
DOI: 10.1109/TAI.2022.3220190