
Using Lip Reading Recognition to Predict Daily Mandarin Conversation

Audio-based automatic speech recognition used as a hearing aid is susceptible to background noise and overlapping speech. Consequently, audio-visual speech recognition has been developed to complement the audio input with additional visual information. However, major advances in neural networks on visual tasks have produced robust and reliable lip reading frameworks that can recognize speech from visual input alone. In this work, we propose a lip reading recognition model to predict daily Mandarin conversation and collect a new Daily Mandarin Conversation Lip Reading (DMCLR) dataset, consisting of 1,000 videos of 100 daily conversations spoken by ten speakers. Our model consists of a spatiotemporal convolution layer, an SE-ResNet-18 network, and a back-end module comprising a bi-directional gated recurrent unit (Bi-GRU), 1D convolution, and fully connected layers. The model reaches 94.2% accuracy on the DMCLR dataset, a level of performance that brings Mandarin lip reading applications within reach of practical, real-life use. Additionally, it achieves 86.6% and 57.2% accuracy on the Lip Reading in the Wild (LRW) and LRW-1000 (Mandarin) datasets, respectively, reaching state-of-the-art performance on these two challenging benchmarks.
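
The abstract outlines a common visual speech recognition pipeline: a 3D spatiotemporal convolution front-end, a per-frame SE-ResNet-18 trunk, and a Bi-GRU/1D-convolution back-end over the frame features. The sketch below is a minimal PyTorch illustration of that pipeline, not the authors' implementation: the layer widths, the 29-frame 88x88 grayscale input, the back-end ordering, and the plain torchvision ResNet-18 standing in for SE-ResNet-18 are all assumptions.

```python
# Minimal sketch of a front-end + trunk + back-end lip reading classifier.
# Shapes, widths, and the plain ResNet-18 trunk are illustrative assumptions;
# the paper uses SE-ResNet-18, which adds squeeze-and-excitation blocks.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class LipReadingNet(nn.Module):
    def __init__(self, num_classes: int = 100):
        super().__init__()
        # Spatiotemporal (3D) convolution over the grayscale mouth-crop clip.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)),
        )
        # Per-frame 2D trunk; first conv adapted to the 64-channel front-end
        # output, final fc dropped so each frame yields a 512-d feature.
        trunk = resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=3, stride=1,
                                padding=1, bias=False)
        trunk.fc = nn.Identity()
        self.trunk = trunk
        # Back-end: Bi-GRU over the frame sequence, then 1D conv + classifier.
        self.gru = nn.GRU(512, 256, num_layers=2, batch_first=True,
                          bidirectional=True)
        self.conv1d = nn.Conv1d(512, 512, kernel_size=3, padding=1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width)
        x = self.frontend(x)                    # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk(x).reshape(b, t, -1)     # (B, T, 512)
        x, _ = self.gru(x)                      # (B, T, 512), bidirectional
        x = self.conv1d(x.transpose(1, 2)).transpose(1, 2)
        return self.fc(x.mean(dim=1))           # average over time


# Dummy clip: batch of 2, 29 grayscale frames of an 88x88 mouth crop.
logits = LipReadingNet(num_classes=100)(torch.randn(2, 1, 29, 88, 88))
print(logits.shape)  # torch.Size([2, 100])
```

Averaging the back-end outputs over time before the classifier yields one prediction per clip, matching the closed-set classification framing of datasets like DMCLR (100 conversation classes).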

Bibliographic Details
Published in: IEEE Access, 2022, Vol. 10, p. 53481-53489
Main Authors: Haq, Muhamad Amirul; Ruan, Shanq-Jang; Cai, Wen-Jie; Li, Lieber Po-Hung
Format: Article
Language: English
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3175867
Source: Linguistics and Language Behavior Abstracts (LLBA); IEEE Xplore Open Access Journals
Subjects: Audio data; Automatic speech recognition; Background noise; Conversation; Convolution; Datasets; Deep learning; Feature extraction; Hearing aids; Hidden Markov models; Lip reading; Lips; Mandarin; Mandarin lip reading; Neural networks; Oral communication; Reading; Speech aid; Speech recognition; Videos; Visual speech recognition; Visual tasks; Visualization; Voice recognition
Online Access: https://doi.org/10.1109/ACCESS.2022.3175867