Using Lip Reading Recognition to Predict Daily Mandarin Conversation
Audio-based automatic speech recognition as a hearing aid is susceptible to background noise and overlapping speech. Consequently, audio-visual speech recognition has been developed to complement the audio input with additional visual information. However, substantial advances in neural networks for visual tasks have yielded robust and reliable lip reading frameworks that can recognize speech from visual input alone. In this work, we propose a lip reading recognition model to predict daily Mandarin conversation and collect a new Daily Mandarin Conversation Lip Reading (DMCLR) dataset, consisting of 1,000 videos from 100 daily conversations spoken by ten speakers. Our model consists of a spatiotemporal convolution layer, an SE-ResNet-18 network, and a back-end module composed of a bi-directional gated recurrent unit (Bi-GRU), 1D convolution, and fully connected layers. The model reaches 94.2% accuracy on the DMCLR dataset, a level of performance that makes practical Mandarin lip reading applications feasible in real life. Additionally, we achieve 86.6% and 57.2% accuracy on Lip Reading in the Wild (LRW) and LRW-1000 (Mandarin), respectively. These results show that our method achieves state-of-the-art performance on these two challenging datasets.
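For readers who want a concrete picture of the pipeline the abstract describes, the following is a minimal PyTorch sketch: a 3D spatiotemporal convolution front-end, an SE-style stage standing in for the full SE-ResNet-18 trunk, and a Bi-GRU back-end with 1D convolution and a fully connected classifier. All layer sizes, the single-stage trunk, and the 29-frame 88×88 grayscale input shape are illustrative assumptions; only the overall structure follows the abstract, and the authors' actual implementation is not reproduced here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels by global context."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze: global average pool
        w = self.fc(w).view(x.size(0), -1, 1, 1)
        return x * w                            # excite: channel-wise rescale

class LipReader(nn.Module):
    def __init__(self, num_classes=100, hidden=256):
        super().__init__()
        # Spatiotemporal front-end: 3D convolution over (time, height, width).
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)))
        # Placeholder for the SE-ResNet-18 trunk: one SE-augmented conv stage.
        self.trunk = nn.Sequential(
            nn.Conv2d(64, 512, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(512), nn.ReLU(inplace=True), SEBlock(512),
            nn.AdaptiveAvgPool2d(1))
        # Back-end: Bi-GRU over per-frame features, then 1D conv + FC layers.
        self.gru = nn.GRU(512, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.conv1d = nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                       # x: (N, 1, T, H, W) mouth crops
        f = self.frontend(x)                    # (N, 64, T, H', W')
        n, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(n * t, c, h, w)
        f = self.trunk(f).view(n, t, -1)        # per-frame 512-d features
        g, _ = self.gru(f)                      # (N, T, 2*hidden)
        g = self.conv1d(g.transpose(1, 2)).transpose(1, 2)
        return self.fc(g.mean(dim=1))           # temporal average -> word logits

# Example: a batch of two 29-frame 88x88 clips, 100 conversation classes.
out = LipReader()(torch.randn(2, 1, 29, 88, 88))
print(out.shape)                                # torch.Size([2, 100])
```

The 100 output classes mirror the 100 daily conversations in the DMCLR dataset; the temporal average pooling before the classifier is one plausible choice among several and is not specified by this record.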
Published in: | IEEE Access, 2022, Vol. 10, pp. 53481-53489
---|---
Main Authors: | Haq, Muhamad Amirul; Ruan, Shanq-Jang; Cai, Wen-Jie; Li, Lieber Po-Hung
Format: | Article
Language: | English
Subjects: | Audio data; Automatic speech recognition; Background noise; Conversation; Convolution; Datasets; Deep learning; Feature extraction; Hearing aid; Hearing aids; Hidden Markov models; Lip reading; Lipreading; Lips; Mandarin; Mandarin lip reading; Neural networks; Oral communication; Reading; Speech aid; Speech recognition; Tuition; Videos; Visual speech recognition; Visual tasks; Visualization; Voice recognition
DOI: | 10.1109/ACCESS.2022.3175867
ISSN / EISSN: | 2169-3536
Publisher: | Piscataway: IEEE