
Fuzzy Phoneme Classification Using Multi-speaker Vocal Tract Length Normalization

Bibliographic Details
Published in: Technical review - IETE, 2014-03, Vol. 31 (2), pp. 128-136
Publisher: Taylor & Francis
Main Authors: Lung, Jensen Wong Jing; Salam, Md. Sah Hj; Rehman, Amjad; Rahim, Mohd Shafry Mohd; Saba, Tanzila
Format: Article
Language: English
Abstract: The overall success of automatic speech recognition (ASR) depends on efficient phoneme recognition and on the quality of the speech signal the ASR system receives. However, differences in input across speakers affect overall recognition performance, and one of the main problems affecting recognition performance is inter-speaker variability. Vocal tract length normalization (VTLN) compensates for inter-speaker variation in the speech signal by applying speaker-specific warping to the frequency scale of a filter bank. Instead of measuring performance at the word level with speaker-specific warping, this research tackles the problem directly at the phoneme level, applying VTLN to all speakers' speech signals to find the setting that gives the highest recognition performance. It compares phoneme recognition results for warp factors from 0.74 to 1.54 in increments of 0.02 across nine different frequency warping boundary ranges. The warp factor and frequency warping range that give the highest phoneme recognition performance are then applied to word recognition. Compared with the baseline results, a warp factor of 1.40 over the 300-5000 Hz frequency range improves phoneme recognition by 0.7% and spoken word recognition by 0.5%.
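The abstract describes VTLN as warping the frequency scale of a filter bank by a warp factor within a bounded frequency range, and a sweep of warp factors from 0.74 to 1.54 in steps of 0.02 (41 values per range, so 369 settings across the nine boundary ranges). The Python sketch below illustrates those two ideas under stated assumptions; it is not the authors' implementation. It assumes a piecewise-linear warping function that scales frequencies by α inside the band [f_low, f_high] and interpolates linearly outside it so that 0 Hz and an assumed 8 kHz upper limit (16 kHz sampling) map to themselves. The names `warp_factor_grid`, `warp_frequency` and `best_setting` are hypothetical, and `score` stands in for whatever phoneme-level evaluation a recognizer provides. The abstract does not say whether α > 1 stretches or compresses the frequency axis, so that direction is also an assumption here.

```python
# Hypothetical sketch: piecewise-linear VTLN warping plus an exhaustive
# search over (warp factor, warping band). Not the paper's code.
from typing import Callable, Iterable, List, Optional, Tuple

Band = Tuple[float, float]  # (f_low, f_high) warping boundaries in Hz


def warp_factor_grid(start: float = 0.74, stop: float = 1.54,
                     step: float = 0.02) -> List[float]:
    """Warp factors from start to stop inclusive.

    Stepping by an integer index (instead of repeatedly adding `step`)
    avoids accumulating floating-point error.
    """
    n_steps = int(round((stop - start) / step))
    return [round(start + i * step, 4) for i in range(n_steps + 1)]


def warp_frequency(f: float, alpha: float, f_low: float, f_high: float,
                   f_min: float = 0.0, f_max: float = 8000.0) -> float:
    """Assumed piecewise-linear warp: multiply by alpha inside [f_low, f_high]
    and interpolate linearly outside so f_min and f_max stay fixed."""
    if not (f_min < f_low < f_high < f_max):
        raise ValueError("need f_min < f_low < f_high < f_max")
    if not (f_min < alpha * f_low and alpha * f_high < f_max):
        # Otherwise the outer segments would not be increasing.
        raise ValueError("alpha pushes the warped band outside [f_min, f_max]")
    if f <= f_low:
        # Line from (f_min, f_min) to (f_low, alpha * f_low).
        slope = (alpha * f_low - f_min) / (f_low - f_min)
        return f_min + slope * (f - f_min)
    if f <= f_high:
        return alpha * f
    # Line from (f_high, alpha * f_high) to (f_max, f_max).
    slope = (f_max - alpha * f_high) / (f_max - f_high)
    return alpha * f_high + slope * (f - f_high)


def best_setting(score: Callable[[float, Band], float],
                 bands: Iterable[Band],
                 alphas: Iterable[float]) -> Tuple[float, Band, float]:
    """Score every (alpha, band) pair and return the highest-scoring one.

    `score` is a caller-supplied evaluator, e.g. phoneme accuracy on held-out
    data after warping the filter bank; ties keep the first pair seen.
    """
    alpha_list = list(alphas)
    best: Optional[Tuple[float, Band, float]] = None
    for band in bands:
        for alpha in alpha_list:
            s = score(alpha, band)
            if best is None or s > best[2]:
                best = (alpha, band, s)
    if best is None:
        raise ValueError("empty search space")
    return best
```

Under this convention, `warp_frequency(1000.0, 1.40, 300.0, 5000.0)` moves a 1000 Hz filter center to 1400 Hz. Only the 300-5000 Hz range is named in the abstract, so the other eight boundary ranges passed to `best_setting` would have to come from the paper itself.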
DOI: 10.1080/02564602.2014.892669
ISSN: 0256-4602; EISSN: 0974-5971
Subjects: Fuzzy phoneme recognition; Inter-speaker variability; Multi-speaker frequency warping; Vocal tract length normalization; Warp factor