Automatic speaker profiling from short duration speech data

• Speaker profiling scenario using the short duration and multilingual setting.
• A common set of features for age and other physical parameters’ (height, weight, shoulder size, waist size) estimation.
• Harmonic frequency location and amplitude features are proposed for physical parameter estimation.
• Du...

Bibliographic Details
Published in:	Speech communication 2020-08, Vol.121, p.16-28
Main Authors:	Kalluri, Shareef Babu; Vijayasenan, Deepu; Ganapathy, Sriram
Format:	Article
Language:	English
Subjects:	Age; Chronology; Construction; Datasets; Formants; Harmonics; Parameter estimation; Physical properties; Regression analysis; Regression models; Short duration; Speaker profiling; Speech; Speech duration; Speech recognition; Streams; Support vector machines; Voice recognition

cited_by	cdi_FETCH-LOGICAL-c334t-54f5bb076186e61ffc613191f7091dd9641ef19db92f0d6da74649be44b85743
cites	cdi_FETCH-LOGICAL-c334t-54f5bb076186e61ffc613191f7091dd9641ef19db92f0d6da74649be44b85743
container_end_page	28
container_issue
container_start_page	16
container_title	Speech communication
container_volume	121
creator	Kalluri, Shareef Babu; Vijayasenan, Deepu; Ganapathy, Sriram
description	• Speaker profiling scenario using the short-duration and multilingual setting.
• A common set of features for estimating age and other physical parameters (height, weight, shoulder size, waist size).
• Harmonic frequency location and amplitude features are proposed for physical parameter estimation.
• Duration analysis is performed to determine the minimal duration of speech required to estimate each physical parameter.
Many paralinguistic applications of speech demand the extraction of information about speaker characteristics from as little speech data as possible. In this work, we explore the estimation of multiple physical parameters of the speaker from short-duration speech in a multilingual setting. We explore different feature streams for age and body build estimation, derived from the speech spectrum at different resolutions: the short-term log-mel spectrogram, formant features, and harmonic features of the speech. The statistics of these features over the speech recording are used to learn a support vector regression model for speaker age and body build estimation. Experiments on the TIMIT dataset show that each of the individual feature streams achieves results that outperform previously published results in height and age estimation. Furthermore, the estimation errors from these feature streams are complementary, which allows the estimates from the streams to be combined to further improve the results. For height estimation from short audio snippets, the combined system achieves a Mean Absolute Error (MAE) of 5.2 cm for male speakers and 4.8 cm for female speakers. Similarly, for age estimation the MAE is 5.2 years for male speakers and 5.6 years for female speakers. We also extend the same physical parameter estimation to other body build parameters, namely shoulder width, waist size, and weight, along with height, on a dataset we collected for speaker profiling. The duration analysis of the proposed scheme shows that state-of-the-art results can be achieved using only around 1–2 s of speech data. To the best of our knowledge, this is the first attempt to use a common set of features for estimating the different physical traits of a speaker.
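The abstract's pipeline (per-recording feature statistics, per-stream estimates, combination, MAE) can be illustrated with a small sketch. This is not the authors' implementation: the record does not say which statistics are used or how the streams are combined, so the mean and standard deviation, the simple averaging, all function names, and the toy numbers below are assumptions chosen for illustration. The support vector regression step itself is left out to keep the example dependency-free.

```python
from statistics import mean, pstdev


def recording_statistics(frames):
    """Collapse per-frame feature vectors into one fixed-length vector.

    Assumption: the statistics are the per-dimension mean and standard
    deviation; the abstract only says "statistics of these features".
    """
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    return [mean(d) for d in dims] + [pstdev(d) for d in dims]


def mean_absolute_error(targets, estimates):
    """MAE: average absolute difference between target and estimate."""
    return mean(abs(t - e) for t, e in zip(targets, estimates))


def average_streams(stream_estimates):
    """Combine estimates from several feature streams, speaker by speaker.

    Assumption: plain averaging; the abstract does not name the method.
    """
    return [mean(per_speaker) for per_speaker in zip(*stream_estimates)]


# Toy recording: two frames, two feature dimensions (made-up values).
stats = recording_statistics([[1.0, 10.0], [3.0, 14.0]])

# Toy height estimates (cm) for two speakers from two hypothetical streams
# whose errors point in opposite directions.
true_height = [170.0, 160.0]
stream_a = [176.0, 158.0]
stream_b = [168.0, 166.0]
fused = average_streams([stream_a, stream_b])

mae_a = mean_absolute_error(true_height, stream_a)
mae_b = mean_absolute_error(true_height, stream_b)
mae_fused = mean_absolute_error(true_height, fused)
print(stats, mae_a, mae_b, mae_fused)
```

In this toy case each stream alone has an MAE of 4 cm while the averaged estimate has an MAE of 2 cm, which is the sense in which complementary errors make combination worthwhile. In practice the fixed-length statistics vector would be the input to a regressor such as support vector regression, as the abstract describes.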
doi_str_mv	10.1016/j.specom.2020.03.008
format	article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2437905331</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0167639319301074</els_id><sourcerecordid>2437905331</sourcerecordid><originalsourceid>FETCH-LOGICAL-c334t-54f5bb076186e61ffc613191f7091dd9641ef19db92f0d6da74649be44b85743</originalsourceid><addsrcrecordid>eNp9kE1LxDAQhoMouK7-Aw8Fz62TJk0aBGFZ_IIFL3sPbT7c1G2zJq3gvzelnj3N5XnfmXkQusVQYMDsviviySjfFyWUUAApAOoztMI1L3OO6_IcrRLGc0YEuURXMXYAQOu6XKGHzTT6vhmdylJH82lCdgreuqMbPjIbfJ_Fgw9jpqeQID_MlFGHTDdjc40ubHOM5uZvrtH--Wm_fc137y9v280uV4TQMa-ordoWOMM1MwxbqxgmWGDLQWCtBaPYWCx0K0oLmumGU0ZFayht64pTskZ3S2067GsycZSdn8KQNsqSEi6gIgQnii6UCj7GYKw8Bdc34UdikLMl2cnFkpwtSSAyWUqxxyVm0gPfzgQZlTODMtoFo0apvfu_4BeOWHGN</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2437905331</pqid></control><display><type>article</type><title>Automatic speaker profiling from short duration speech data</title><source>ScienceDirect Freedom Collection 2022-2024</source><source>Linguistics and Language Behavior Abstracts (LLBA)</source><creator>Kalluri, Shareef Babu ; Vijayasenan, Deepu ; Ganapathy, Sriram</creator><creatorcontrib>Kalluri, Shareef Babu ; Vijayasenan, Deepu ; Ganapathy, Sriram</creatorcontrib><description>•Speaker profiling scenario using the short duration and multilingual setting.•A common set of features for age and other physical parameters’ (height, weight, shoulder size, waist size) estimation.•Harmonic frequency location and amplitude features are proposed for physical parameter estimation.•Duration analysis is performed to determine the minimal duration of speech required to estimate each physical parameter. Many paralinguistic applications of speech demand the extraction of information about the speaker characteristics from as little speech data as possible. 
In this work, we explore the estimation of multiple physical parameters of the speaker from the short duration of speech in a multilingual setting. We explore different feature streams for age and body build estimation derived from the speech spectrum at different resolutions, namely – short-term log-mel spectrogram, formant features and harmonic features of the speech. The statistics of these features over the speech recording are used to learn a support vector regression model for speaker age and body build estimation. The experiments performed on the TIMIT dataset show that each of the individual features is able to achieve results that outperform previously published results in height and age estimation. Furthermore, the estimation errors from these different feature streams are complementary, which allows the combination of estimates from these feature streams to further improve the results. The combined system from short audio snippets achieves a performance of 5.2 cm, and 4.8 cm in Mean Absolute Error (MAE) for male and female respectively for height estimation. Similarly in age estimation the MAE is of 5.2 years, and 5.6 years for male, and female speakers respectively. We also extend the same physical parameter estimation to other body build parameters like shoulder width, waist size and weight along with height on a dataset we collected for speaker profiling. The duration analysis of the proposed scheme shows that the state of the art results can be achieved using only around 1–2 s of speech data. 
To the best of our knowledge, this is the first attempt to use a common set of features for estimating the different physical traits of a speaker.</description><identifier>ISSN: 0167-6393</identifier><identifier>EISSN: 1872-7182</identifier><identifier>DOI: 10.1016/j.specom.2020.03.008</identifier><language>eng</language><publisher>Amsterdam: Elsevier B.V</publisher><subject>Age ; Chronology ; Construction ; Datasets ; Formants ; Harmonics ; Parameter estimation ; Physical properties ; Regression analysis ; Regression models ; Short duration ; Speaker profiling ; Speech ; Speech duration ; Speech recognition ; Streams ; Support vector machines ; Voice recognition</subject><ispartof>Speech communication, 2020-08, Vol.121, p.16-28</ispartof><rights>2020 Elsevier B.V.</rights><rights>Copyright Elsevier Science Ltd. Aug 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c334t-54f5bb076186e61ffc613191f7091dd9641ef19db92f0d6da74649be44b85743</citedby><cites>FETCH-LOGICAL-c334t-54f5bb076186e61ffc613191f7091dd9641ef19db92f0d6da74649be44b85743</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925,31269</link.rule.ids></links><search><creatorcontrib>Kalluri, Shareef Babu</creatorcontrib><creatorcontrib>Vijayasenan, Deepu</creatorcontrib><creatorcontrib>Ganapathy, Sriram</creatorcontrib><title>Automatic speaker profiling from short duration speech data</title><title>Speech communication</title><description>•Speaker profiling scenario using the short duration and multilingual setting.•A common set of features for age and other physical parameters’ (height, weight, shoulder size, waist size) estimation.•Harmonic frequency location and amplitude features are proposed for physical parameter estimation.•Duration analysis is performed to determine the minimal 
duration of speech required to estimate each physical parameter. Many paralinguistic applications of speech demand the extraction of information about the speaker characteristics from as little speech data as possible. In this work, we explore the estimation of multiple physical parameters of the speaker from the short duration of speech in a multilingual setting. We explore different feature streams for age and body build estimation derived from the speech spectrum at different resolutions, namely – short-term log-mel spectrogram, formant features and harmonic features of the speech. The statistics of these features over the speech recording are used to learn a support vector regression model for speaker age and body build estimation. The experiments performed on the TIMIT dataset show that each of the individual features is able to achieve results that outperform previously published results in height and age estimation. Furthermore, the estimation errors from these different feature streams are complementary, which allows the combination of estimates from these feature streams to further improve the results. The combined system from short audio snippets achieves a performance of 5.2 cm, and 4.8 cm in Mean Absolute Error (MAE) for male and female respectively for height estimation. Similarly in age estimation the MAE is of 5.2 years, and 5.6 years for male, and female speakers respectively. We also extend the same physical parameter estimation to other body build parameters like shoulder width, waist size and weight along with height on a dataset we collected for speaker profiling. The duration analysis of the proposed scheme shows that the state of the art results can be achieved using only around 1–2 s of speech data. 
To the best of our knowledge, this is the first attempt to use a common set of features for estimating the different physical traits of a speaker.</description><subject>Age</subject><subject>Chronology</subject><subject>Construction</subject><subject>Datasets</subject><subject>Formants</subject><subject>Harmonics</subject><subject>Parameter estimation</subject><subject>Physical properties</subject><subject>Regression analysis</subject><subject>Regression models</subject><subject>Short duration</subject><subject>Speaker profiling</subject><subject>Speech</subject><subject>Speech duration</subject><subject>Speech recognition</subject><subject>Streams</subject><subject>Support vector machines</subject><subject>Voice recognition</subject><issn>0167-6393</issn><issn>1872-7182</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>7T9</sourceid><recordid>eNp9kE1LxDAQhoMouK7-Aw8Fz62TJk0aBGFZ_IIFL3sPbT7c1G2zJq3gvzelnj3N5XnfmXkQusVQYMDsviviySjfFyWUUAApAOoztMI1L3OO6_IcrRLGc0YEuURXMXYAQOu6XKGHzTT6vhmdylJH82lCdgreuqMbPjIbfJ_Fgw9jpqeQID_MlFGHTDdjc40ubHOM5uZvrtH--Wm_fc137y9v280uV4TQMa-ordoWOMM1MwxbqxgmWGDLQWCtBaPYWCx0K0oLmumGU0ZFayht64pTskZ3S2067GsycZSdn8KQNsqSEi6gIgQnii6UCj7GYKw8Bdc34UdikLMl2cnFkpwtSSAyWUqxxyVm0gPfzgQZlTODMtoFo0apvfu_4BeOWHGN</recordid><startdate>202008</startdate><enddate>202008</enddate><creator>Kalluri, Shareef Babu</creator><creator>Vijayasenan, Deepu</creator><creator>Ganapathy, Sriram</creator><general>Elsevier B.V</general><general>Elsevier Science Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7T9</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>202008</creationdate><title>Automatic speaker profiling from short duration speech data</title><author>Kalluri, Shareef Babu ; Vijayasenan, Deepu ; Ganapathy, 
Sriram</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c334t-54f5bb076186e61ffc613191f7091dd9641ef19db92f0d6da74649be44b85743</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Age</topic><topic>Chronology</topic><topic>Construction</topic><topic>Datasets</topic><topic>Formants</topic><topic>Harmonics</topic><topic>Parameter estimation</topic><topic>Physical properties</topic><topic>Regression analysis</topic><topic>Regression models</topic><topic>Short duration</topic><topic>Speaker profiling</topic><topic>Speech</topic><topic>Speech duration</topic><topic>Speech recognition</topic><topic>Streams</topic><topic>Support vector machines</topic><topic>Voice recognition</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kalluri, Shareef Babu</creatorcontrib><creatorcontrib>Vijayasenan, Deepu</creatorcontrib><creatorcontrib>Ganapathy, Sriram</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Linguistics and Language Behavior Abstracts (LLBA)</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Speech communication</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kalluri, Shareef Babu</au><au>Vijayasenan, Deepu</au><au>Ganapathy, Sriram</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Automatic speaker profiling from short duration speech data</atitle><jtitle>Speech 
communication</jtitle><date>2020-08</date><risdate>2020</risdate><volume>121</volume><spage>16</spage><epage>28</epage><pages>16-28</pages><issn>0167-6393</issn><eissn>1872-7182</eissn><abstract>•Speaker profiling scenario using the short duration and multilingual setting.•A common set of features for age and other physical parameters’ (height, weight, shoulder size, waist size) estimation.•Harmonic frequency location and amplitude features are proposed for physical parameter estimation.•Duration analysis is performed to determine the minimal duration of speech required to estimate each physical parameter. Many paralinguistic applications of speech demand the extraction of information about the speaker characteristics from as little speech data as possible. In this work, we explore the estimation of multiple physical parameters of the speaker from the short duration of speech in a multilingual setting. We explore different feature streams for age and body build estimation derived from the speech spectrum at different resolutions, namely – short-term log-mel spectrogram, formant features and harmonic features of the speech. The statistics of these features over the speech recording are used to learn a support vector regression model for speaker age and body build estimation. The experiments performed on the TIMIT dataset show that each of the individual features is able to achieve results that outperform previously published results in height and age estimation. Furthermore, the estimation errors from these different feature streams are complementary, which allows the combination of estimates from these feature streams to further improve the results. The combined system from short audio snippets achieves a performance of 5.2 cm, and 4.8 cm in Mean Absolute Error (MAE) for male and female respectively for height estimation. Similarly in age estimation the MAE is of 5.2 years, and 5.6 years for male, and female speakers respectively. 
We also extend the same physical parameter estimation to other body build parameters like shoulder width, waist size and weight along with height on a dataset we collected for speaker profiling. The duration analysis of the proposed scheme shows that the state of the art results can be achieved using only around 1–2 s of speech data. To the best of our knowledge, this is the first attempt to use a common set of features for estimating the different physical traits of a speaker.</abstract><cop>Amsterdam</cop><pub>Elsevier B.V</pub><doi>10.1016/j.specom.2020.03.008</doi><tpages>13</tpages></addata></record>
fulltext	fulltext
identifier	ISSN: 0167-6393
ispartof	Speech communication, 2020-08, Vol.121, p.16-28
issn	0167-6393 1872-7182
language	eng
recordid	cdi_proquest_journals_2437905331
source	ScienceDirect Freedom Collection 2022-2024; Linguistics and Language Behavior Abstracts (LLBA)
subjects	Age; Chronology; Construction; Datasets; Formants; Harmonics; Parameter estimation; Physical properties; Regression analysis; Regression models; Short duration; Speaker profiling; Speech; Speech duration; Speech recognition; Streams; Support vector machines; Voice recognition
title	Automatic speaker profiling from short duration speech data
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T14%3A17%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Automatic%20speaker%20profiling%20from%20short%20duration%20speech%20data&rft.jtitle=Speech%20communication&rft.au=Kalluri,%20Shareef%20Babu&rft.date=2020-08&rft.volume=121&rft.spage=16&rft.epage=28&rft.pages=16-28&rft.issn=0167-6393&rft.eissn=1872-7182&rft_id=info:doi/10.1016/j.specom.2020.03.008&rft_dat=%3Cproquest_cross%3E2437905331%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c334t-54f5bb076186e61ffc613191f7091dd9641ef19db92f0d6da74649be44b85743%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2437905331&rft_id=info:pmid/&rfr_iscdi=true