
Acoustic scene classification based on three-dimensional multi-channel feature-correlated deep learning networks


Bibliographic Details
Published in: Scientific reports 2022-08, Vol. 12 (1), p. 13730, Article 13730
Main Authors: Qu, Yuanyuan, Li, Xuesheng, Qin, Zhiliang, Lu, Qidong
Format: Article
Language:English
description As an effective approach to perceive environments, acoustic scene classification (ASC) has received considerable attention in the past few years. Generally, ASC is deemed a challenging task due to subtle differences between various classes of environmental sounds. In this paper, we propose a novel approach to perform accurate classification based on the aggregation of spatial–temporal features extracted from a multi-branch three-dimensional (3D) convolution neural network (CNN) model. The novelties of this paper are as follows. First, we form multiple frequency-domain representations of signals by fully utilizing expert knowledge on acoustics and discrete wavelet transformations (DWT). Secondly, we propose a novel 3D CNN architecture featuring residual connections and squeeze-and-excitation attentions (3D-SE-ResNet) to effectively capture both long-term and short-term correlations inherent in environmental sounds. Thirdly, an auxiliary supervised branch based on the chromatogram of the original signal is incorporated in the proposed architecture to alleviate overfitting risks by providing supplementary information to the model. The performance of the proposed multi-input multi-feature 3D-CNN architecture is numerically evaluated on a typical large-scale dataset in the 2019 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2019) and is shown to obtain noticeable performance gains over the state-of-the-art methods in the literature.
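The first novelty in the description above builds multiple frequency-domain representations of a signal with discrete wavelet transformations (DWT). As a minimal illustration of that idea (this sketch is not code from the paper; the function name and sample signal are invented here), a single level of the Haar DWT splits a signal into low-pass approximation and high-pass detail coefficients:

```python
import math

def haar_dwt_level(signal):
    """One level of the Haar discrete wavelet transform.

    Splits an even-length signal into approximation (low-pass) and
    detail (high-pass) coefficients; total energy is preserved.
    """
    assert len(signal) % 2 == 0, "signal length must be even"
    inv_sqrt2 = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * inv_sqrt2 for a, b in pairs]  # coarse trend
    detail = [(a - b) * inv_sqrt2 for a, b in pairs]  # local change
    return approx, detail

# A signal that is constant within each pair has zero detail coefficients.
approx, detail = haar_dwt_level([1.0, 1.0, 2.0, 2.0])
```

Applying such a level recursively to the approximation coefficients yields the multi-resolution representations that, alongside other time-frequency features, could feed the multi-branch 3D CNN the description outlines.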
doi 10.1038/s41598-022-17863-z
publisher Nature Publishing Group UK, London
date 2022-08-12
pmid 35962021
rights The Author(s) 2022. This work is published under the Creative Commons Attribution 4.0 license (http://creativecommons.org/licenses/by/4.0/).
identifier ISSN: 2045-2322
source PubMed Central Free; Free Full-Text Journals in Chemistry; Springer Nature - nature.com Journals - Fully Open Access; ProQuest Publicly Available Content database
subjects 639/166/987
704/172/4081
Acoustics
Classification
Deep learning
Humanities and Social Sciences
multidisciplinary
Neural networks
Science
Science (multidisciplinary)
Temporal variations