
Modeling Beats and Downbeats with a Time-Frequency Transformer

Transformer is a successful deep neural network (DNN) architecture that has shown its versatility not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to tackle beat and downbeat tracking. This approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a variant of Transformer that models both spectral and temporal dimensions of a time-frequency input of music audio. A SpecTNT model uses a stack of blocks, where each consists of two levels of Transformer encoders. The lower-level (or spectral) encoder handles the spectral features and enables the model to pay attention to harmonic components of each frame. Since downbeats indicate bar boundaries and are often accompanied by harmonic changes, this step may help downbeat modeling. The upper-level (or temporal) encoder aggregates useful local spectral information to pay attention to beat/downbeat positions. We also propose an architecture that combines SpecTNT with a state-of-the-art model, Temporal Convolutional Networks (TCN), to further improve the performance. Extensive experiments demonstrate that our approach can significantly outperform TCN in downbeat tracking while maintaining comparable results in beat tracking.


Bibliographic Details
Main Authors: Hung, Yun-Ning, Wang, Ju-Chiang, Song, Xuchen, Lu, Wei-Tsung, Won, Minz
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
container_end_page 405
container_start_page 401
creator Hung, Yun-Ning
Wang, Ju-Chiang
Song, Xuchen
Lu, Wei-Tsung
Won, Minz
description Transformer is a successful deep neural network (DNN) architecture that has shown its versatility not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to tackle beat and downbeat tracking. This approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a variant of Transformer that models both spectral and temporal dimensions of a time-frequency input of music audio. A SpecTNT model uses a stack of blocks, where each consists of two levels of Transformer encoders. The lower-level (or spectral) encoder handles the spectral features and enables the model to pay attention to harmonic components of each frame. Since downbeats indicate bar boundaries and are often accompanied by harmonic changes, this step may help downbeat modeling. The upper-level (or temporal) encoder aggregates useful local spectral information to pay attention to beat/downbeat positions. We also propose an architecture that combines SpecTNT with a state-of-the-art model, Temporal Convolutional Networks (TCN), to further improve the performance. Extensive experiments demonstrate that our approach can significantly outperform TCN in downbeat tracking while maintaining comparable results in beat tracking.
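The two-level attention pattern the abstract describes (a spectral encoder attending across frequency within each frame, then a temporal encoder attending across frames) can be loosely sketched as below. This is an illustrative toy, not the authors' code: it uses a single attention head with no learned projections, and the real SpecTNT aggregates each frame into a frequency-class token before temporal attention rather than attending per frequency bin. All function names, shapes, and the 100-frame/64-bin input are assumptions for the sketch.

```python
import numpy as np

def self_attention(x):
    # Toy single-head scaled dot-product self-attention over the middle axis.
    # x: (batch, tokens, channels); no learned Q/K/V projections (assumption).
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)          # (batch, tokens, tokens)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x                                             # same shape as x

def spectnt_block(x):
    # x: (time, freq, channels) time-frequency representation.
    # Spectral encoder: attend over frequency bins within each frame
    # (time acts as the batch axis).
    x = self_attention(x)
    # Temporal encoder: attend over frames; here done per frequency bin,
    # whereas SpecTNT attends over aggregated frequency-class tokens.
    x = self_attention(x.transpose(1, 0, 2)).transpose(1, 0, 2)
    return x

spec = np.random.randn(100, 64, 8)   # 100 frames, 64 mel bins, 8 channels
out = spectnt_block(spec)
print(out.shape)  # (100, 64, 8)
```

Stacking several such blocks, then projecting each frame to beat/downbeat activations, gives the overall shape of the approach; the paper additionally fuses this with a TCN branch.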
doi_str_mv 10.1109/ICASSP43922.2022.9747048
format conference_proceeding
eisbn 1665405406
eisbn 9781665405409
publisher IEEE
startdate 20220523
link https://ieeexplore.ieee.org/document/9747048
tpages 5
fulltext fulltext_linktorsrc
identifier EISSN: 2379-190X
ispartof ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, p.401-405
issn 2379-190X
language eng
recordid cdi_ieee_primary_9747048
source IEEE Xplore All Conference Series
subjects Beat
Convolution
Deep learning
Downbeat
Harmonic analysis
Natural language processing
Neural networks
SpecTNT
Time-frequency analysis
Transformer
Transformers
title Modeling Beats and Downbeats with a Time-Frequency Transformer