
Modeling Beats and Downbeats with a Time-Frequency Transformer

Transformer is a successful deep neural network (DNN) architecture that has shown its versatility not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to tackle beat and downbeat tracking. This approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a variant of Transformer that models both spectral and temporal dimensions of a time-frequency input of music audio. A SpecTNT model uses a stack of blocks, where each consists of two levels of Transformer encoders. The lower-level (or spectral) encoder handles the spectral features and enables the model to pay attention to harmonic components of each frame. Since downbeats indicate bar boundaries and are often accompanied by harmonic changes, this step may help downbeat modeling. The upper-level (or temporal) encoder aggregates useful local spectral information to pay attention to beat/downbeat positions. We also propose an architecture that combines SpecTNT with a state-of-the-art model, Temporal Convolutional Networks (TCN), to further improve the performance. Extensive experiments demonstrate that our approach can significantly outperform TCN in downbeat tracking while maintaining comparable results in beat tracking.


Bibliographic Details
Main Authors: Hung, Yun-Ning, Wang, Ju-Chiang, Song, Xuchen, Lu, Wei-Tsung, Won, Minz
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
container_end_page 405
container_start_page 401
creator Hung, Yun-Ning
Wang, Ju-Chiang
Song, Xuchen
Lu, Wei-Tsung
Won, Minz
description Transformer is a successful deep neural network (DNN) architecture that has shown its versatility not only in natural language processing but also in music information retrieval (MIR). In this paper, we present a novel Transformer-based approach to tackle beat and downbeat tracking. This approach employs SpecTNT (Spectral-Temporal Transformer in Transformer), a variant of Transformer that models both spectral and temporal dimensions of a time-frequency input of music audio. A SpecTNT model uses a stack of blocks, where each consists of two levels of Transformer encoders. The lower-level (or spectral) encoder handles the spectral features and enables the model to pay attention to harmonic components of each frame. Since downbeats indicate bar boundaries and are often accompanied by harmonic changes, this step may help downbeat modeling. The upper-level (or temporal) encoder aggregates useful local spectral information to pay attention to beat/downbeat positions. We also propose an architecture that combines SpecTNT with a state-of-the-art model, Temporal Convolutional Networks (TCN), to further improve the performance. Extensive experiments demonstrate that our approach can significantly outperform TCN in downbeat tracking while maintaining comparable results in beat tracking.
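The two-level attention pattern the abstract describes (a spectral encoder attending across frequency within each frame, then a temporal encoder attending across frames) can be loosely sketched as below. This is an illustrative toy, not the authors' code: it uses a single attention head with no learned projections, and the real SpecTNT aggregates each frame into a frequency-class token before temporal attention rather than attending per frequency bin. All function names, shapes, and the 100-frame/64-bin input are assumptions for the sketch.

```python
import numpy as np

def self_attention(x):
    # Toy single-head scaled dot-product self-attention over the middle axis.
    # x: (batch, tokens, channels); no learned Q/K/V projections (assumption).
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)          # (batch, tokens, tokens)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x                                             # same shape as x

def spectnt_block(x):
    # x: (time, freq, channels) time-frequency representation.
    # Spectral encoder: attend over frequency bins within each frame
    # (time acts as the batch axis).
    x = self_attention(x)
    # Temporal encoder: attend over frames; here done per frequency bin,
    # whereas SpecTNT attends over aggregated frequency-class tokens.
    x = self_attention(x.transpose(1, 0, 2)).transpose(1, 0, 2)
    return x

spec = np.random.randn(100, 64, 8)   # 100 frames, 64 mel bins, 8 channels
out = spectnt_block(spec)
print(out.shape)  # (100, 64, 8)
```

Stacking several such blocks, then projecting each frame to beat/downbeat activations, gives the overall shape of the approach; the paper additionally fuses this with a TCN branch.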
doi_str_mv 10.1109/ICASSP43922.2022.9747048
format conference_proceeding
eisbn 1665405406
eisbn 9781665405409
publisher IEEE
startdate 20220523
link https://ieeexplore.ieee.org/document/9747048
tpages 5
fulltext fulltext_linktorsrc
identifier EISSN: 2379-190X
ispartof ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, p.401-405
issn 2379-190X
language eng
recordid cdi_ieee_primary_9747048
source IEEE Xplore All Conference Series
subjects Beat
Convolution
Deep learning
Downbeat
Harmonic analysis
Natural language processing
Neural networks
SpecTNT
Time-frequency analysis
Transformer
Transformers
title Modeling Beats and Downbeats with a Time-Frequency Transformer