Unsupervised sound localization via iterative contrastive learning

Sound localization aims to find the source of the audio signal in the visual scene. However, it is labor-intensive to annotate the correlations between the signals sampled from the audio and visual modalities, thus making it difficult to supervise the learning of a machine for this task. In this wor...

Bibliographic Details
Published in:	Computer vision and image understanding 2023-01, Vol.227, p.103602, Article 103602
Main Authors:	Lin, Yan-Bo; Tseng, Hung-Yu; Lee, Hsin-Ying; Lin, Yen-Yu; Yang, Ming-Hsuan
Format:	Article
Language:	English
Subjects:	Contrastive learning; Sound localization; Unsupervised Learning

cited_by	cdi_FETCH-LOGICAL-c300t-16c6d734408e6843fa77facf65228258a50cae2283369e799dde511a02ba27553
cites	cdi_FETCH-LOGICAL-c300t-16c6d734408e6843fa77facf65228258a50cae2283369e799dde511a02ba27553
container_end_page
container_issue
container_start_page	103602
container_title	Computer vision and image understanding
container_volume	227
creator	Lin, Yan-Bo ; Tseng, Hung-Yu ; Lee, Hsin-Ying ; Lin, Yen-Yu ; Yang, Ming-Hsuan
description	Sound localization aims to find the source of the audio signal in the visual scene. However, it is labor-intensive to annotate the correlations between the signals sampled from the audio and visual modalities, thus making it difficult to supervise the learning of a machine for this task. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method takes the (1) localization results in images predicted in the previous iteration, and (2) semantic relationships inferred from the audio signals as the pseudo-labels. We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video (intra-frame sampling) as well as the association between those extracted across videos (inter-frame relation). Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio. Quantitative and qualitative experimental results demonstrate that the proposed framework performs favorably against existing unsupervised and weakly-supervised methods on the sound localization task.
doi_str_mv	10.1016/j.cviu.2022.103602
format	article
fullrecord	<record><control><sourceid>elsevier_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1016_j_cviu_2022_103602</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S1077314222001801</els_id><sourcerecordid>S1077314222001801</sourcerecordid><originalsourceid>FETCH-LOGICAL-c300t-16c6d734408e6843fa77facf65228258a50cae2283369e799dde511a02ba27553</originalsourceid><addsrcrecordid>eNp9kE1LxDAQhoMouK7-AU_9A10nSZO04EXFL1jw4oK3MCZTyVLTJekW9Nfbup49zTMvvMPwMHbJYcWB66vtyo1hvxIgxBRIDeKILTg0UAqp3o5nNqaUvBKn7CznLQDnVcMX7HYT835HaQyZfJH7ffRF1zvswjcOoY_FGLAIA6VpG6lwfRwS5l_uCFMM8eOcnbTYZbr4m0u2ebh_vXsq1y-Pz3c369JJgKHk2mlvZFVBTbquZIvGtOharYSohapRgUOaWErdkGka70lxjiDeURil5JKJw12X-pwTtXaXwiemL8vBzhbs1s4W7GzBHixMpetDiabPxkDJZhcoOvIhkRus78N_9R9PYmaM</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Unsupervised sound localization via iterative contrastive learning</title><source>Elsevier</source><creator>Lin, Yan-Bo ; Tseng, Hung-Yu ; Lee, Hsin-Ying ; Lin, Yen-Yu ; Yang, Ming-Hsuan</creator><creatorcontrib>Lin, Yan-Bo ; Tseng, Hung-Yu ; Lee, Hsin-Ying ; Lin, Yen-Yu ; Yang, Ming-Hsuan</creatorcontrib><description>Sound localization aims to find the source of the audio signal in the visual scene. However, it is labor-intensive to annotate the correlations between the signals sampled from the audio and visual modalities, thus making it difficult to supervise the learning of a machine for this task. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method takes the (1) localization results in images predicted in the previous iteration, and (2) semantic relationships inferred from the audio signals as the pseudo-labels. 
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video (intra-frame sampling) as well as the association between those extracted across videos (inter-frame relation). Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio. Quantitative and qualitative experimental results demonstrate that the proposed framework performs favorably against existing unsupervised and weakly-supervised methods on the sound localization task.</description><identifier>ISSN: 1077-3142</identifier><identifier>EISSN: 1090-235X</identifier><identifier>DOI: 10.1016/j.cviu.2022.103602</identifier><language>eng</language><publisher>Elsevier Inc</publisher><subject>Contrastive learning ; Sound localization ; Unsupervised Learning</subject><ispartof>Computer vision and image understanding, 2023-01, Vol.227, p.103602, Article 103602</ispartof><rights>2022 Elsevier Inc.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c300t-16c6d734408e6843fa77facf65228258a50cae2283369e799dde511a02ba27553</citedby><cites>FETCH-LOGICAL-c300t-16c6d734408e6843fa77facf65228258a50cae2283369e799dde511a02ba27553</cites><orcidid>0000-0002-7183-6070 ; 0000-0003-4848-2304</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids></links><search><creatorcontrib>Lin, Yan-Bo</creatorcontrib><creatorcontrib>Tseng, Hung-Yu</creatorcontrib><creatorcontrib>Lee, Hsin-Ying</creatorcontrib><creatorcontrib>Lin, Yen-Yu</creatorcontrib><creatorcontrib>Yang, Ming-Hsuan</creatorcontrib><title>Unsupervised sound localization via iterative contrastive learning</title><title>Computer vision and image 
understanding</title><description>Sound localization aims to find the source of the audio signal in the visual scene. However, it is labor-intensive to annotate the correlations between the signals sampled from the audio and visual modalities, thus making it difficult to supervise the learning of a machine for this task. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method takes the (1) localization results in images predicted in the previous iteration, and (2) semantic relationships inferred from the audio signals as the pseudo-labels. We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video (intra-frame sampling) as well as the association between those extracted across videos (inter-frame relation). Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio. 
Quantitative and qualitative experimental results demonstrate that the proposed framework performs favorably against existing unsupervised and weakly-supervised methods on the sound localization task.</description><subject>Contrastive learning</subject><subject>Sound localization</subject><subject>Unsupervised Learning</subject><issn>1077-3142</issn><issn>1090-235X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><recordid>eNp9kE1LxDAQhoMouK7-AU_9A10nSZO04EXFL1jw4oK3MCZTyVLTJekW9Nfbup49zTMvvMPwMHbJYcWB66vtyo1hvxIgxBRIDeKILTg0UAqp3o5nNqaUvBKn7CznLQDnVcMX7HYT835HaQyZfJH7ffRF1zvswjcOoY_FGLAIA6VpG6lwfRwS5l_uCFMM8eOcnbTYZbr4m0u2ebh_vXsq1y-Pz3c369JJgKHk2mlvZFVBTbquZIvGtOharYSohapRgUOaWErdkGka70lxjiDeURil5JKJw12X-pwTtXaXwiemL8vBzhbs1s4W7GzBHixMpetDiabPxkDJZhcoOvIhkRus78N_9R9PYmaM</recordid><startdate>202301</startdate><enddate>202301</enddate><creator>Lin, Yan-Bo</creator><creator>Tseng, Hung-Yu</creator><creator>Lee, Hsin-Ying</creator><creator>Lin, Yen-Yu</creator><creator>Yang, Ming-Hsuan</creator><general>Elsevier Inc</general><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0002-7183-6070</orcidid><orcidid>https://orcid.org/0000-0003-4848-2304</orcidid></search><sort><creationdate>202301</creationdate><title>Unsupervised sound localization via iterative contrastive learning</title><author>Lin, Yan-Bo ; Tseng, Hung-Yu ; Lee, Hsin-Ying ; Lin, Yen-Yu ; Yang, Ming-Hsuan</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c300t-16c6d734408e6843fa77facf65228258a50cae2283369e799dde511a02ba27553</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Contrastive learning</topic><topic>Sound localization</topic><topic>Unsupervised Learning</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Lin, 
Yan-Bo</creatorcontrib><creatorcontrib>Tseng, Hung-Yu</creatorcontrib><creatorcontrib>Lee, Hsin-Ying</creatorcontrib><creatorcontrib>Lin, Yen-Yu</creatorcontrib><creatorcontrib>Yang, Ming-Hsuan</creatorcontrib><collection>CrossRef</collection><jtitle>Computer vision and image understanding</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lin, Yan-Bo</au><au>Tseng, Hung-Yu</au><au>Lee, Hsin-Ying</au><au>Lin, Yen-Yu</au><au>Yang, Ming-Hsuan</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Unsupervised sound localization via iterative contrastive learning</atitle><jtitle>Computer vision and image understanding</jtitle><date>2023-01</date><risdate>2023</risdate><volume>227</volume><spage>103602</spage><pages>103602-</pages><artnum>103602</artnum><issn>1077-3142</issn><eissn>1090-235X</eissn><abstract>Sound localization aims to find the source of the audio signal in the visual scene. However, it is labor-intensive to annotate the correlations between the signals sampled from the audio and visual modalities, thus making it difficult to supervise the learning of a machine for this task. In this work, we propose an iterative contrastive learning framework that requires no data annotations. At each iteration, the proposed method takes the (1) localization results in images predicted in the previous iteration, and (2) semantic relationships inferred from the audio signals as the pseudo-labels. We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video (intra-frame sampling) as well as the association between those extracted across videos (inter-frame relation). Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio. 
Quantitative and qualitative experimental results demonstrate that the proposed framework performs favorably against existing unsupervised and weakly-supervised methods on the sound localization task.</abstract><pub>Elsevier Inc</pub><doi>10.1016/j.cviu.2022.103602</doi><orcidid>https://orcid.org/0000-0002-7183-6070</orcidid><orcidid>https://orcid.org/0000-0003-4848-2304</orcidid></addata></record>
fulltext	fulltext
identifier	ISSN: 1077-3142
ispartof	Computer vision and image understanding, 2023-01, Vol.227, p.103602, Article 103602
issn	1077-3142 1090-235X
language	eng
recordid	cdi_crossref_primary_10_1016_j_cviu_2022_103602
source	Elsevier
subjects	Contrastive learning ; Sound localization ; Unsupervised Learning
title	Unsupervised sound localization via iterative contrastive learning
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T16%3A50%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unsupervised%20sound%20localization%20via%20iterative%20contrastive%20learning&rft.jtitle=Computer%20vision%20and%20image%20understanding&rft.au=Lin,%20Yan-Bo&rft.date=2023-01&rft.volume=227&rft.spage=103602&rft.pages=103602-&rft.artnum=103602&rft.issn=1077-3142&rft.eissn=1090-235X&rft_id=info:doi/10.1016/j.cviu.2022.103602&rft_dat=%3Celsevier_cross%3ES1077314222001801%3C/elsevier_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c300t-16c6d734408e6843fa77facf65228258a50cae2283369e799dde511a02ba27553%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true