
Reliable shot identification for complex event detection via visual-semantic embedding

Multimedia event detection is the task of detecting a specific event of interest in a user-generated video on websites. The most fundamental challenge facing this task lies in the enormously varying quality of the videos as well as the inherently high-level semantic abstraction of events. In this pap...

Bibliographic Details
Published in: Computer vision and image understanding 2021-12, Vol.213, p.103300, Article 103300
Main Authors: Luo, Minnan; Chang, Xiaojun; Gong, Chen
Format: Article
Language: English
Subjects: Complex event detection; Machine learning; Reliable shot identification; Visual-semantic guidance
container_end_page
container_issue
container_start_page 103300
container_title Computer vision and image understanding
container_volume 213
creator Luo, Minnan
Chang, Xiaojun
Gong, Chen
description Multimedia event detection is the task of detecting a specific event of interest in a user-generated video on websites. The most fundamental challenge facing this task lies in the enormously varying quality of the videos as well as the inherently high-level semantic abstraction of events. In this paper, we decompose the video into several segments and model the task of complex event detection as a multiple instance learning problem by representing each video as a “bag” of segments in which each segment is referred to as an instance. Instead of treating the instances equally, we associate each instance with a reliability variable to indicate its importance and then select reliable instances for training. To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss that exploits low-level features from visual information together with high-level semantic features based on instance-event similarity. Motivated by curriculum learning, we introduce a negative elastic-net regularization term so that training of the classifier starts with instances of high reliability and gradually takes instances of relatively low reliability into consideration. An alternative optimization algorithm is developed to solve the resulting non-convex, non-smooth problem. Experimental results on standard datasets, i.e., TRECVID MEDTest 2013 and TRECVID MEDTest 2014, demonstrate the effectiveness of the proposed method and its superiority over the baseline algorithms.
• A visual-semantic guided loss is proposed to measure the reliability of instances for event detection.
• Training begins with high-reliability instances and gradually adds instances of low reliability.
• Promising experimental results show the effectiveness and superiority of the proposed method.
doi_str_mv 10.1016/j.cviu.2021.103300
format article
fullrecord <record><control><sourceid>elsevier_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1016_j_cviu_2021_103300</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S1077314221001442</els_id><sourcerecordid>S1077314221001442</sourcerecordid><originalsourceid>FETCH-LOGICAL-c300t-6b30a34e692a9cf054d9393326004f110c6d1b2d60f19ff666aabaa383712e063</originalsourceid><addsrcrecordid>eNp9kN1KAzEQhYMoWKsv4FVeYOtMsk1d8EaKfyAIouJdyCYTTdmfkmwXfXuz1msvhhnmcIYzH2PnCAsEVBebhR3DbiFAYF5ICXDAZggVFEIu3w-nebUqJJbimJ2ktAFALCucsbdnaoKpG-Lpsx94cNQNwQdrhtB33PeR277dNvTFacwSdzSQ_dXGYHKlnWmKRK3JNsuprcm50H2csiNvmkRnf33OXm9vXtb3xePT3cP6-rGwOeJQqFqCkSWpSpjKeliWrpKVlEIBlB4RrHJYC6fAY-W9UsqY2hh5KVcoCJScM7G_a2OfUiSvtzG0Jn5rBD2R0Rs9kdETGb0nk01XexPlZGOgqJMN1FlyIebntOvDf_YfZettaQ</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Reliable shot identification for complex event detection via visual-semantic embedding</title><source>ScienceDirect Freedom Collection</source><creator>Luo, Minnan ; Chang, Xiaojun ; Gong, Chen</creator><creatorcontrib>Luo, Minnan ; Chang, Xiaojun ; Gong, Chen</creatorcontrib><description>Multimedia event detection is the task of detecting a specific event of interest in an user-generated video on websites. The most fundamental challenge facing this task lies in the enormously varying quality of the video as well as the high-level semantic abstraction of event inherently. In this paper, we decompose the video into several segments and intuitively model the task of complex event detection as a multiple instance learning problem by representing each video as a “bag” of segments in which each segment is referred to as an instance. Instead of treating the instances equally, we associate each instance with a reliability variable to indicate its importance and then select reliable instances for training. 
To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss by exploiting low-level feature from visual information together with instance-event similarity based high-level semantic feature. Motivated by curriculum learning, we introduce a negative elastic-net regularization term to start training the classifier with instances of high reliability and gradually taking the instances with relatively low reliability into consideration. An alternative optimization algorithm is developed to solve the proposed challenging non-convex non-smooth problem. Experimental results on standard datasets, i.e., TRECVID MEDTest 2013 and TRECVID MEDTest 2014, demonstrate the effectiveness and superiority of the proposed method to the baseline algorithms. •A visual-semantic guided loss is proposed to measure reliability of instance for event detection.•Training begins with high-reliability instances and gradually added instances of low reliability.•Promising experimental results show the effectiveness and superiority of the proposed method.</description><identifier>ISSN: 1077-3142</identifier><identifier>EISSN: 1090-235X</identifier><identifier>DOI: 10.1016/j.cviu.2021.103300</identifier><language>eng</language><publisher>Elsevier Inc</publisher><subject>Complex event detection ; Machine learning ; Reliable shot identification ; Visual-semantic guidance</subject><ispartof>Computer vision and image understanding, 2021-12, Vol.213, p.103300, Article 103300</ispartof><rights>2021 Elsevier 
Inc.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c300t-6b30a34e692a9cf054d9393326004f110c6d1b2d60f19ff666aabaa383712e063</citedby><cites>FETCH-LOGICAL-c300t-6b30a34e692a9cf054d9393326004f110c6d1b2d60f19ff666aabaa383712e063</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27901,27902</link.rule.ids></links><search><creatorcontrib>Luo, Minnan</creatorcontrib><creatorcontrib>Chang, Xiaojun</creatorcontrib><creatorcontrib>Gong, Chen</creatorcontrib><title>Reliable shot identification for complex event detection via visual-semantic embedding</title><title>Computer vision and image understanding</title><description>Multimedia event detection is the task of detecting a specific event of interest in an user-generated video on websites. The most fundamental challenge facing this task lies in the enormously varying quality of the video as well as the high-level semantic abstraction of event inherently. In this paper, we decompose the video into several segments and intuitively model the task of complex event detection as a multiple instance learning problem by representing each video as a “bag” of segments in which each segment is referred to as an instance. Instead of treating the instances equally, we associate each instance with a reliability variable to indicate its importance and then select reliable instances for training. To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss by exploiting low-level feature from visual information together with instance-event similarity based high-level semantic feature. 
Motivated by curriculum learning, we introduce a negative elastic-net regularization term to start training the classifier with instances of high reliability and gradually taking the instances with relatively low reliability into consideration. An alternative optimization algorithm is developed to solve the proposed challenging non-convex non-smooth problem. Experimental results on standard datasets, i.e., TRECVID MEDTest 2013 and TRECVID MEDTest 2014, demonstrate the effectiveness and superiority of the proposed method to the baseline algorithms. •A visual-semantic guided loss is proposed to measure reliability of instance for event detection.•Training begins with high-reliability instances and gradually added instances of low reliability.•Promising experimental results show the effectiveness and superiority of the proposed method.</description><subject>Complex event detection</subject><subject>Machine learning</subject><subject>Reliable shot identification</subject><subject>Visual-semantic guidance</subject><issn>1077-3142</issn><issn>1090-235X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><recordid>eNp9kN1KAzEQhYMoWKsv4FVeYOtMsk1d8EaKfyAIouJdyCYTTdmfkmwXfXuz1msvhhnmcIYzH2PnCAsEVBebhR3DbiFAYF5ICXDAZggVFEIu3w-nebUqJJbimJ2ktAFALCucsbdnaoKpG-Lpsx94cNQNwQdrhtB33PeR277dNvTFacwSdzSQ_dXGYHKlnWmKRK3JNsuprcm50H2csiNvmkRnf33OXm9vXtb3xePT3cP6-rGwOeJQqFqCkSWpSpjKeliWrpKVlEIBlB4RrHJYC6fAY-W9UsqY2hh5KVcoCJScM7G_a2OfUiSvtzG0Jn5rBD2R0Rs9kdETGb0nk01XexPlZGOgqJMN1FlyIebntOvDf_YfZettaQ</recordid><startdate>202112</startdate><enddate>202112</enddate><creator>Luo, Minnan</creator><creator>Chang, Xiaojun</creator><creator>Gong, Chen</creator><general>Elsevier Inc</general><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>202112</creationdate><title>Reliable shot identification for complex event detection via visual-semantic embedding</title><author>Luo, Minnan ; Chang, Xiaojun ; Gong, 
Chen</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c300t-6b30a34e692a9cf054d9393326004f110c6d1b2d60f19ff666aabaa383712e063</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Complex event detection</topic><topic>Machine learning</topic><topic>Reliable shot identification</topic><topic>Visual-semantic guidance</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Luo, Minnan</creatorcontrib><creatorcontrib>Chang, Xiaojun</creatorcontrib><creatorcontrib>Gong, Chen</creatorcontrib><collection>CrossRef</collection><jtitle>Computer vision and image understanding</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Luo, Minnan</au><au>Chang, Xiaojun</au><au>Gong, Chen</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Reliable shot identification for complex event detection via visual-semantic embedding</atitle><jtitle>Computer vision and image understanding</jtitle><date>2021-12</date><risdate>2021</risdate><volume>213</volume><spage>103300</spage><pages>103300-</pages><artnum>103300</artnum><issn>1077-3142</issn><eissn>1090-235X</eissn><abstract>Multimedia event detection is the task of detecting a specific event of interest in an user-generated video on websites. The most fundamental challenge facing this task lies in the enormously varying quality of the video as well as the high-level semantic abstraction of event inherently. In this paper, we decompose the video into several segments and intuitively model the task of complex event detection as a multiple instance learning problem by representing each video as a “bag” of segments in which each segment is referred to as an instance. 
Instead of treating the instances equally, we associate each instance with a reliability variable to indicate its importance and then select reliable instances for training. To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss by exploiting low-level feature from visual information together with instance-event similarity based high-level semantic feature. Motivated by curriculum learning, we introduce a negative elastic-net regularization term to start training the classifier with instances of high reliability and gradually taking the instances with relatively low reliability into consideration. An alternative optimization algorithm is developed to solve the proposed challenging non-convex non-smooth problem. Experimental results on standard datasets, i.e., TRECVID MEDTest 2013 and TRECVID MEDTest 2014, demonstrate the effectiveness and superiority of the proposed method to the baseline algorithms. •A visual-semantic guided loss is proposed to measure reliability of instance for event detection.•Training begins with high-reliability instances and gradually added instances of low reliability.•Promising experimental results show the effectiveness and superiority of the proposed method.</abstract><pub>Elsevier Inc</pub><doi>10.1016/j.cviu.2021.103300</doi></addata></record>
fulltext fulltext
identifier ISSN: 1077-3142
ispartof Computer vision and image understanding, 2021-12, Vol.213, p.103300, Article 103300
issn 1077-3142
1090-235X
language eng
recordid cdi_crossref_primary_10_1016_j_cviu_2021_103300
source ScienceDirect Freedom Collection
subjects Complex event detection
Machine learning
Reliable shot identification
Visual-semantic guidance
title Reliable shot identification for complex event detection via visual-semantic embedding
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-01T00%3A53%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Reliable%20shot%20identification%20for%20complex%20event%20detection%20via%20visual-semantic%20embedding&rft.jtitle=Computer%20vision%20and%20image%20understanding&rft.au=Luo,%20Minnan&rft.date=2021-12&rft.volume=213&rft.spage=103300&rft.pages=103300-&rft.artnum=103300&rft.issn=1077-3142&rft.eissn=1090-235X&rft_id=info:doi/10.1016/j.cviu.2021.103300&rft_dat=%3Celsevier_cross%3ES1077314221001442%3C/elsevier_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c300t-6b30a34e692a9cf054d9393326004f110c6d1b2d60f19ff666aabaa383712e063%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true