Discriminative Segment Focus Network for Fine-grained Video Action Recognition

Fine-grained video action recognition aims at identifying minor and discriminative variations among fine categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions to effectively characterize inter-class and intra-class variations has been neglected, which is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) to mine the discriminability of segment correlations and localize discriminative action-relevant segments for fine-grained video action recognition. Firstly, we propose a hierarchic correlation reasoning (HCR) module which explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting the correlations with other segments. Secondly, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing the consistency between the discriminability and the classification confidence of a given segment with a consistency constraint. Finally, these localized segment representations are combined with the global action representation of the whole video for boosting final recognition. Extensive experimental results on two fine-grained action recognition datasets, i.e., FineGym and Diving48, and two action recognition datasets, i.e., Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with the state-of-the-art methods.
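
The abstract describes DSFNet at a component level: an HCR module that relates segments across temporal scales, and a DSF module that localizes the most action-relevant segments by tying each segment's discriminability to its classification confidence, before fusing them with the global video representation. The sketch below illustrates only that score-select-fuse idea in simplified form; it is not the authors' code, and the module names, layer sizes, and loss terms in it are assumptions made for illustration.

```python
# Minimal, illustrative sketch (PyTorch assumed): score each temporal segment,
# keep the top-k "most action-relevant" ones, fuse them with a global clip feature,
# and add a rough consistency term between segment scores and per-segment
# classification confidence. Not the authors' HCR/DSF implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentFocusSketch(nn.Module):
    def __init__(self, feat_dim=512, num_classes=99, top_k=2):
        super().__init__()
        self.score_head = nn.Linear(feat_dim, 1)                # per-segment discriminability score
        self.seg_cls = nn.Linear(feat_dim, num_classes)         # per-segment classifier
        self.fusion_cls = nn.Linear(2 * feat_dim, num_classes)  # global + selected segments
        self.top_k = top_k

    def forward(self, segment_feats, labels=None):
        # segment_feats: (B, T, D) features of T temporal segments per video
        scores = self.score_head(segment_feats).squeeze(-1)     # (B, T) discriminability
        seg_logits = self.seg_cls(segment_feats)                # (B, T, C)

        # Localize the top-k highest-scoring segments
        top_idx = scores.topk(self.top_k, dim=1).indices        # (B, k)
        picked = torch.gather(
            segment_feats, 1,
            top_idx.unsqueeze(-1).expand(-1, -1, segment_feats.size(-1)))  # (B, k, D)

        # Fuse selected segments with the global (mean-pooled) video representation
        fused = torch.cat([segment_feats.mean(dim=1), picked.mean(dim=1)], dim=-1)
        logits = self.fusion_cls(fused)                         # (B, C)

        loss = None
        if labels is not None:
            # Rough stand-in for the consistency constraint: segment scores should
            # track the segment classifier's confidence in the ground-truth class.
            conf = F.softmax(seg_logits, dim=-1).gather(
                2, labels.view(-1, 1, 1).expand(-1, seg_logits.size(1), 1)).squeeze(-1)
            consistency = F.mse_loss(torch.sigmoid(scores), conf.detach())
            loss = F.cross_entropy(logits, labels) + consistency
        return logits, loss


# Example usage with random features: 8 videos, 4 segments, 512-d features each
if __name__ == "__main__":
    model = SegmentFocusSketch()
    feats = torch.randn(8, 4, 512)
    labels = torch.randint(0, 99, (8,))
    logits, loss = model(feats, labels)
    print(logits.shape, loss.item())
```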

Bibliographic Details
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-05, Vol. 20 (7), p. 1-20, Article 218
Main Authors: Sun, Baoli; Ye, Xinchen; Yan, Tiantian; Wang, Zhihui; Li, Haojie; Wang, Zhiyong
Format: Article
Language: English
Publisher: ACM (New York, NY)
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3654671
Subjects: Activity recognition and understanding; Computing methodologies