Discriminative Segment Focus Network for Fine-grained Video Action Recognition
Fine-grained video action recognition aims to identify minor yet discriminative variations among fine-grained categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions so as to effectively characterize inter-class and intra-class variations has been neglected, which is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) to mine the discriminability of segment correlations and localize discriminative action-relevant segments for fine-grained video action recognition. First, we propose a hierarchic correlation reasoning (HCR) module which explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting its correlations with other segments. Second, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing consistency between the discriminability and the classification confidence of a given segment through a consistency constraint. Finally, these localized segment representations are combined with the global action representation of the whole video to boost final recognition. Extensive experimental results on two fine-grained action recognition datasets, i.e., FineGym and Diving48, and two general action recognition datasets, i.e., Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with state-of-the-art methods.
Published in: | ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-05, Vol. 20 (7), pp. 1-20, Article 218 |
---|---|
Main Authors: | Sun, Baoli; Ye, Xinchen; Yan, Tiantian; Wang, Zhihui; Li, Haojie; Wang, Zhiyong |
Format: | Article |
Language: | English |
Subjects: | Activity recognition and understanding; Computing methodologies |
DOI: | 10.1145/3654671 |
ISSN: | 1551-6857 |
EISSN: | 1551-6865 |
Publisher: | New York, NY: ACM |
Online Access: | https://doi.org/10.1145/3654671 |
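The abstract describes a three-stage pipeline: HCR reasons over correlations between segments at multiple temporal scales, DSF scores segments and keeps the most action-relevant ones, and the selected segments are fused with a global video representation for classification. The sketch below is only meant to make that data flow concrete in PyTorch; every design detail here (attention-based correlation, the pooling scales, top-k selection, feature dimension, class count) is an assumption of this sketch, not the authors' published implementation, which should be consulted via the DOI above.

```python
# Minimal PyTorch sketch of the DSFNet pipeline as summarized in the abstract.
# All module internals are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicCorrelationReasoning(nn.Module):
    """Assumed stand-in for HCR: relate segments across several temporal
    scales and enhance each segment with those correlations."""

    def __init__(self, dim: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, seg_feats: torch.Tensor) -> torch.Tensor:
        # seg_feats: (B, T, D) per-segment features from a video backbone.
        outs = []
        for s in self.scales:
            # Coarser scale: average-pool neighbouring segments, then let each
            # original segment attend to the pooled ones (one guess at what
            # "multiple temporal scales" could mean).
            pooled = F.avg_pool1d(seg_feats.transpose(1, 2), kernel_size=s,
                                  stride=s, ceil_mode=True).transpose(1, 2)
            enhanced, _ = self.attn(seg_feats, pooled, pooled)
            outs.append(enhanced)
        return self.fuse(torch.cat(outs, dim=-1)) + seg_feats


class DiscriminativeSegmentFocus(nn.Module):
    """Assumed stand-in for DSF: score segments and keep the top-k
    most action-relevant ones."""

    def __init__(self, dim: int, num_classes: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.score = nn.Linear(dim, 1)               # discriminability score
        self.seg_cls = nn.Linear(dim, num_classes)   # per-segment confidence

    def forward(self, seg_feats: torch.Tensor):
        scores = self.score(seg_feats).squeeze(-1)   # (B, T)
        seg_logits = self.seg_cls(seg_feats)         # (B, T, C)
        idx = scores.topk(self.top_k, dim=1).indices # (B, K)
        picked = torch.gather(
            seg_feats, 1,
            idx.unsqueeze(-1).expand(-1, -1, seg_feats.size(-1)))
        # During training, a consistency loss between `scores` and the
        # per-segment classification confidence (e.g. max softmax of
        # seg_logits) would encode the paper's consistency constraint.
        return picked, scores, seg_logits


class DSFNetSketch(nn.Module):
    def __init__(self, dim=256, num_classes=99, top_k=2):  # placeholder sizes
        super().__init__()
        self.hcr = HierarchicCorrelationReasoning(dim)
        self.dsf = DiscriminativeSegmentFocus(dim, num_classes, top_k)
        self.cls = nn.Linear(dim * (top_k + 1), num_classes)

    def forward(self, seg_feats: torch.Tensor) -> torch.Tensor:
        # seg_feats: (B, T, D) segment features from any video backbone.
        enhanced = self.hcr(seg_feats)
        picked, _, _ = self.dsf(enhanced)
        global_feat = enhanced.mean(dim=1)           # whole-video representation
        fused = torch.cat([global_feat, picked.flatten(1)], dim=-1)
        return self.cls(fused)


if __name__ == "__main__":
    logits = DSFNetSketch()(torch.randn(2, 8, 256))  # 2 videos, 8 segments each
    print(logits.shape)                              # torch.Size([2, 99])
```

The sketch treats segment extraction and the consistency loss as external concerns; the number of retained segments (`top_k`) and the choice of correlation mechanism are exactly the kind of details that would need to come from the paper itself.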