Visual-semantic Alignment Temporal Parsing for Action Quality Assessment

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-10, p. 1-1
Main Authors: Gedamu, Kumie; Ji, Yanli; Yang, Yang; Shao, Jie; Shen, Heng Tao
Format: Article
Language: English
Description
Summary: Action Quality Assessment (AQA) is a challenging task involving analyzing fine-grained technical subactions, aligning high-level visual-semantic representations, and exploring internal temporal structures that capture the overall meaning of given action sequences. To address these challenges, we propose a Visual-semantic Alignment Temporal Parsing Network (VATP-Net) to understand the high-level visual semantics of subaction sequences and internal temporal structures without explicit supervision for action quality assessment. The proposed approach designs a self-supervised temporal parsing module to generate subaction sequences from the given video by aligning the visual and semantic action features. It captures high-level semantics and the internal temporal dynamics of subaction sequences. Furthermore, a multimodal interaction module is proposed to capture the interaction between different modalities of action features, enabling a comprehensive assessment of fine-grained and scene-invariant action details. The proposed module captures the intricate relationships and encourages interactions between different modalities within an action sequence, enhancing the overall understanding of action assessment. We exhaustively evaluate our proposed approach on the MTL-AQA, Rhythmic Gymnastics (RG), FineFS, and Fis-V datasets. Extensive experimental results demonstrate the effectiveness and feasibility of our proposed approach, which outperforms state-of-the-art methods by a significant margin.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3487242