PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition

Bibliographic Details
Main Authors: Zhang, Haosong; Leong, Mei Chee; Li, Liyuan; Lin, Weisi
Format: Conference Proceeding
Language: English
Online Access: Request full text
Description
Summary: Recent progress in Vision-Language (VL) foundation models has revealed the great advantages of cross-modality learning. However, due to a large gap between vision and text, they might not be able to sufficiently utilize the benefits of cross-modality information. In the field of human action recognition, the additional pose modality may bridge the gap between vision and text to improve the effectiveness of cross-modality learning. In this paper, we propose a novel framework, called the Pose-enhanced Vision-Language (PeVL) model, to adapt the VL model with the pose modality to learn effective knowledge of fine-grained human actions. Our PeVL model includes two novel components: an Unsymmetrical Cross-Modality Refinement (UCMR) block and a Semantic-Guided Multi-level Contrastive (SGMC) module. The UCMR block includes Pose-guided Visual Refinement (P2V-R) and Visual-enriched Pose Refinement (V2P-R) for effective cross-modality learning. The SGMC module includes Multi-level Contrastive Associations of vision-text and pose-text at both action and sub-action levels, and a Semantic-Guided Loss, enabling effective contrastive learning with text. Built upon a pre-trained VL foundation model, our model integrates trainable adapters and can be trained end-to-end. Our novel PeVL design over the VL foundation model yields remarkable performance gains on four fine-grained human action recognition datasets, achieving a new SOTA with significantly fewer FLOPs for low-cost re-training.
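
The SGMC module described above pairs both vision and pose embeddings with text embeddings at the action and sub-action levels. As a rough illustration only, and not the authors' released code, the PyTorch sketch below shows how such multi-level contrastive associations could be composed from symmetric InfoNCE terms; the function names (info_nce, sgmc_loss), tensor shapes, the temperature value, and the sub-action weighting w_sub are all assumptions for illustration, and the paper's Semantic-Guided Loss is not modeled here.

import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Symmetric contrastive loss between two batches of aligned embeddings,
    # each of shape (B, D); matched pairs share the same batch index.
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature  # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def sgmc_loss(vis_act, pose_act, txt_act,
              vis_sub, pose_sub, txt_sub,
              w_sub: float = 0.5) -> torch.Tensor:
    # Vision-text and pose-text associations at the action level ...
    action_level = info_nce(vis_act, txt_act) + info_nce(pose_act, txt_act)
    # ... and again at the sub-action level; w_sub is an assumed weighting,
    # not a value taken from the paper.
    sub_action_level = info_nce(vis_sub, txt_sub) + info_nce(pose_sub, txt_sub)
    return action_level + w_sub * sub_action_level
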
ISSN: 2575-7075
DOI: 10.1109/CVPR52733.2024.01784