Object-centric Video Representation for Long-term Action Anticipation

Bibliographic Details
Main Authors:	Zhang, Ce; Fu, Changcheng; Wang, Shijie; Agarwal, Nakul; Lee, Kwonjoon; Choi, Chiho; Sun, Chen
Format:	Conference Proceeding
Language:	English
Subjects:	Algorithms; Benchmark testing; Computational modeling; Computer architecture; Computer vision; Detectors; Pipelines; Predictive models; Video recognition and understanding

Description
Summary:	This paper focuses on building object-centric representations for long-term action anticipation in videos. Our key motivation is that objects provide important cues to recognize and predict human-object interactions, especially when the predictions are longer term, as an observed "background" object could be used by the human actor in the future. We observe that existing object-based video recognition frameworks either assume the existence of in-domain supervised object detectors or follow a fully weakly-supervised pipeline to infer object locations from action labels. We propose to build object-centric video representations by leveraging visual-language pretrained models. This is achieved by "object prompts", an approach to extract task-specific object-centric representations from general-purpose pretrained models without finetuning. To recognize and predict human-object interactions, we use a Transformer-based neural architecture which allows the "retrieval" of relevant objects for action anticipation at various time scales. We conduct extensive evaluations on the Ego4D, 50Salads, and EGTEA Gaze+ benchmarks. Both quantitative and qualitative results confirm the effectiveness of our proposed method. Our code is available at github.com/brown-palm/ObjectPrompt.
ISSN:	2642-9381
DOI:	10.1109/WACV57701.2024.00661