Adaptive Spatial Location With Balanced Loss for Video Captioning

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2022-01, Vol. 32 (1), p. 17-30
Main Authors: Li, Linghui; Zhang, Yongdong; Tang, Sheng; Xie, Lingxi; Li, Xiaoyong; Tian, Qi
Format: Article
Language: English
Description
Summary: Many pioneering approaches have verified the effectiveness of using global temporal and local object information for video understanding tasks and have achieved significant progress. However, existing methods use object detectors to extract all objects over all video frames, which may degrade performance because the extracted information is redundant both spatially and temporally. To address this problem, we propose an adaptive spatial location module for the video captioning task, which dynamically predicts an important position in each video frame while the description sentence is being generated. The proposed adaptive spatial location method not only makes our model focus on local object information, but also reduces the time and memory consumption caused by temporal redundancy across many video frames and improves the accuracy of the generated descriptions. In addition, we propose a balanced loss function to address the class imbalance problem in the training data. The balanced loss assigns a different weight to each word of the ground-truth sentence during training, which helps the model generate more diversified description sentences. Extensive experimental results on the MSVD and MSR-VTT datasets show that the proposed method achieves competitive performance compared with state-of-the-art methods.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2020.3045735
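
Illustrative note: the summary describes a balanced loss that assigns a different weight to each ground-truth word, but this record does not give the formula. The sketch below is therefore only one plausible reading, not the authors' method: it assumes an inverse-log-frequency weighting so that rarer words count more, and applies those weights to a per-word negative log-likelihood. The function names `word_weights` and `balanced_loss`, the `smoothing` parameter, and the weighting rule itself are all hypothetical.

```python
import math
from collections import Counter


def word_weights(captions, smoothing=1.0):
    """Assumed weighting rule (not taken from the paper): rarer words get larger weights."""
    counts = Counter(word for caption in captions for word in caption)
    total = sum(counts.values())
    # log(total / count) grows as a word becomes rarer; smoothing keeps weights positive.
    return {word: math.log(total / count) + smoothing for word, count in counts.items()}


def balanced_loss(step_probs, target, weights):
    """Weighted negative log-likelihood of one ground-truth caption.

    step_probs: one dict per decoding step, mapping word -> predicted probability
    target:     list of ground-truth words, one per decoding step
    weights:    dict from word_weights(); unseen words default to weight 1.0
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for probs, word in zip(step_probs, target):
        weight = weights.get(word, 1.0)
        prob = max(probs.get(word, 0.0), 1e-12)  # clamp to avoid log(0)
        weighted_sum += -weight * math.log(prob)
        weight_total += weight
    # Normalise by the total weight so captions of different lengths are comparable.
    return weighted_sum / weight_total if weight_total else 0.0


if __name__ == "__main__":
    captions = [
        ["a", "man", "is", "cooking"],
        ["a", "man", "is", "running"],
        ["a", "dog", "is", "barking"],
    ]
    weights = word_weights(captions)
    predictions = [{"a": 0.9}, {"man": 0.9}, {"is": 0.9}, {"cooking": 0.1}]
    print(balanced_loss(predictions, captions[0], weights))
```

Under this assumed weighting, a mistake on a rare word such as "cooking" raises the loss more than the same mistake on a frequent word such as "a", which is the general behaviour the summary attributes to the balanced loss.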