
Online object tracking via motion-guided convolutional neural network (MGNet)

Bibliographic Details
Published in: Journal of Visual Communication and Image Representation, 2018-05, Vol. 53, pp. 180-191
Main Authors: Gan, Weihao; Lee, Ming-Sui; Wu, Chi-hao; Kuo, C.-C. (Jay)
Format: Article
Language:English
Description
Summary:
•Following the idea of tracking-by-detection (TBD), we investigate the advantages of motion cues in the online tracking problem. Temporal motion (the optical flow map) is very important for handling complicated motion scenarios such as articulated motion and fast motion.
•On the one hand, using a dynamic motion model to generate the correct candidate regions is essential for tracking. On the other hand, an accurate target location estimate also reduces the number of candidates and speeds up the tracking process.
•The spatial RGB and temporal optical flow inputs are combined and processed in a unified end-to-end trained network, rather than a two-branch processing network, to show the discriminative power of the tracking system.

Tracking-by-detection (TBD) is widely used in visual object tracking. However, many TBD-based methods ignore the strong motion correlation between the current and previous frames. In this work, a motion-guided convolutional neural network (MGNet) solution to online object tracking is proposed. The MGNet tracker is built upon the multi-domain convolutional neural network (MDNet) with two innovations: (1) a motion-guided candidate selection (MCS) scheme based on a dynamic prediction model, which generates candidate regions accurately and efficiently, and (2) spatial RGB and temporal optical flow combined as inputs and processed in a unified end-to-end trained network rather than a two-branch processing network. We compare the performance of MGNet, MDNet, and several state-of-the-art online object trackers on the OTB and VOT benchmark datasets, and the extensive performance evaluation demonstrates that MGNet captures the temporal correlation between consecutive frames in video more effectively.
ISSN: 1047-3203
1095-9076
DOI: 10.1016/j.jvcir.2018.03.016
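
The summary above names two mechanisms: motion-guided candidate selection driven by a dynamic prediction model, and a single end-to-end network fed with stacked RGB and optical-flow inputs. As a rough illustration only, the PyTorch sketch below pairs a toy constant-velocity predictor (a stand-in for the paper's dynamic prediction model) with a small convolutional network over a 5-channel (RGB + 2-channel flow) input; the names FlowGuidedTracker and motion_guided_candidates, the layer sizes, and the constant-velocity assumption are all hypothetical and do not reproduce the published MGNet/MDNet architecture.

    import torch
    import torch.nn as nn


    class FlowGuidedTracker(nn.Module):
        """One shared backbone over a stacked 5-channel (RGB + flow) input,
        rather than two separate processing branches fused later."""

        def __init__(self):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(5, 32, kernel_size=7, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(64, 2)  # target-vs-background score

        def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
            x = torch.cat([rgb, flow], dim=1)   # (N, 3 + 2, H, W)
            feat = self.backbone(x).flatten(1)  # (N, 64)
            return self.classifier(feat)        # (N, 2)


    def motion_guided_candidates(prev_center, velocity, n=64, std=8.0):
        """Sample candidate box centers around a constant-velocity prediction,
        so fewer (and better-placed) candidates are needed than with a static
        Gaussian around the previous target location."""
        predicted = torch.tensor(prev_center) + torch.tensor(velocity)
        return predicted + std * torch.randn(n, 2)


    if __name__ == "__main__":
        net = FlowGuidedTracker()
        rgb = torch.randn(8, 3, 107, 107)   # candidate RGB patches
        flow = torch.randn(8, 2, 107, 107)  # matching optical-flow patches
        print(net(rgb, flow).shape)                                       # [8, 2]
        print(motion_guided_candidates((50.0, 60.0), (2.0, -1.0)).shape)  # [64, 2]

In this sketch, scoring the candidates returned by motion_guided_candidates with the network gives the basic tracking-by-detection loop the summary describes; the actual paper additionally trains the network end to end across domains, following MDNet.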