Loading…
Robust Video-Based Person Re-Identification by Hierarchical Mining
Video-based person re-identification (Re-ID) aims at retrieving the person through the video sequences across non-overlapping cameras. Some characteristics of pedestrians are not consecutive across frames due to the variations of viewpoints, postures, and occlusions over time. However, existing meth...
Saved in:
Published in: | IEEE transactions on circuits and systems for video technology 2022-12, Vol.32 (12), p.8179-8191 |
---|---|
Main Authors: | , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Citations: | Items that this one cites Items that cite this one |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Video-based person re-identification (Re-ID) aims at retrieving the person through the video sequences across non-overlapping cameras. Some characteristics of pedestrians are not consecutive across frames due to the variations of viewpoints, postures, and occlusions over time. However, existing methods ignore such data peculiarity and the networks tend to only learn those salient consecutive characteristics among frames in video sequences. As a result, the learned representations fail to cover all the characteristics of pedestrians, thus lacking integrity and discrimination. To tackle this problem, we present a novel deep architecture termed Hierarchical Mining Network (HMN), which mines as many pedestrians' characteristics by referring to the temporal and intra-class knowledge. It consists of a novel Attentive Temporal Module (ATM) and a Dynamic Supervising Branch (DSB), with a Balancing Triplet Loss (BTL) assisting the training. The proposed ATM, with pedestrian perceiving capacity, is capable of evaluating each activation of features through temporal analysis, so that the temporally scattered characteristics of pedestrians can be better aggregated and the contaminated ones can be eliminated. Then, the DSB along with the BTL further enhances the integrity of representations by multiple supervision. Specifically, the DSB perceives the diversities of intra-class samples in each mini-batch and generates targeted supervising signals for them, in which process the BTL guarantees the signals with smaller intra-class variations and larger inter-class variations. Comprehensive experiments on two video-based datasets, i.e., MARS, and DukeMTMC-VideoReID, demonstrate the contribution of each component and the superiority of the proposed HMN over the state-of-the-arts. Benchmarking our model on three popular image-based datasets, i.e., Market1501, DukeMTMC-Reid, and MSMT17 additionally verifies the promising generalizability of the proposed DSB and BTL. |
---|---|
ISSN: | 1051-8215 1558-2205 |
DOI: | 10.1109/TCSVT.2021.3076097 |