Human Vision Attention Mechanism-Inspired Temporal-Spatial Feature Pyramid for Video Saliency Detection

Bibliographic Details
Published in: Cognitive Computation, 2023-05, Vol. 15 (3), pp. 856-868
Main Authors: Chang, Qinyao; Zhu, Shiping
Format: Article
Language: English
Description
Summary: Inspired by the human visual attention mechanism, we note that the human visual system uses multilevel features to extract accurate visual saliency information; multilevel features are therefore important for saliency detection. Building on the numerous biological frameworks for visual information processing, we find that better combination and use of multilevel features together with temporal information can greatly improve the accuracy of a video saliency model. We therefore propose TSFP-Net, a temporal-spatial feature pyramid network. Its encoder extracts multiscale temporal-spatial features from consecutive input video frames and then constructs a temporal-spatial feature pyramid through temporal-spatial convolution and top-down feature integration. The decoder performs hierarchical decoding of the temporal-spatial features at different scales and finally produces a saliency map by integrating information across multiple video frames. Compared with state-of-the-art methods, TSFP-Net achieves much higher prediction precision with a simple structure, the second smallest model size, and the third fastest running time. The model is simple yet effective and runs in real time. Extensive experiments indicate that the well-designed structure significantly improves the precision of video saliency detection, and results on three purely visual video saliency benchmarks demonstrate that our method outperforms existing state-of-the-art methods.
ISSN: 1866-9956, 1866-9964
DOI: 10.1007/s12559-023-10114-x
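
The summary describes a 3D encoder that produces multiscale temporal-spatial features and fuses them top-down into a feature pyramid, plus a decoder that hierarchically combines the pyramid levels into a single saliency map. The sketch below is a minimal illustration of that general pattern in PyTorch, not the authors' implementation: the backbone blocks, channel widths, pooling layout, and the simple temporal averaging used to collapse the clip into one map are all hypothetical choices made only to make the data flow concrete.

# Minimal illustrative sketch (not the authors' code) of a temporal-spatial
# feature pyramid for video saliency. All layer sizes, channel counts, and
# the way time is collapsed are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalSpatialBlock(nn.Module):
    """One 3D-convolutional stage: temporal-spatial conv + spatial downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))  # halve H and W, keep time

    def forward(self, x):
        return self.pool(self.conv(x))


class TSFPSketch(nn.Module):
    """Encoder -> top-down temporal-spatial feature pyramid -> hierarchical decoder."""
    def __init__(self, channels=(32, 64, 128, 256), pyramid_ch=64):
        super().__init__()
        # Encoder: stack of temporal-spatial blocks producing multiscale features.
        blocks, in_ch = [], 3
        for out_ch in channels:
            blocks.append(TemporalSpatialBlock(in_ch, out_ch))
            in_ch = out_ch
        self.encoder = nn.ModuleList(blocks)
        # Lateral 1x1x1 convs map every scale to a common channel width.
        self.lateral = nn.ModuleList([nn.Conv3d(c, pyramid_ch, 1) for c in channels])
        # Per-level smoothing after top-down fusion.
        self.smooth = nn.ModuleList(
            [nn.Conv3d(pyramid_ch, pyramid_ch, 3, padding=1) for _ in channels]
        )
        # Decoder head: after collapsing time, predict a single-channel saliency map.
        self.head = nn.Sequential(
            nn.Conv2d(pyramid_ch, pyramid_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(pyramid_ch, 1, 1),
        )

    def forward(self, clip):
        # clip: (batch, 3, T, H, W) tensor of consecutive video frames.
        feats, x = [], clip
        for block in self.encoder:
            x = block(x)
            feats.append(x)

        # Top-down integration: start at the coarsest level, upsample, and add.
        pyramid = [self.lateral[-1](feats[-1])]
        for i in range(len(feats) - 2, -1, -1):
            lat = self.lateral[i](feats[i])
            up = F.interpolate(pyramid[0], size=lat.shape[2:], mode="trilinear",
                               align_corners=False)
            pyramid.insert(0, lat + up)
        pyramid = [s(p) for s, p in zip(self.smooth, pyramid)]

        # Hierarchical decoding: bring every level to the finest resolution and sum.
        target = pyramid[0].shape[2:]
        fused = sum(
            F.interpolate(p, size=target, mode="trilinear", align_corners=False)
            for p in pyramid
        )
        # Integrate over the temporal dimension, then predict the saliency map.
        saliency = self.head(fused.mean(dim=2))
        # Upsample to the input frame resolution and squash to [0, 1].
        saliency = F.interpolate(saliency, size=clip.shape[-2:], mode="bilinear",
                                 align_corners=False)
        return torch.sigmoid(saliency)


if __name__ == "__main__":
    model = TSFPSketch()
    frames = torch.randn(1, 3, 8, 128, 192)  # one clip of 8 RGB frames
    print(model(frames).shape)               # -> torch.Size([1, 1, 128, 192])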