
AMS-CNN: Attentive multi-stream CNN for video-based crowd counting

Bibliographic Details
Published in: International Journal of Multimedia Information Retrieval, 2021-12, Vol. 10 (4), p. 239-254
Main Authors: Tripathy, Santosh Kumar; Srivastava, Rajeev
Format: Article
Language: English
Description
Summary: In recent years, video-based crowd counting and density estimation (CCDE) has become essential for crowd analysis. Current approaches rarely exploit spatial–temporal features for CCDE, and they usually take no measures to minimize the influence of a frame's background when obtaining crowd density maps, which results in lower performance in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). Moreover, the response of each individual feature set toward crowd counting is also neglected. To this end, we design an end-to-end trainable attentive multi-stream convolutional neural network (AMS-CNN) for crowd counting. First, a multi-stream CNN (MS-CNN) is designed to obtain crowd density maps. The MS-CNN comprises three streams that fuse deep spatial, temporal, and spatial-foreground features from different cues of the crowd video dataset: frames, volumes of frames, and foregrounds of frames. To improve accuracy, we design three stream-wise attention modules that generate attentive crowd density maps, whose relative average is computed by a relative averaged attentive density-map (RAAD) layer. The relative averaged density map is concatenated with the MS-CNN output and passed through two-stage CNN blocks to produce the final density map. Experiments are conducted on three publicly available crowd video datasets: Mall, UCSD, and Venice. The proposed model achieves promising results, outperforming state-of-the-art approaches in terms of MAE and RMSE.
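The pipeline described in the summary (three cue-specific streams, stream-wise attention maps averaged into a RAAD-style map, and a refinement stage over the concatenated maps) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the layer widths, the single-convolution sigmoid attention heads, the plain mean standing in for the relative average, and all class and parameter names (AttentiveStream, AMSCNNSketch, feat_ch, the input channel counts) are assumptions introduced only to show the data flow.

# Hypothetical sketch only; layer sizes and attention design are assumptions,
# not the implementation from the paper.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU; spatial size is preserved.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class AttentiveStream(nn.Module):
    # One cue-specific stream: deep features plus a sigmoid-gated attentive density map.
    def __init__(self, in_ch, feat_ch=32):
        super().__init__()
        self.features = conv_block(in_ch, feat_ch)
        self.attention = nn.Sequential(nn.Conv2d(feat_ch, 1, 1), nn.Sigmoid())
        self.density = nn.Conv2d(feat_ch, 1, 1)

    def forward(self, x):
        f = self.features(x)
        return f, self.attention(f) * self.density(f)

class AMSCNNSketch(nn.Module):
    # Frame, frame-volume, and foreground streams; MS-CNN fusion; averaging of the
    # attentive maps (RAAD stand-in); two refinement blocks for the final density map.
    def __init__(self, frame_ch=3, volume_ch=9, fg_ch=1, feat_ch=32):
        super().__init__()
        self.spatial = AttentiveStream(frame_ch, feat_ch)
        self.temporal = AttentiveStream(volume_ch, feat_ch)
        self.foreground = AttentiveStream(fg_ch, feat_ch)
        self.ms_head = nn.Conv2d(3 * feat_ch, 1, 1)           # fused MS-CNN density map
        self.refine = nn.Sequential(conv_block(2, 16), nn.Conv2d(16, 1, 1))

    def forward(self, frame, volume, foreground):
        fs, a_s = self.spatial(frame)
        ft, a_t = self.temporal(volume)
        ff, a_f = self.foreground(foreground)
        ms_map = self.ms_head(torch.cat([fs, ft, ff], dim=1))
        raad = (a_s + a_t + a_f) / 3.0                        # plain mean stands in for RAAD
        return self.refine(torch.cat([ms_map, raad], dim=1))

if __name__ == "__main__":
    model = AMSCNNSketch()
    frame = torch.randn(1, 3, 120, 160)    # single RGB frame
    volume = torch.randn(1, 9, 120, 160)   # e.g. three stacked RGB frames as a volume
    fg = torch.randn(1, 1, 120, 160)       # foreground mask of the frame
    density = model(frame, volume, fg)
    print(density.shape, density.sum().item())  # count is typically the sum of the density map

Training such a model would typically minimize a pixel-wise loss between predicted and ground-truth density maps, with the crowd count recovered by summing the predicted map; MAE and RMSE are then computed on that count.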
ISSN: 2192-6611, 2192-662X
DOI: 10.1007/s13735-021-00220-7