
Stateful human-centered visual captioning system to aid video surveillance

Bibliographic Details
Published in: Computers & Electrical Engineering, 2019-09, Vol. 78, pp. 108-119
Main Authors: Saleem, Summra, Dilawari, Aniqa, Khan, Usman Ghani, Iqbal, Razi, Wan, Shaohua, Umer, Tariq
Format: Article
Language: English
Description
Summary:
•This paper describes caption-generation approaches that integrate attention in two variants: a soft attention technique and a hard attention technique.
•Motivated by recent advances in caption generation and the use of attention in machine translation, we investigate models that attend to the salient parts of a video and generate its caption.
•By visualizing what and where the attention is focused, we interpret the results of this framework.
•We also validate the advantages of attention in caption generation by achieving state-of-the-art performance on a benchmark dataset.
The study of Natural Language Generation (NLG), especially how human beings narrate the world, assists in understanding the visual world for surveillance. Our research proposes an effective technique to automatically generate multi-line textual descriptions of visual data by exploiting deep Convolutional Neural Networks (CNNs). Textual descriptions provide textual tags for visual information, so a human can retrieve selected videos from a repository based on those tags. Videos contain more complex and detailed information than images and therefore provide more language data. The proposed feats-rich model encodes the visual contents into visual and facial features using a CNN architecture. The encoded features are passed to two-layer LSTM units with an attention mechanism, reducing the number of parameters by encompassing only the relevant details. Experimental results on the TRECVID 2016 and UET-Surveillance datasets show that the model outperforms state-of-the-art methods, with BLEU scores of 0.35 and 0.52, respectively.
ISSN: 0045-7906, 1879-0755
DOI: 10.1016/j.compeleceng.2019.07.009
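The summary above describes per-frame CNN features being decoded by two-layer LSTM units with an attention mechanism. The sketch below is a minimal illustration of that general pipeline, assuming PyTorch; the class names, feature dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: soft attention over per-frame CNN features feeding a
# two-layer LSTM caption decoder. Feature vectors stand in for CNN outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over frame features."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, n_frames, feat_dim), hidden: (batch, hidden_dim)
        energy = torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1))
        weights = F.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, n_frames)
        context = (weights.unsqueeze(-1) * feats).sum(dim=1)          # (batch, feat_dim)
        return context, weights

class CaptionDecoder(nn.Module):
    """Two-layer LSTM decoder conditioned on an attended visual context."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = SoftAttention(feat_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (batch, n_frames, feat_dim); captions: (batch, seq_len) token ids
        batch = feats.size(0)
        h = torch.zeros(2, batch, self.lstm.hidden_size, device=feats.device)
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            context, _ = self.attention(feats, h[-1])   # attend with top-layer state
            step_in = torch.cat([self.embed(captions[:, t]), context], dim=-1).unsqueeze(1)
            out, (h, c) = self.lstm(step_in, (h, c))
            logits.append(self.out(out.squeeze(1)))
        return torch.stack(logits, dim=1)                # (batch, seq_len, vocab)

# Usage sketch: random frame features stand in for real CNN outputs.
feats = torch.randn(4, 20, 2048)              # 4 clips, 20 frames, 2048-d features
captions = torch.randint(0, 1000, (4, 12))    # teacher-forced token ids
model = CaptionDecoder(vocab_size=1000)
print(model(feats, captions).shape)           # torch.Size([4, 12, 1000])
```

This shows the soft-attention variant only; a hard-attention variant, as mentioned in the highlights, would instead sample a single frame per step rather than taking a weighted average.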