Video Captioning With Adaptive Attention and Mixed Loss Optimization
Published in: IEEE Access, 2019, Vol. 7, pp. 135757-135769
Main Authors:
Format: Article
Language: English
Summary: The attention mechanism and the sequence-to-sequence framework have brought promising advances to the temporal task of video captioning. However, imposing the attention mechanism on non-visual words, such as "of" and "the", may mislead the decoder and degrade the overall performance of video captioning. Furthermore, the traditional sequence-to-sequence framework optimizes the model with a word-level cross-entropy loss, which leads to the exposure bias problem: at test time the model predicts the next word from its own previously generated words, whereas during training it maximizes the likelihood of the next ground-truth word given the true previous words. To address these issues, we propose the reinforced adaptive attention model (RAAM), which integrates an adaptive attention mechanism with long short-term memory to flexibly use visual signals and language information as needed. The model is trained with both a word-level loss and a sentence-level loss, exploiting the advantages of each and alleviating the exposure bias problem by directly optimizing the sentence-level metric with a reinforcement learning algorithm. In addition, a novel training method is proposed for mixed loss optimization. Experiments on the Microsoft Video Description corpus (MSVD) and the challenging MPII Movie Description dataset (MPII-MD) demonstrate that the proposed RAAM, using only a single feature, achieves results competitive with or superior to existing state-of-the-art video captioning models.
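To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: an adaptive-attention gate that blends visual and language information, and a mixed objective combining word-level cross entropy with a sentence-level REINFORCE term using a greedy-decoding baseline. The names `sentinel`, `beta`, `sentence_reward` inputs, and the mixing weight `gamma` are hypothetical; the paper defines its own gating and training schedule.

```python
import torch
import torch.nn.functional as F

def adaptive_context(visual_ctx, sentinel, beta):
    """Adaptive attention gate: blend the attended visual context with a
    language 'sentinel' vector. beta in [0, 1] is the learned weight on
    language information (beta near 1 means the decoder leans on the
    language model, e.g. for non-visual words such as 'of' and 'the')."""
    return beta * sentinel + (1.0 - beta) * visual_ctx

def mixed_loss(logits, targets, sample_logp, sample_reward, greedy_reward,
               gamma=0.7):
    """Mixed objective: word-level cross entropy plus a sentence-level
    REINFORCE term with a greedy-decoding baseline (self-critical style).

    logits:        (batch, steps, vocab) decoder outputs under teacher forcing
    targets:       (batch, steps) ground-truth word ids
    sample_logp:   (batch,) summed log-probability of a sampled caption
    sample_reward: (batch,) sentence-level metric score of the sampled caption
    greedy_reward: (batch,) same metric for the greedy caption (the baseline)
    gamma:         hypothetical mixing weight between the two losses
    """
    # Word-level loss: maximize the likelihood of each ground-truth word.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
    # Sentence-level loss: raise the probability of sampled captions that
    # score above the greedy baseline, lower it for those that score below.
    advantage = (sample_reward - greedy_reward).detach()
    rl = -(advantage * sample_logp).mean()
    # Mixed loss: trade off direct metric optimization against likelihood.
    return gamma * rl + (1.0 - gamma) * ce
```

Here the reward would be a sentence-level caption metric such as CIDEr, and the greedy baseline follows the common self-critical recipe; this matches the abstract's description of directly optimizing the sentence-level metric with reinforcement learning, though the exact mixing scheme is the paper's own.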
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2019.2942000