Video Captioning Based on Joint Image-Audio Deep Learning Techniques

Bibliographic Details
Main Authors: Wang, Chien-Yao, Liaw, Pei-Sin, Liang, Kai-Wen, Wang, Jai-Ching, Chang, Pao-Chi
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
Description
Summary: With the advancement of technology, deep learning has been widely used in various multimedia applications. Herein, we apply this technology to video captioning. The proposed system uses different neural networks to extract features from image, audio, and semantic signals. Image and audio features are concatenated before being fed into a long short-term memory (LSTM) network for initialization. The joint audio-image features, combined with the semantic features, yield a network with better performance. The bilingual evaluation understudy (BLEU) algorithm, an automatic metric for scoring generated sentences, was used to score the output. We considered word-group lengths from one to four words; all BLEU scores increased by more than 1%, the CIDEr-D score increased by 2.27%, and the METEOR and ROUGE-L scores increased by 0.2% and 0.7%, respectively. The improvement is significant.
ISSN:2166-6822
DOI:10.1109/ICCE-Berlin47944.2019.8966173