
KD-VSUM: A Vision Guided Models for Multimodal Abstractive Summarization with Knowledge Distillation

Bibliographic Details
Main Authors: Zheng, Zehong, Li, Changlong, Hu, Wenxin, Wang, Su
Format: Conference Proceeding
Language: English
Description
Summary: Multimodal abstractive summarization is attracting increasing attention due to its ability to synthesize information from different source modalities and generate high-quality text summaries. Concurrently, there has been significant development in multimodal abstractive summarization models for videos, which extract information from multimodal data and generate abstractive summaries. Most existing approaches concentrate primarily on instructional videos, such as those teaching sports or life skills, which limits their ability to capture the complexity of dynamic environments in the general world. In this paper, we propose a vision-guided model for multimodal abstractive summarization with knowledge distillation (KD-VSUM) to address the lack of generalized video-domain capability in video summarization. The approach includes a vision-guided encoder, which enables the model to better focus on the global spatial and temporal information of video frames, and we capitalize on knowledge distillation from multimodal pre-trained video-language models to enhance model performance. We also introduce the VersaVision dataset, which covers a broader range of video domains and a higher proportion of medium-to-long videos. The results demonstrate that our model surpasses existing state-of-the-art models on the VersaVision dataset, achieving improvements of 1.7 in ROUGE-1, 1.8 in ROUGE-2, and 2.0 in ROUGE-L. These findings underscore the substantial improvements that integrating global vision guidance and knowledge distillation can bring to the task of video summarization.
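
To illustrate the knowledge-distillation idea mentioned in the abstract, the sketch below shows one common way a student summarizer can be trained on both reference summaries and the soft token distributions of a pre-trained video-language teacher. This is a minimal, generic example, not the paper's actual implementation; the function name, temperature, and weighting factor alpha are assumptions made for exposition only.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target_ids: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Combine cross-entropy on reference summary tokens with a KL term
    that pulls the student's distribution toward the teacher's (sketch)."""
    # Standard summarization loss against ground-truth summary tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         target_ids.view(-1), ignore_index=-100)
    # Soft-target loss: KL divergence between temperature-scaled distributions,
    # rescaled by temperature**2 as is conventional in distillation.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd

In such a setup, alpha balances supervision from the reference summaries against imitation of the teacher; how KD-VSUM actually combines these signals is described in the full paper.
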
ISSN: 2161-4407
DOI: 10.1109/IJCNN60899.2024.10651189