VideoLLM-online: Online Video Large Language Model for Streaming Video

Bibliographic Details
Main Authors: Chen, Joya, Lv, Zhaoyang, Wu, Shiwei, Lin, Kevin Qinghong, Song, Chenan, Gao, Difei, Liu, Jia-Wei, Gao, Ziteng, Mao, Dongxing, Shou, Mike Zheng
Format: Conference Proceeding
Language: English
Subjects:
Online Access: Request full text
Description
Summary: Recent Large Language Models (LLMs) have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models (LMMs) typically treat videos as predetermined clips, rendering them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time dialogue within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up interactive chat in real-world video streams. With our LIVE framework, we develop a simplified model called VideoLLM-online and demonstrate its significant advantages in processing streaming videos. For instance, our VideoLLM-online-7B model can operate at over 10 FPS on an A100 GPU for a 5-minute video clip from Ego4D narration. Moreover, VideoLLM-online also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at showlab.github.io/videollm-online.
ISSN:2575-7075
DOI:10.1109/CVPR52733.2024.01742