QUICK REVIEW

[Paper Digest] VideoLLM: Modeling Video Sequence with Large Language Models

Chen Guo, Yin-Dong Zheng|arXiv (Cornell University)|May 22, 2023

Multimodal Machine Learning Applications|Cited by 14

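ONE-SENTENCE SUMMARY

VideoLLM uses a Modality Encoder and a Semantic Translator to map video sequences into a unified token stream, so that a decoder-only LLM can perform diverse video sequence understanding tasks under parameter-efficient fine-tuning.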

ABSTRACT

With the exponential growth of video data, there is an urgent need for automated technology to analyze and comprehend video content. However, existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks. The success of large language models (LLMs) like GPT has demonstrated their impressive abilities in sequence causal reasoning. Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding. VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence. This token sequence is then fed into a decoder-only LLM. Subsequently, with the aid of a simple task head, our VideoLLM yields an effective unified framework for different kinds of video understanding tasks. To evaluate the efficacy of VideoLLM, we conduct extensive experiments using multiple LLMs and fine-tuning methods. We evaluate our VideoLLM on eight tasks sourced from four different datasets. The experimental results demonstrate that the understanding and reasoning capabilities of LLMs can be effectively transferred to video understanding tasks. We release the code at https://github.com/cg1177/VideoLLM.

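MOTIVATION AND OBJECTIVES

Transfer the sequence reasoning abilities of large language models (LLMs) to video sequence understanding.
Develop a plug-and-play framework (Modality Encoder + Semantic Translator) that aligns the visual and text modalities.
Enable a decoder-only LLM to perform diverse video tasks with minimal task-specific customization.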

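PROPOSED METHOD

Encode the video into short-term visual units via temporal partitioning, then pool them into temporal tokens.
Use a lightweight Semantic Translator to convert visual semantics into language semantics.
Use a decoder-only LLM as a general video sequence reasoner, paired with task heads to support various tasks.
Adopt three fine-tuning schemes (basic tuning, partial tuning, PEFT) to adapt LLMs efficiently.
Evaluate on eight tasks across four datasets using a variety of LLMs (GPT-2, T5 Decoder, OPT, etc.).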
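
The data flow in the first two steps above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' implementation: the non-overlapping unit length, the feature sizes, the mean pooling, and the single linear projection standing in for the Semantic Translator are all assumptions chosen for readability, and the helper names (`split_into_units`, `pool_unit`, `translate`) are hypothetical.

```python
from typing import List

Vector = List[float]


def split_into_units(frames: List[Vector], unit_len: int) -> List[List[Vector]]:
    """Split per-frame features into consecutive, non-overlapping short-term units (assumed)."""
    return [frames[i:i + unit_len] for i in range(0, len(frames), unit_len)]


def pool_unit(unit: List[Vector]) -> Vector:
    """Mean-pool the frames of one unit into a single temporal token (pooling choice is assumed)."""
    dim = len(unit[0])
    return [sum(frame[d] for frame in unit) / len(unit) for d in range(dim)]


def translate(token: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Project a visual token to the LLM embedding size with one linear layer (assumed form)."""
    return [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]


# Toy input: 8 frames, each with a 2-dimensional visual feature.
frames = [[float(t), float(2 * t)] for t in range(8)]

units = split_into_units(frames, unit_len=4)    # 2 units of 4 frames each
visual_tokens = [pool_unit(u) for u in units]   # 2 temporal tokens of size 2

# Hypothetical translator weights mapping 2-d visual features to a 3-d "LLM" embedding.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
llm_inputs = [translate(tok, weight, bias) for tok in visual_tokens]

print(visual_tokens)  # [[1.5, 3.0], [5.5, 11.0]]
print(llm_inputs)     # [[1.5, 3.0, 5.0], [5.5, 11.0, 17.0]]
```

In the framework described here, a sequence like `llm_inputs` would then be passed to the decoder-only LLM and a task head; both are omitted from this sketch.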

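EXPERIMENTAL RESULTS

Research Questions

RQ1: Can frozen or lightly fine-tuned LLMs reason over video sequences when combined with a vision-to-language translator?
RQ2: How do different LLM architectures and fine-tuning methods affect performance across video sequence tasks?
RQ3: Is a single decoder-only LLM sufficient to handle both vision-only and vision-language video understanding tasks?
RQ4: How does VideoLLM scale across tasks as the number of LLM parameters increases?
RQ5: How do the proposed adaptation principles perform on the eight video tasks relative to baselines, as measured by task-specific metrics?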

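Key Findings

VideoLLM achieves competitive or state-of-the-art results on seven video sequence tasks compared with task-specific models.
Different base LLMs show task-dependent strengths: OPT generally excels at online action detection (OAD) and moment-related tasks, while the T5 Decoder performs better in dense prediction scenarios.
PEFT with prefix tuning improves OAD recall by about 1.3 points over basic tuning in the tested settings.
Increasing LLM size improves performance up to a point in some cases (for example, OPT-1.3B performs well), while very large models show diminishing returns in some settings.
Across tasks, VideoLLM uses roughly 2M to 15M trainable parameters, located mainly in the Semantic Translator and task head, which indicates parameter efficiency.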
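
To put the 2M to 15M figure in perspective, the snippet below computes what share of all parameters is trainable. It is back-of-the-envelope arithmetic only, assuming a frozen backbone of about 1.3B parameters (the scale of the OPT-1.3B model mentioned above); the paper's exact per-task counts are not reproduced here, and `trainable_fraction` is a hypothetical helper.

```python
def trainable_fraction(trainable: float, frozen_backbone: float) -> float:
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen_backbone)


BACKBONE = 1.3e9  # assumed frozen backbone size (OPT-1.3B scale)

for trainable in (2e6, 15e6):
    pct = 100 * trainable_fraction(trainable, BACKBONE)
    print(f"{trainable / 1e6:.0f}M trainable -> {pct:.2f}% of all parameters")
# 2M trainable -> 0.15% of all parameters
# 15M trainable -> 1.14% of all parameters
```

Under this assumption, roughly 0.15% to 1.14% of the weights are updated during training.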

This digest was generated by AI and reviewed by human editors.