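[Paper Digest] DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
DoraemonGPT is an LLM-driven agent for dynamic video tasks. It uses task-related symbolic memory, sub-task tools, external knowledge, and a Monte Carlo Tree Search planner to explore multiple solutions and produce an improved answer; on NExT-QA it outperforms several baselines.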
Recent LLM-driven visual agents mainly focus on solving image-based tasks, which limits their ability to understand dynamic scenes, making it far from real-life applications like guiding students in laboratory experiments and identifying their mistakes. Hence, this paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Considering the video modality better reflects the ever-changing nature of real-world scenarios, we exemplify DoraemonGPT as a video agent. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. This structured representation allows for spatial-temporal querying and reasoning by well-designed sub-task tools, resulting in concise intermediate results. Recognizing that LLMs have limited internal knowledge when it comes to specialized domains (e.g., analyzing the scientific principles underlying experiments), we incorporate plug-and-play tools to assess external knowledge and address tasks across different domains. Moreover, a novel LLM-driven planner based on Monte Carlo Tree Search is introduced to explore the large planning space for scheduling various tools. The planner iteratively finds feasible solutions by backpropagating the result's reward, and multiple solutions can be summarized into an improved final answer. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios. The code will be released at https://github.com/z-x-yang/DoraemonGPT.
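Motivation and Objectives
- Motivate and address the need to understand dynamic scenes, going beyond static images.
- Propose a memory- and tool-based LLM framework for dynamic video tasks.
- Use an MCTS planner to efficiently explore the large planning space and generate multiple feasible solutions.
- Integrate external knowledge sources to extend domain understanding beyond the model's internal knowledge.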
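Proposed Method
- Convert the input video into a task-related symbolic memory that contains space-dominant and time-dominant attributes.
- Introduce LLM-driven sub-task tools that query the symbolic memory and reason over it (e.g., why, how, when, what, counting).
- Integrate external knowledge sources through knowledge tools (symbolic, textual, web) to meet domain-specific needs.
- Use a Monte Carlo Tree Search planner to explore multiple solution paths, backpropagate rewards, and summarize multiple feasible answers.
- Adopt a memory-augmented, plug-and-play architecture that is compatible with various foundation models and video applications.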
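The first two points above (a symbolic memory that sub-task tools can query) can be illustrated with a minimal sketch. It assumes the memory is stored as a small relational table and queried with SQL through Python's standard `sqlite3` module. The table name `instances`, its columns, the sample rows, and the helpers `build_memory`, `count_tool`, and `when_tool` are all hypothetical; in the system described here an LLM would drive the tools, whereas this sketch hard-codes the queries.

```python
import sqlite3

# Hypothetical space-dominant memory: one row per object instance per sampled frame.
# Schema and values are illustrative assumptions, not the paper's actual schema.
ROWS = [
    # (object_id, category, frame, time_s, action)
    (1, "person", 0, 0.0, "standing"),
    (1, "person", 30, 1.0, "pouring"),
    (2, "beaker", 0, 0.0, "on table"),
    (2, "beaker", 30, 1.0, "being filled"),
    (3, "person", 30, 1.0, "watching"),
]

def build_memory(rows):
    """Create an in-memory table holding the task-related attributes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances ("
        "object_id INTEGER, category TEXT, frame INTEGER, time_s REAL, action TEXT)"
    )
    conn.executemany("INSERT INTO instances VALUES (?, ?, ?, ?, ?)", rows)
    return conn

def count_tool(conn, category):
    """'Counting' sub-task tool: number of distinct instances of a category."""
    (n,) = conn.execute(
        "SELECT COUNT(DISTINCT object_id) FROM instances WHERE category = ?",
        (category,),
    ).fetchone()
    return n

def when_tool(conn, action):
    """'When' sub-task tool: earliest timestamp of an action, or None if absent."""
    (t,) = conn.execute(
        "SELECT MIN(time_s) FROM instances WHERE action = ?", (action,)
    ).fetchone()
    return t

memory = build_memory(ROWS)
print(count_tool(memory, "person"))   # 2
print(when_tool(memory, "pouring"))   # 1.0
```

Each tool returns a short value rather than raw video content, which is the sense in which structured memory leads to concise intermediate results.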
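Experimental Results
Research Questions
- RQ1: How can dynamic video content be effectively converted into task-related symbolic memory for reasoning?
- RQ2: Can an MCTS-based planner efficiently explore the large planning space of sub-task tool executions in dynamic video tasks?
- RQ3: Does integrating external knowledge improve the factual correctness of video-based reasoning beyond what the LLM's internal knowledge provides?
- RQ4: How does the proposed DoraemonGPT perform on dynamic video reasoning benchmarks compared with existing LLM-based and supervised models?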
Key Findings
| Method | Split | Acc_C | Acc_T | Acc_D | Avg | Acc_A |
|---|---|---|---|---|---|---|
| HME | val | 46.2 | 48.2 | 58.3 | 50.9 | 48.7 |
| VQA-T | val | 41.7 | 44.1 | 60.0 | 48.6 | 45.3 |
| ATP | val | 53.1 | 50.2 | 66.8 | 56.7 | 54.3 |
| VGT | val | 52.3 | 55.1 | 64.1 | 57.2 | 55.0 |
| VGT | s_val | 49.7 | 53.3 | 63.7 | 55.6 | 55.6 |
| MIST | val | 54.6 | 56.6 | 66.9 | 59.3 | 57.2 |
| MIST | s_val | 51.7 | 55.3 | 67.0 | 58.0 | 58.0 |
| ViperGPT | ICCV | 29.7 | 37.3 | 47.3 | 38.1 | 38.1 |
| ViperGPT | s_val | 33.0 | 40.1 | 48.8 | 40.8 | 40.8 |
| VideoChat | s_val | 46.7 | 45.3 | 61.0 | 51.0 | - |
| DoraemonGPT | s_val | 52.3 | 45.7 | 64.0 | 54.0 | 54.0 |
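To make the s_val comparison in the table concrete, the short sketch below computes the per-metric gap between DoraemonGPT and ViperGPT. The numbers are copied from the two s_val rows above; the variable names are illustrative. It also checks that, for the DoraemonGPT row, the reported average matches the mean of the three per-type accuracies, which appears to be how that column is derived.

```python
# Values copied from the s_val rows of the table above.
METRICS = ["Acc_C", "Acc_T", "Acc_D", "Avg"]
viper_gpt = [33.0, 40.1, 48.8, 40.8]
doraemon_gpt = [52.3, 45.7, 64.0, 54.0]

# Per-metric gain of DoraemonGPT over ViperGPT, in accuracy points.
gains = {m: round(d - v, 1) for m, d, v in zip(METRICS, doraemon_gpt, viper_gpt)}
print(gains)  # {'Acc_C': 19.3, 'Acc_T': 5.6, 'Acc_D': 15.2, 'Avg': 13.2}

# Mean of the causal, temporal, and descriptive accuracies for DoraemonGPT.
mean_of_types = round(sum(doraemon_gpt[:3]) / 3, 1)
print(mean_of_types)  # 54.0
```

The largest gap is on causal questions (Acc_C) and the smallest on temporal questions (Acc_T).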
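- DoraemonGPT achieves competitive results on NExT-QA and outperforms ViperGPT on s_val in causal, temporal, and descriptive reasoning (52.3 vs. 33.0, 45.7 vs. 40.1, and 64.0 vs. 48.8).
- On NExT-QA s_val, DoraemonGPT scores 52.3 (Acc_C), 45.7 (Acc_T), and 64.0 (Acc_D), for an average of 54.0 and an Acc_A of 54.0.
- The MCTS planner, working with task-related memory, explores multiple feasible solutions, and accuracy improves as the number of solutions N increases (e.g., from 1 solution to 4).
- Combining space-dominant and time-dominant memory gives the best performance, confirming that dynamic questions need both kinds of memory.
- DoraemonGPT shows robustness in in-the-wild scenarios by integrating external knowledge and summarizing multiple intermediate results.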
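The planner behavior referred to in these findings (explore several tool chains, backpropagate each result's reward, and stop once N feasible solutions are collected) can be sketched with a generic UCB1-based Monte Carlo Tree Search over a toy problem. This is a simplified illustration under stated assumptions, not the paper's exact planner: the tool names, `MAX_DEPTH`, and `toy_reward` are hypothetical stand-ins for real tool execution, and the final LLM step that summarizes the solutions into one answer is omitted.

```python
import math
import random

TOOLS = ["what", "when", "count", "answer"]  # hypothetical tool names
MAX_DEPTH = 3                                # assumed cap on chain length

class Node:
    def __init__(self, plan, parent=None):
        self.plan = plan        # tuple of tools chosen so far
        self.parent = parent
        self.children = {}      # tool name -> Node
        self.visits = 0
        self.value = 0.0        # sum of backpropagated rewards

    def is_terminal(self):
        return bool(self.plan) and (
            self.plan[-1] == "answer" or len(self.plan) >= MAX_DEPTH)

    def untried(self):
        return [t for t in TOOLS if t not in self.children]

def toy_reward(plan):
    """Stand-in for executing a tool chain: succeed only if the chain
    counts something before answering (an arbitrary toy rule)."""
    return 1.0 if plan[-1] == "answer" and "count" in plan[:-1] else 0.0

def select(node):
    """Descend by UCB1 until a terminal or not-fully-expanded node is reached."""
    while not node.is_terminal() and not node.untried():
        node = max(
            node.children.values(),
            key=lambda c: c.value / c.visits
            + math.sqrt(2 * math.log(node.visits) / c.visits))
    return node

def rollout(plan, rng):
    """Complete a partial plan with random tools until it terminates."""
    plan = list(plan)
    while not (plan and (plan[-1] == "answer" or len(plan) >= MAX_DEPTH)):
        plan.append(rng.choice(TOOLS))
    return tuple(plan)

def backpropagate(node, reward):
    """Add the reward to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def plan_with_mcts(n_solutions, budget=200, seed=0):
    """Collect up to n_solutions distinct feasible plans within the budget."""
    rng = random.Random(seed)
    root = Node(())
    solutions = []
    for _ in range(budget):
        node = select(root)
        if not node.is_terminal():          # expand one untried tool
            tool = rng.choice(node.untried())
            node.children[tool] = Node(node.plan + (tool,), parent=node)
            node = node.children[tool]
        plan = rollout(node.plan, rng)
        reward = toy_reward(plan)
        backpropagate(node, reward)
        if reward > 0 and plan not in solutions:
            solutions.append(plan)
        if len(solutions) >= n_solutions:
            break
    return solutions

for plan in plan_with_mcts(n_solutions=4):
    print(" -> ".join(plan))
```

Raising `n_solutions` makes the search continue past the first successful chain, which mirrors the idea of gathering several candidate solutions before producing a final answer.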
This digest was generated by AI and reviewed by a human editor.