QUICK REVIEW

[Paper Review] HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

Toan Nguyen, Yang Liu|arXiv (Cornell University)|Mar 2, 2026

Multimodal Machine Learning Applications | Citations: 0

One-line summary

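HyperTokens introduces a hypernetwork that generates task-conditioned prompt tokens on demand for continual VideoQA, trained with a meta-regularised look-ahead objective that suppresses forgetting while enabling cross-modal transfer.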

ABSTRACT

Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA->VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.
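
For orientation, the generic sharpness-aware minimisation objective that the abstract alludes to can be written as below. This is the standard form of that family of objectives, not the paper's LA-Reg objective, which this review does not reproduce; the symbols $L$, $w$, $\epsilon$ and $\rho$ are notation assumed here.

$$\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L(w + \epsilon)$$

The inner maximisation penalises weights whose loss rises sharply within a radius-$\rho$ neighbourhood, which is the sense in which such objectives favour flatter minima.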

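Motivation and objectives

Motivate continual VideoQA under distribution shift and show the limits of task-specific prompts.
Propose a scalable, memory-bounded token generator that synthesises task prompts on demand.
Develop regularisers and auxiliary losses that suppress forgetting while aligning multimodal signals.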

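Proposed method

Proposes a transformer-based token generator H_phi that produces an on-demand task prompt P^t from a low-dimensional task code z^t.
Introduces LookAhead-Regularisation (LA-Reg), which constrains generator drift across tasks using a short inner loop (M steps).
Learns the task code z^t with a lightweight encoder g_omega spanning video and question features, trained with a contrastive prototype loss.
Uses a task-code-conditioned prompt encoder to generate the prompt tokens fed to the multimodal LLM, whose backbone stays frozen during training.
Applies auxiliary training signals, such as a causal feasibility loss p(Q|V,A,P) and an InfoNCE-style surrogate for cross-modal mutual information, to regularise the tokens and video representations.
At test time, encoding with an SI-regularised encoder provides robust task-code retrieval without an explicit task ID.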
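
The review names an InfoNCE-style surrogate and task-ID-free task-code retrieval but gives no formulas for either. As a rough illustration only, the sketch below computes the standard InfoNCE loss for a single query and performs a nearest-prototype lookup. Everything specific here is an assumption rather than the paper's implementation: the use of cosine similarity, the temperature value, the function names `cosine`, `info_nce` and `nearest_prototype`, the toy vectors, and the idea that retrieval reduces to a nearest-prototype search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, keys, positive_index, temperature=0.1):
    """Standard InfoNCE for one query: cross-entropy of the positive key
    against all keys, using temperature-scaled cosine similarities."""
    logits = [cosine(query, k) / temperature for k in keys]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[positive_index]


def nearest_prototype(code, prototypes):
    """Index of the prototype most similar to the given task code
    (hypothetical retrieval rule, assumed for illustration)."""
    sims = [cosine(code, p) for p in prototypes]
    return max(range(len(prototypes)), key=sims.__getitem__)


# Toy 3-d features (hypothetical values, not taken from the paper).
video_feature = [0.9, 0.1, 0.0]
token_features = [
    [1.0, 0.0, 0.0],  # matched (positive) pair
    [0.0, 1.0, 0.0],  # mismatched
    [0.0, 0.0, 1.0],  # mismatched
]
loss_matched = info_nce(video_feature, token_features, positive_index=0)
loss_mismatched = info_nce(video_feature, token_features, positive_index=1)

task_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
retrieved_task = nearest_prototype([0.2, 0.8, 0.1], task_prototypes)
```

The loss is small when the query is closest to its positive key and grows when another key is closer, so minimising it pulls matched pairs together relative to the mismatched ones. With K identical keys it evaluates to log K, the chance-level value.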

Figure 1 : HyperTokens overview. ( Left ) Continual adaptation with HyperTokens for VideoQA and cross-modal transfer VisualQA $\rightarrow$ VideoQA. A fixed-size generator synthesises task-specific fine-tuning tokens. ( Middle ) Task-code learning via a multimodal contrastive objective with a protot

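Experimental results

Research questions

RQ1: Can HyperTokens improve continual VideoQA accuracy while substantially reducing forgetting, compared with strong prompt-based baselines?
RQ2: Do look-ahead regularisation and the causally aligned auxiliary losses improve cross-task stability and multimodal grounding?
RQ3: Is cross-modal transfer on ImageQA → VideoQA robust under HyperTokens, and how does it compare with state-of-the-art baselines?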

Key findings

Method	NExT-QA Acc	NExT-QA Fog	DramaQA Acc	DramaQA Fog
LLaMA-Adapter (ICLR’24)	46.58	13.83	60.99	24.39
L2P (CVPR’22)	48.82	12.25	62.50	20.67
DualPrompt (ECCV’22)	50.62	11.74	65.89	17.93
LAE (ICCV’23)	49.38	11.47	65.82	17.35
ProgPrompt (ICLR’23)	53.95	10.69	67.92	14.95
ColPro (EMNLP’24)	55.14	7.43	71.24	12.64
DAM (WACV’25)	53.88	9.99	67.37	15.19
Bisecle (NeurIPS’25)	62.37	5.34	71.49	10.37
HyperTokens (ours)	64.75	3.62	71.62	9.84

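On two continual VideoQA benchmarks (NExT-QA and DramaQA), HyperTokens attains higher average accuracy (Acc) and lower forgetting (Fog) than strong baselines.
On NExT-QA, HyperTokens reaches 64.75 Acc and 3.62 Fog, ahead of Bisecle (62.37 Acc, 5.34 Fog).
On DramaQA, HyperTokens reaches 71.62 Acc and 9.84 Fog, narrowly ahead of the strongest baseline, Bisecle (71.49 Acc, 10.37 Fog).
On ImageQA → VideoQA, HyperTokens outperforms Bisecle (60.07 vs 55.32 Acc; 4.97 vs 6.31 Fog), indicating robust cross-modal transfer with mitigated negative transfer.
In ablations, the contrastive task-code loss and the regularisation contribute most to accuracy and to reduced forgetting, and increasing the number of look-ahead steps improves performance consistently.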
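
To make the size of the reported margins easier to compare across settings, the short script below recomputes the HyperTokens vs Bisecle differences from the numbers quoted in these findings. The dictionary layout and the helper name `margins` are illustrative choices; only the figures themselves come from the review.

```python
# (Acc, Fog) pairs copied from the findings reported in this review.
results = {
    "NExT-QA": {"HyperTokens": (64.75, 3.62), "Bisecle": (62.37, 5.34)},
    "DramaQA": {"HyperTokens": (71.62, 9.84), "Bisecle": (71.49, 10.37)},
    "ImageQA -> VideoQA": {"HyperTokens": (60.07, 4.97), "Bisecle": (55.32, 6.31)},
}


def margins(benchmark):
    """Return (accuracy gain, forgetting change) of HyperTokens over Bisecle."""
    ours_acc, ours_fog = results[benchmark]["HyperTokens"]
    base_acc, base_fog = results[benchmark]["Bisecle"]
    # Round to two decimals to match the precision of the reported numbers.
    return round(ours_acc - base_acc, 2), round(ours_fog - base_fog, 2)


for name in results:
    acc_gain, fog_change = margins(name)
    print(f"{name}: Acc {acc_gain:+.2f}, Fog {fog_change:+.2f}")
```

The accuracy gain is largest in the cross-modal setting (+4.75) and smallest on DramaQA (+0.13), while forgetting drops in all three settings (-1.72, -0.53 and -1.34 respectively).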

Figure 2 : Geometry of LA-Reg in optimisation space. LA-Reg steers optimisation into the shared low-loss region (green)—a flatter minima basin across tasks—by balancing progress along the task- $t$ direction and alignment with the task- $(t\!-\!1)$ anchor direction. Note that we regularise in the ou

This review was generated by AI and checked by a human editor.