[Paper Digest] QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference

Miao Zhang, Ruixiao Zhang | arXiv (Cornell University) | Feb 23, 2026

Multimodal Machine Learning Applications | Cited by 0

ONE-SENTENCE SUMMARY

QuickGrasp enables local-first video-language querying with on-demand edge augmentation to match large VLM accuracy while greatly reducing response delay (up to 12.8x).

ABSTRACT

Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.

RESEARCH MOTIVATION AND OBJECTIVES

Build responsive video-querying services that balance local processing with edge augmentation to meet QoS requirements.
Bridge the accuracy gap between small, locally deployable VLMs and large remote VLMs.
Reduce, through architectural design, the end-to-end response delay caused by video tokenization and remote inference.
Provide a modular, edge-assisted pipeline that reuses vision representations to minimize redundant computation.

PROPOSED METHOD

Accelerated video tokenization via keyframe-aligned sampling and pipelined video-to-token conversion to reduce decoding and sampling delays.
Query-adaptive edge augmentation that reuses local vision tokens at the edge to avoid reprocessing video data.
Confidence-based routing that calibrates local model confidence using temperature scaling to decide edge offloading.
QoS-aware token density configuration that treats token density as a tunable parameter via a contextual multi-armed bandit to balance accuracy and delay.
Prototype implementation of QuickGrasp and evaluation on multiple video understanding benchmarks showing latency reductions with preserved accuracy.
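
The confidence-based routing step listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the local model exposes per-answer logits for a multiple-choice query, fits a single temperature by grid search on held-out data, and offloads to the edge when the calibrated top probability falls below a threshold. The function names (`fit_temperature`, `route`), the grid range, and the 0.8 threshold are all hypothetical choices made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, label in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[label])
    return loss / len(labels)


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature that minimizes held-out NLL (simple grid search)."""
    if candidates is None:
        # Assumed search grid: 0.5, 0.6, ..., 5.0
        candidates = [0.5 + 0.1 * i for i in range(46)]
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))


def route(logits, temperature, threshold=0.8):
    """Answer locally if calibrated confidence is high, otherwise offload.

    Returns ("local", answer_index) or ("edge", None).
    The threshold is an assumed value, not one reported in the paper.
    """
    probs = softmax(logits, temperature)
    confidence = max(probs)
    if confidence >= threshold:
        return "local", probs.index(confidence)
    return "edge", None
```

In this sketch, an overconfident local model ends up with a fitted temperature above 1, which flattens its probabilities and sends more borderline queries to the edge model.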

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a local-first video-language querying system achieve the accuracy of large remote VLMs while significantly reducing end-to-end latency?
RQ2: How can accelerated tokenization and edge augmentation be orchestrated to minimize communication without sacrificing task accuracy?
RQ3: Can confidence calibration and CMAB-based token-density control effectively decide when to offload to edge inference?
RQ4: What is the impact of shared vision representations on cross-model collaboration for edge-augmented VLM inference?

KEY FINDINGS

The system can match the accuracy of large VLMs while reducing response delay by up to 12.8x.
Video tokenization is a major source of latency, especially for long videos, and accelerations in this stage yield substantial gains.
Sharing vision representations across local and edge models reduces redundant computation and enables efficient edge augmentation.
Calibrated confidence with temperature scaling improves routing decisions for edge augmentation, reducing misclassification risk.
A CMAB-based adaptive token density configuration effectively balances accuracy and delay across query types.
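
To make the idea of treating token density as a bandit arm concrete, here is a small sketch of a contextual multi-armed bandit. It is an assumption-laden illustration rather than the paper's algorithm: the digest does not specify the CMAB variant, the context features, or the reward, so this example uses a per-context epsilon-greedy policy, a hypothetical string context such as a query type, and an assumed reward of correctness minus a weighted delay. The class name `TokenDensityBandit` and all parameters are invented for this example.

```python
import random
from collections import defaultdict


class TokenDensityBandit:
    """Per-context epsilon-greedy bandit over candidate token densities."""

    def __init__(self, densities, epsilon=0.1, delay_weight=0.5, seed=0):
        self.densities = list(densities)   # e.g. vision tokens per frame
        self.epsilon = epsilon             # exploration probability
        self.delay_weight = delay_weight   # assumed accuracy/delay trade-off
        self.rng = random.Random(seed)
        n_arms = len(self.densities)
        # One table of pull counts and mean rewards per context.
        self.counts = defaultdict(lambda: [0] * n_arms)
        self.values = defaultdict(lambda: [0.0] * n_arms)

    def select(self, context):
        """Return the index of the density to use for this context."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.densities))  # explore
        vals = self.values[context]
        return max(range(len(vals)), key=vals.__getitem__)  # exploit

    def update(self, context, arm, correct, delay_s):
        """Fold one observed outcome into the running mean reward."""
        reward = (1.0 if correct else 0.0) - self.delay_weight * delay_s
        self.counts[context][arm] += 1
        n = self.counts[context][arm]
        self.values[context][arm] += (reward - self.values[context][arm]) / n
```

A caller would invoke `select` with the query's context before tokenization, run inference at `densities[arm]`, and then call `update` with the observed correctness signal and delay in seconds. Under this assumed reward, a context whose queries are answered correctly at low density will settle on the cheaper arm, while contexts that need more visual detail will drift toward denser arms.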

This digest was generated by AI and reviewed by a human editor.