QUICK REVIEW

[Paper Review] QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference

Miao Zhang, Ruixiao Zhang | arXiv (Cornell University) | Feb 23, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

QuickGrasp enables local-first video-language querying with on-demand edge augmentation to match large VLM accuracy while greatly reducing response delay (up to 12.8x).

ABSTRACT

Video-language models (VLMs) are reshaping video querying services, bringing unified solutions to complex perception and reasoning tasks. However, deploying large VLMs in real-world systems remains challenging due to their high resource demands, and remote-based deployment often results in unacceptable response delays. Although small, locally deployable VLMs offer faster responses, they unavoidably fall short in accuracy. To reconcile this trade-off, we propose QuickGrasp, a responsive, quality of service (QoS)-aware system that bridges this gap through a local-first architecture with on-demand edge augmentation. Built upon the highly modular architecture of VLMs, QuickGrasp shares the vision representation across model variants to avoid redundant computation. To maximize system-wide efficiency, QuickGrasp introduces three key designs: accelerated video tokenization, query-adaptive edge augmentation, and delay-aware, accuracy-preserving vision token density configuration. We implement a prototype of QuickGrasp and evaluate it across multiple video understanding benchmarks. The results show that QuickGrasp matches the accuracy of large VLMs while achieving up to a 12.8x reduction in response delay. QuickGrasp represents a key advancement toward building responsive video querying services for open-world understanding that fully leverage the capabilities of VLMs.

Research Motivation and Goals

  • Build responsive video-querying services that balance local processing with edge augmentation to meet QoS needs.
  • Bridge the accuracy gap between small, locally deployable VLMs and large remote VLMs.
  • Reduce the end-to-end response delay caused by video tokenization and remote inference through architectural design.
  • Provide a modular, edge-assisted pipeline that reuses vision representations to minimize redundant computation.

Proposed Method

  • Accelerated video tokenization via keyframe-aligned sampling and pipelined video-to-token conversion to reduce decoding and sampling delays.
  • Query-adaptive edge augmentation that reuses local vision tokens at the edge to avoid reprocessing video data.
  • Confidence-based routing that calibrates local model confidence using temperature scaling to decide edge offloading.
  • QoS-aware token density configuration that treats token density as a tunable parameter via a contextual multi-armed bandit to balance accuracy and delay.
  • Prototype implementation of QuickGrasp and evaluation on multiple video understanding benchmarks showing latency reductions with preserved accuracy.
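
The keyframe-aligned sampling idea in the first bullet can be illustrated with a small sketch. Keyframes (I-frames) can be decoded without first decoding the frames that precede them, so snapping each sampling target to a nearby keyframe avoids that extra decoding work. The code below is only an illustration under assumptions: the helper names `uniform_targets` and `snap_to_keyframes`, the midpoint-based uniform sampling, and the nearest-keyframe rule are hypothetical and are not taken from the paper, whose exact procedure is not described in this review.

```python
from bisect import bisect_left


def uniform_targets(duration_s, num_frames):
    """Midpoints of num_frames equal-length segments (assumed sampling policy)."""
    step = duration_s / num_frames
    return [(i + 0.5) * step for i in range(num_frames)]


def snap_to_keyframes(targets, keyframe_times):
    """Replace each ascending target timestamp with its nearest keyframe timestamp.

    keyframe_times must be sorted in ascending order. Consecutive duplicates
    are dropped, so fewer frames than targets may be returned.
    """
    if not keyframe_times:
        raise ValueError("keyframe_times must not be empty")
    chosen = []
    for t in targets:
        i = bisect_left(keyframe_times, t)
        # Candidates: the keyframe just before t and the one at or after t.
        neighbours = keyframe_times[max(i - 1, 0): i + 1]
        nearest = min(neighbours, key=lambda k: abs(k - t))
        if not chosen or nearest != chosen[-1]:
            chosen.append(nearest)
    return chosen


if __name__ == "__main__":
    targets = uniform_targets(10.0, 5)
    print(snap_to_keyframes(targets, [0.0, 2.5, 5.0, 7.5]))
```

In practice the keyframe timestamps would come from the container or codec metadata; how they are obtained is outside the scope of this sketch.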

Experimental Results

Research Questions

  • RQ1: Can a local-first video-language querying system achieve the accuracy of large remote VLMs while significantly reducing end-to-end latency?
  • RQ2: How can accelerated tokenization and edge augmentation be orchestrated to minimize communication without sacrificing task accuracy?
  • RQ3: Can confidence calibration and contextual multi-armed bandit (CMAB)-based token-density control effectively decide when to offload to edge inference?
  • RQ4: What is the impact of shared vision representations on cross-model collaboration for edge-augmented VLM inference?
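
To make the confidence-calibration question concrete, here is a minimal sketch of temperature-scaled routing. It assumes the local model exposes raw answer logits, that a held-out validation set is available for fitting the temperature, and that a fixed confidence threshold decides offloading. The grid search, the 0.7 threshold, and the function names `fit_temperature` and `route` are illustrative assumptions, not details from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a temperature above 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fit_temperature(logit_rows, labels, candidates=None):
    """Pick the temperature with the lowest validation NLL (simple grid search)."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # roughly 0.5 to 5.0

    def nll(t):
        return -sum(math.log(softmax(row, t)[y])
                    for row, y in zip(logit_rows, labels))

    return min(candidates, key=nll)


def route(logits, temperature, threshold=0.7):
    """Answer locally if calibrated confidence clears the threshold, else offload."""
    confidence = max(softmax(logits, temperature))
    return ("local" if confidence >= threshold else "edge"), confidence


if __name__ == "__main__":
    # Hypothetical validation set: the model is very confident but wrong 1 time in 4.
    t = fit_temperature([[5.0, 0.0]] * 4, [0, 0, 1, 0])
    print(t, route([4.0, 1.0, 0.5], t))
```

The point of the sketch is that an overconfident local model yields a fitted temperature above 1, which lowers its reported confidence and sends more borderline queries to the edge.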

Key Findings

  • The system can match the accuracy of large VLMs while reducing response delay by up to 12.8x.
  • Video tokenization is a major source of latency, especially for long videos, and accelerations in this stage yield substantial gains.
  • Sharing vision representations across local and edge models reduces redundant computation and enables efficient edge augmentation.
  • Calibrated confidence with temperature scaling improves routing decisions for edge augmentation, reducing misclassification risk.
  • A CMAB-based adaptive token density configuration effectively balances accuracy and delay across query types.
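
The token-density finding can be pictured with a toy contextual bandit. This review does not say which bandit algorithm, context features, or reward the paper uses, so the sketch below swaps in a plain epsilon-greedy learner over discrete query types, with a hypothetical reward of correctness minus a weighted delay. The class name `EpsilonGreedyDensityBandit`, the `reward` helper, and the density values are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class EpsilonGreedyDensityBandit:
    """Toy contextual bandit: one running mean reward per (context, density)."""

    def __init__(self, densities, epsilon=0.1, seed=0):
        self.densities = list(densities)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, density) -> number of pulls
        self.values = defaultdict(float)  # (context, density) -> mean reward

    def select(self, context):
        # Explore with probability epsilon.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.densities)
        # Try every density at least once per context before exploiting.
        untried = [d for d in self.densities if self.counts[(context, d)] == 0]
        if untried:
            return untried[0]
        return max(self.densities, key=lambda d: self.values[(context, d)])

    def update(self, context, density, reward_value):
        key = (context, density)
        self.counts[key] += 1
        # Incremental mean update.
        self.values[key] += (reward_value - self.values[key]) / self.counts[key]


def reward(correct, delay_s, delay_weight=0.1):
    """Hypothetical reward: 1 for a correct answer, minus a delay penalty."""
    return (1.0 if correct else 0.0) - delay_weight * delay_s


if __name__ == "__main__":
    bandit = EpsilonGreedyDensityBandit([16, 64, 256], epsilon=0.0)
    # Simulated environment: "counting" needs at least 64 tokens per frame,
    # "yes_no" works with 16; delay grows linearly with density.
    for _ in range(10):
        for ctx, needed in (("counting", 64), ("yes_no", 16)):
            d = bandit.select(ctx)
            bandit.update(ctx, d, reward(d >= needed, d / 64.0))
    print(bandit.select("counting"), bandit.select("yes_no"))
```

In this simulated setting the learner settles on the lowest density that still answers each query type correctly, which is the accuracy-versus-delay balance the finding describes.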


This review was generated by AI and reviewed by human editors.