QUICK REVIEW

[Paper Review] VCT: A Video Compression Transformer

Fabian Mentzer, George Toderici|arXiv (Cornell University)|Jun 15, 2022

Advanced Vision and Imaging|Cited by 39

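One-Sentence Summary

This paper replaces motion prediction and warping with a transformer-based temporal entropy model that compresses video by encoding each frame into a representation and predicting that representation's distribution for entropy coding, achieving state-of-the-art rate-distortion performance on standard datasets without motion-specific architectural biases.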

ABSTRACT

We show how transformers can be used to vastly simplify neural video compression. Previous methods have been relying on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distribution of future representations given the past. The resulting video compression transformer outperforms previous methods on standard video compression data sets. Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research.

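Research Motivation and Goals

Push neural video compression toward removing hand-crafted architectural biases.
Propose a transformer-based temporal entropy model that predicts the distribution of frame representations.
Show that independent per-frame encoding combined with transformer-based context outperforms previous motion-based methods on standard datasets.
Demonstrate robustness to diverse temporal patterns through experiments on synthetic data.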

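Proposed Method

Independently encode each frame x_i into a quantized representation y_i with an image encoder E, and reconstruct it with a decoder D.
Use a transformer-based model to predict P(y_i | y_{i-2}, y_{i-1}), which drives lossless entropy coding of y_i.
Split y_i into blocks to obtain tokens, and run separate transformers to model temporal and spatial context.
Train in three stages (Stage I: rate-distortion (RD) training of E and D; Stage II: training the transformer-based PMF predictor; Stage III: joint fine-tuning with the RD loss).
Optionally apply a latent residual predictor (LRP) to improve reconstruction without propagating temporal errors.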
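The block tokenization and the link between a predicted PMF and the bitrate can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `split_into_blocks` and `estimated_bits`, the (H, W, C) latent layout, the non-overlapping 2x2 blocks, and the toy PMFs are all assumptions chosen for illustration. The sketch only shows the principle that an entropy coder spends roughly -log2 p bits on a symbol to which the model assigned probability p, so a sharper and correct prediction lowers the rate.

```python
import numpy as np


def split_into_blocks(y, block):
    """Split a (H, W, C) latent into non-overlapping block x block patches.

    Returns an array of shape (num_blocks, block * block, C), where each
    row of a patch is one token. Layout and block size are illustrative
    assumptions, not values taken from the paper.
    """
    h, w, c = y.shape
    if h % block or w % block:
        raise ValueError("H and W must be divisible by the block size")
    y = y.reshape(h // block, block, w // block, block, c)
    y = y.transpose(0, 2, 1, 3, 4)  # (block_row, block_col, i, j, C)
    return y.reshape(-1, block * block, c)


def estimated_bits(symbols, pmf):
    """Ideal code length, in bits, of `symbols` under a predicted PMF.

    symbols: int array of shape (N,), values in [0, K).
    pmf:     float array of shape (N, K); each row sums to 1.
    """
    p = pmf[np.arange(symbols.size), symbols]
    return float(-np.log2(p).sum())


# Toy latent: 4 x 4 spatial positions, 2 channels.
y = np.arange(4 * 4 * 2).reshape(4, 4, 2)
blocks = split_into_blocks(y, block=2)

# Three symbols from a 4-letter alphabet, coded with two hypothetical models.
symbols = np.array([0, 1, 2])
uniform_pmf = np.full((3, 4), 0.25)       # no useful context
sharp_pmf = np.full((3, 4), 0.1)
sharp_pmf[np.arange(3), symbols] = 0.7    # confident, correct prediction

bits_uniform = estimated_bits(symbols, uniform_pmf)
bits_sharp = estimated_bits(symbols, sharp_pmf)
```

In this toy setup the uniform model costs 2 bits per symbol, while the sharper model costs fewer, which is the mechanism by which a better temporal entropy model reduces bitrate without changing the reconstruction.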

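Experimental Results

Research Questions

RQ1: Can a transformer-based temporal entropy model replace motion prediction and warping in neural video compression?
RQ2: How well can a context of two past frames, combined with block-wise autoregressive tokens, support effective entropy coding of frame representations?
RQ3: How do context length and latent residual prediction affect rate-distortion performance?
RQ4: Can the transformer-based model generalize to synthetic temporal patterns (panning, blurring, fading) that traditional priors do not explicitly encode?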

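Key Findings

VCT outperforms previous neural video compression methods on standard datasets in both PSNR and MS-SSIM, without relying on motion or warping priors.
Using two past frames as context substantially reduces bitrate compared with no temporal context, and latent residual prediction yields additional gains.
On synthetic data, the method handles diverse temporal patterns (panning, blurring, fading) better than baselines that rely on motion priors.
Latency and runtime analysis shows competitive decoding speed across several resolutions with TPU-based inference.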

This review was generated by AI and reviewed by human editors.