QUICK REVIEW

[Paper Review] MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

De-An Huang, Zhiding Yu|arXiv (Cornell University)|Aug 3, 2022

Advanced Image and Video Retrieval Techniques | Cited by 28

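One-Sentence Summary

MinVIS trains only a query-based image instance segmentation model and tracks instances across frames through online query matching, achieving state-of-the-art video instance segmentation without any video-based training procedure.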

ABSTRACT

We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures. By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP. Since MinVIS treats frames in training videos as independent images, we can drastically sub-sample the annotated frames in training videos without any modifications. With only 1% of labeled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is that queries trained to be discriminative between intra-frame object instances are temporally consistent and can be used to track instances without any manually designed heuristics. MinVIS thus has the following inference pipeline: we first apply the trained query-based image instance segmentation to video frames independently. The segmented instances are then tracked by bipartite matching of the corresponding queries. This inference is done in an online fashion and does not need to process the whole video at once. MinVIS thus has the practical advantages of reducing both the labeling costs and the memory requirements, while not sacrificing the VIS performance. Code is available at: https://github.com/NVlabs/MinVIS

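Motivation and Objectives

Show that competitive VIS performance is achievable without video-based training or architectures.
Show that query embeddings learned from image instance segmentation provide temporally consistent representations that are useful for tracking instances across frames via query matching.
Show that training with very few annotated frames (around 1%) remains competitive on YouTube-VIS and excels on occlusion-heavy data (OVIS).
Analyze how intra-frame separation of queries, combined with inter-frame temporal consistency, enables tracking without hand-crafted heuristics.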

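Proposed Method

Train a query-based image instance segmentation model (Image Encoder + Transformer Decoder) on independent frames.
Require that segmentation masks be generated by convolving the query embeddings Q with the final image feature map F_{-1}: M = sigmoid(Q * F_{-1}).
Track instances online by bipartite matching of query embeddings between consecutive frames, using cosine similarity as the matching score and the Hungarian algorithm to solve the assignment.
Use no video-based training loss for supervision; only image-based losses (classification loss and mask loss) with bipartite matching are used.
Allow heavy sub-sampling of annotations (down to 1%) without changing the model or the training procedure.
Optionally, compare pure query-based tracking against heuristic post-processing to test whether hand-crafted tracking rules are necessary.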
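
The mask-generation and query-matching steps listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses NumPy and SciPy instead of the released code, the function names `predict_masks`, `match_queries`, and `track_online` are hypothetical, and it assumes every frame yields the same number of queries, with query embeddings shaped `(N, C)` and feature maps shaped `(C, H, W)`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def predict_masks(queries, features):
    """Mask probabilities M = sigmoid(Q * F_{-1}) for one frame.

    queries:  (N, C) query embeddings (assumed shape)
    features: (C, H, W) final image feature map (assumed shape)
    Using each query as a 1x1 convolution kernel is a per-pixel dot product.
    """
    logits = np.einsum("nc,chw->nhw", queries, features)
    return 1.0 / (1.0 + np.exp(-logits))


def match_queries(prev_queries, curr_queries):
    """Return idx such that curr_queries[idx[i]] is matched to prev_queries[i]."""
    prev = prev_queries / np.linalg.norm(prev_queries, axis=1, keepdims=True)
    curr = curr_queries / np.linalg.norm(curr_queries, axis=1, keepdims=True)
    cost = 1.0 - prev @ curr.T  # cosine distance between every query pair
    _, col_ind = linear_sum_assignment(cost)  # Hungarian-style assignment
    return col_ind


def track_online(frames):
    """Yield per-frame masks in a consistent instance order.

    frames: iterable of (queries, features) pairs, one per video frame.
    Only the previous frame's reordered queries are kept in memory.
    """
    prev = None
    for queries, features in frames:
        if prev is not None:
            queries = queries[match_queries(prev, queries)]
        prev = queries
        yield predict_masks(queries, features)
```

Because `track_online` is a generator that remembers only the previous frame's queries, it can be fed frames one at a time, which mirrors the online inference described in the abstract. Row `i` of every yielded mask array then refers to the same tracked instance.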

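Experimental Results

Research Questions

RQ1: Can competitive video instance segmentation (VIS) be achieved without video-based training or architectures?
RQ2: Do query embeddings learned from image instance segmentation provide temporally consistent representations suitable for tracking across frames?
RQ3: How does training with sparse frame annotations affect VIS performance on standard and occlusion-heavy datasets?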

Key Findings

Method	Backbone	Training	AP	AP50	AP75	AR1	AR10
TeViT	R50	Full	42.1	67.8	44.8	41.3	49.4
TeViT	MsgShifT	Full	46.6	71.3	51.6	44.9	54.3
SeqFormer	R50	Full	45.1	66.9	50.5	45.6	54.6
SeqFormer	R50	Full+C80k	47.4	69.8	51.8	45.5	54.8
Mask2Former-VIS	R50	Full	46.4	68.0	50.0	–	–
MinVIS	R50	Full	47.4	69.0	52.1	45.7	55.7
TeViT	Swin-L	Full	56.8	80.6	63.1	52.0	63.3
SeqFormer	Swin-L	Full+C80k	59.3	82.1	66.4	51.7	64.4
Mask2Former-VIS	Swin-L	Full	60.4	84.4	67.0	–	–
MinVIS	Swin-L	Full	61.6	83.3	68.6	54.8	66.6
MinVIS	Swin-L	1%	59.0	81.6	64.7	54.0	64.0
MinVIS	Swin-L	5%	59.3	81.4	65.8	53.8	64.1
MinVIS	Swin-L	10%	61.0	83.0	67.7	54.6	66.1

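With a ResNet-50 backbone, MinVIS reaches 47.4 AP on YouTube-VIS 2019, against 46.4 for the Mask2Former-VIS baseline; with Swin-L, MinVIS reaches 61.6 AP on YouTube-VIS 2019, 55.3 AP on YouTube-VIS 2021, and 39.4 AP on OVIS (across various settings).
With only 1% of the labeled training frames, MinVIS stays competitive (e.g., 59.0 AP on YouTube-VIS 2019, 52.9 on YouTube-VIS 2021, and 31.7–39.4 on OVIS depending on the backbone), showing strong robustness to reduced annotation.
On YouTube-VIS 2019/2021, MinVIS with Swin-L outperforms or matches state-of-the-art per-clip methods, and on OVIS it achieves a large improvement (e.g., 39.4 AP with Swin-L versus 25.8–28.9 for the baselines).
Query-based tracking without heuristics produces strong temporal association: in ablations, query matching alone performs better than or on par with combining it with heuristics on the tested datasets.
Inference is online and frame-by-frame, so the whole video never has to be loaded at once, which reduces memory requirements; treating training frames as independent images also reduces labeling cost.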
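
To put the annotation sub-sampling numbers in perspective, the short calculation below expresses each MinVIS Swin-L row of the table above as a fraction of the fully supervised AP. The AP values are copied from the table; the helper name `ap_retention` and the dictionary layout are illustrative assumptions, not part of the paper.

```python
# AP values copied from the MinVIS Swin-L rows of the table above.
AP_BY_LABEL_FRACTION = {"Full": 61.6, "10%": 61.0, "5%": 59.3, "1%": 59.0}


def ap_retention(ap_by_fraction, reference="Full"):
    """Return each setting's AP as a fraction of the reference setting's AP."""
    ref = ap_by_fraction[reference]
    return {name: ap / ref for name, ap in ap_by_fraction.items()}


for name, ratio in ap_retention(AP_BY_LABEL_FRACTION).items():
    print(f"{name:>4}: {ratio:.1%} of full-supervision AP")
```

By this measure, training on 1% of the labeled frames keeps roughly 96% of the fully supervised AP (59.0 / 61.6).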


This review was generated by AI and checked by a human editor.