QUICK REVIEW

[Paper Review] Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction

Jing Zhang, Jianwen Xie|arXiv (Cornell University)|Dec 27, 2021

Visual Attention and Saliency Detection | Cited by 45

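One-line summary

This paper proposes saliency prediction with a generative vision transformer whose latent variables follow an informative energy-based prior. The transformer and the prior are trained jointly by MCMC-based maximum likelihood estimation with Langevin dynamics, and the resulting model can produce pixel-wise uncertainty maps.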

ABSTRACT

Vision transformer networks have shown superiority in many computer vision tasks. In this paper, we take a step further by proposing a novel generative vision transformer with latent variables following an informative energy-based prior for salient object detection. Both the vision transformer network and the energy-based prior model are jointly trained via Markov chain Monte Carlo-based maximum likelihood estimation, in which the sampling from the intractable posterior and prior distributions of the latent variables are performed by Langevin dynamics. Further, with the generative vision transformer, we can easily obtain a pixel-wise uncertainty map from an image, which indicates the model confidence in predicting saliency from the image. Different from the existing generative models which define the prior distribution of the latent variables as a simple isotropic Gaussian distribution, our model uses an energy-based informative prior which can be more expressive to capture the latent space of the data. We apply the proposed framework to both RGB and RGB-D salient object detection tasks. Extensive experimental results show that our framework can achieve not only accurate saliency predictions but also meaningful uncertainty maps that are consistent with the human perception.

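Motivation and objectives

Frames saliency prediction as a stochastic task that should capture human-like uncertainty.
Introduces a generative vision transformer with latent variables and an energy-based prior to model saliency given an image.
Jointly trains the transformer and the energy-based prior with MCMC-based maximum likelihood estimation.
Shows that, across RGB and RGB-D data, the framework yields accurate saliency predictions and meaningful pixel-wise uncertainty maps.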

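Proposed method

Formulates a conditional generative model that produces saliency s from an image I and a latent variable z: s = T_theta(I, z) + epsilon, where z is drawn from the energy-based prior p_alpha(z).
Defines the prior as an energy-based model, p_alpha(z) ∝ exp(-U_alpha(z)) p0(z), where U_alpha is an MLP-based energy function and p0(z) is a Gaussian reference distribution; this allows an expressive latent space.
Trains by maximum likelihood. To approximate the intractable expectations, samples are drawn from p_alpha(z) and p_beta(z|s,I) with Langevin dynamics.
The generator uses a Swin Transformer backbone and a feature-aggregation decoder that combines multi-scale features with the latent variable z to produce the saliency map T_theta(I,z).
Learns the energy-prior parameters alpha and the transformer parameters theta with gradient updates derived from the KL-based likelihood objective, using Langevin samples z^- from the prior and z^+ from the posterior (Eqs. 6–9).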
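
The learning signal described above can be written out in the standard form used for latent-space energy-based priors. This is a sketch under stated assumptions, not a transcription of the paper's equations: it assumes Gaussian observation noise with standard deviation \sigma_\epsilon, and reads \beta as the pair (\theta, \alpha). In practice the two expectations would be approximated with the Langevin samples z^- (prior) and z^+ (posterior).

```latex
\begin{aligned}
s &= T_\theta(I, z) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_\epsilon^2 \mathbf{I}), \\
p_\alpha(z) &= \frac{1}{Z(\alpha)} \exp\{-U_\alpha(z)\}\, p_0(z), \\
\nabla_\alpha \log p_\beta(s \mid I)
  &= \mathbb{E}_{p_\alpha(z)}\big[\nabla_\alpha U_\alpha(z)\big]
   - \mathbb{E}_{p_\beta(z \mid s, I)}\big[\nabla_\alpha U_\alpha(z)\big], \\
\nabla_\theta \log p_\beta(s \mid I)
  &= \mathbb{E}_{p_\beta(z \mid s, I)}\!\left[
     \frac{1}{\sigma_\epsilon^2}\,
     \big(\nabla_\theta T_\theta(I, z)\big)^{\top} \big(s - T_\theta(I, z)\big)\right].
\end{aligned}
```

The sign pattern in the third line follows from the minus sign in the exponent of the prior: raising the likelihood lowers the energy on posterior samples and raises it on prior samples.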
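
To make the Langevin sampling step concrete, here is a minimal standard-library sketch. Everything in it is a toy assumption rather than the paper's setup: a quadratic energy stands in for the MLP U_alpha, a linear map stands in for the Swin-based generator T_theta, and the names `ALPHA`, `CENTER`, `W`, `SIGMA_EPS`, `delta`, and `n_steps` are hypothetical. It only illustrates how z^- and z^+ could be drawn with the same update rule applied to two different energies.

```python
import math
import random

# Toy stand-ins, all hypothetical: a quadratic energy replaces the MLP U_alpha,
# and a linear map replaces the Swin-based generator T_theta.
ALPHA, CENTER = 3.0, 2.0      # parameters of the toy energy U_alpha
W = [1.0, -1.0]               # weights of the toy generator
SIGMA_EPS = 0.5               # assumed std of the observation noise epsilon


def generator(image_feat, z):
    """Toy T_theta(I, z): a scalar 'saliency' from a scalar image feature."""
    return image_feat + sum(w * zi for w, zi in zip(W, z))


def grad_prior_energy(z):
    """Gradient of U_alpha(z) + ||z||^2 / 2, assuming p0 = N(0, I)."""
    return [ALPHA * (zi - CENTER) + zi for zi in z]


def grad_posterior_energy(z, image_feat, s):
    """Prior energy gradient plus the gradient of the reconstruction term."""
    resid = s - generator(image_feat, z)
    return [g - resid * w / SIGMA_EPS ** 2
            for g, w in zip(grad_prior_energy(z), W)]


def langevin(grad_energy, z0, delta=0.05, n_steps=60, rng=random):
    """Unadjusted Langevin: z <- z - delta * grad E(z) + sqrt(2 * delta) * noise."""
    z = list(z0)
    scale = math.sqrt(2.0 * delta)
    for _ in range(n_steps):
        grad = grad_energy(z)
        z = [zi - delta * gi + scale * rng.gauss(0.0, 1.0)
             for zi, gi in zip(z, grad)]
    return z


def sample_prior(rng):
    """z^-: a Langevin sample from the toy prior, started from p0 noise."""
    z0 = [rng.gauss(0.0, 1.0) for _ in W]
    return langevin(grad_prior_energy, z0, rng=rng)


def sample_posterior(image_feat, s, rng):
    """z^+: a Langevin sample from the toy posterior p(z | s, I)."""
    z0 = [rng.gauss(0.0, 1.0) for _ in W]
    return langevin(lambda z: grad_posterior_energy(z, image_feat, s), z0, rng=rng)
```

Calling `sample_prior(random.Random(0))` returns one z^-, and `sample_posterior(0.0, 2.0, random.Random(0))` returns one z^+ for a toy observation s = 2.0. Because the posterior energy adds a reconstruction term, decoded posterior samples should sit closer to s than decoded prior samples.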

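Experimental results

Research questions

RQ1: Can a vision transformer with an energy-based latent prior model realistic, stochastic saliency maps and the corresponding uncertainty estimates?
RQ2: Does jointly training an informative energy-based latent space with the generator improve RGB and RGB-D salient object detection over discriminative baselines and generative baselines with Gaussian priors?
RQ3: How well does the proposed framework perform on standard RGB and RGB-D benchmarks in terms of saliency accuracy and uncertainty visualization?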

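Key findings

The proposed framework produces accurate saliency predictions on RGB and RGB-D benchmarks and provides meaningful pixel-wise uncertainty maps that are consistent with human perception.
Using an energy-based latent prior makes the latent space more expressive than a simple Gaussian prior, which helps model saliency conditioned on the image.
With the generative latent-variable framework, the Swin Transformer backbone outperforms several state-of-the-art RGB and RGB-D saliency models on multiple benchmarks.
Training relies on likelihood-based MCMC and needs no additional approximate inference network, so it avoids the mode collapse typical of GANs and the posterior collapse common in VAEs.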
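
This review does not say how the pixel-wise uncertainty map is computed. One common choice, assumed in this sketch, is to decode several latent samples for the same image and take the per-pixel variance across the resulting saliency maps. The function name `uncertainty_map` and the nested-list map format are hypothetical.

```python
def uncertainty_map(samples):
    """Return (mean_map, variance_map) over a list of equally sized 2-D saliency maps."""
    n = len(samples)
    height, width = len(samples[0]), len(samples[0][0])
    # Per-pixel mean across samples: the averaged saliency prediction.
    mean = [[sum(s[r][c] for s in samples) / n for c in range(width)]
            for r in range(height)]
    # Per-pixel variance across samples: high where the samples disagree.
    var = [[sum((s[r][c] - mean[r][c]) ** 2 for s in samples) / n
            for c in range(width)]
           for r in range(height)]
    return mean, var


# Two hypothetical 1x2 predictions that agree on the first pixel only.
samples = [[[1.0, 0.0]], [[1.0, 1.0]]]
mean_map, var_map = uncertainty_map(samples)
```

In this example the first pixel gets zero variance because both samples agree, while the second pixel, where they disagree, gets a nonzero variance.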

This review was generated by AI and checked by a human editor.