QUICK REVIEW

[Paper Review] HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning

Shiming Chen, Guo-Sen Xie|arXiv (Cornell University)|Sep 30, 2021

Domain Adaptation and Few-Shot Learning | References: 57 | Citations: 84

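One-Line Summary

HSVA applies a hierarchical two-step adaptation (structure, then distribution) using two partially aligned VAEs over visual and semantic features to learn an intrinsic common space, improving both ZSL and GZSL performance. It explicitly addresses the structure variation and the distribution mismatch between the two heterogeneous modalities.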

ABSTRACT

Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a common (latent) space is adopted for associating the visual and semantic domains in ZSL. However, existing common space learning methods align the semantic and visual domains by merely mitigating distribution disagreement through one-step adaptation. This strategy is usually ineffective due to the heterogeneous nature of the feature representations in the two domains, which intrinsically contain both distribution and structure variations. To address this and advance ZSL, we propose a novel hierarchical semantic-visual adaptation (HSVA) framework. Specifically, HSVA aligns the semantic and visual domains by adopting a hierarchical two-step adaptation, i.e., structure adaptation and distribution adaptation. In the structure adaptation step, we take two task-specific encoders to encode the source data (visual domain) and the target data (semantic domain) into a structure-aligned common space. To this end, a supervised adversarial discrepancy (SAD) module is proposed to adversarially minimize the discrepancy between the predictions of two task-specific classifiers, thus making the visual and semantic feature manifolds more closely aligned. In the distribution adaptation step, we directly minimize the Wasserstein distance between the latent multivariate Gaussian distributions to align the visual and semantic distributions using a common encoder. Finally, the structure and distribution adaptation are derived in a unified framework under two partially-aligned variational autoencoders. Extensive experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL. The code is available at https://github.com/shiming-chen/HSVA.

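Motivation and Objectives

Motivate robust knowledge transfer between seen and unseen classes that goes beyond one-step distribution alignment.
Address the heterogeneity of visual and semantic features by handling structure and distribution variations jointly.
Propose a unified two-step framework that learns a discriminative, intrinsic common space for multimodal data.
Demonstrate superior performance on CZSL and GZSL benchmarks across diverse datasets.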

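Proposed Method

Proposes hierarchical semantic-visual adaptation (HSVA), built on two partially aligned variational autoencoders.
Structure adaptation (SA): aligns the visual and semantic manifolds using two task-specific encoders and a supervised adversarial discrepancy (SAD) module.
Distribution adaptation (DA): minimizes the Wasserstein distance between the latent Gaussian distributions using a common encoder.
Cross-reconstruction and VAE-based losses keep the visual and semantic modalities consistent.
Optimization combines the VAE loss, cross-reconstruction, supervised classification, SAD, an SWD-based discrepancy, and iCORAL as a countermeasure against seen/unseen class bias.
Classification uses the reparameterized encoder representations in the learned distribution-aligned common space.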
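
The discrepancy terms listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (their code is at the repository linked in the abstract). All function names are hypothetical, and it assumes that SWD stands for a sliced Wasserstein discrepancy computed over random 1-D projections and that the latent Gaussians have diagonal covariances parameterized by log-variances.

```python
import numpy as np

def sliced_wasserstein_discrepancy(p1, p2, n_proj=128, seed=0):
    """Approximate discrepancy between two prediction batches of shape (N, C).

    Assumes both batches have the same number of rows N.
    """
    rng = np.random.default_rng(seed)
    # Random unit directions in class-probability space, one per column.
    directions = rng.normal(size=(p1.shape[1], n_proj))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    # Project, sort each 1-D projection over the batch, compare quantiles.
    proj1 = np.sort(p1 @ directions, axis=0)
    proj2 = np.sort(p2 @ directions, axis=0)
    return float(np.mean((proj1 - proj2) ** 2))

def gaussian_w2_squared(mu1, logvar1, mu2, logvar2):
    """Squared 2-Wasserstein distance between diagonal Gaussians, batch mean.

    For diagonal covariances the closed form is
    ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2.
    """
    sigma1 = np.exp(0.5 * logvar1)
    sigma2 = np.exp(0.5 * logvar2)
    per_sample = np.sum((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2, axis=1)
    return float(np.mean(per_sample))

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# Toy check: unit-variance Gaussians whose means differ by (3, 4).
zeros = np.zeros((1, 2))
print(gaussian_w2_squared(zeros, zeros, np.array([[3.0, 4.0]]), zeros))  # 25.0
```

In a training loop these quantities would be computed on framework tensors so that gradients can flow. The NumPy version only shows the arithmetic.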

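Experimental Results

Research Questions

RQ1: Does hierarchical two-step adaptation (structure → distribution) align the visual and semantic domains better than one-step approaches in ZSL?
RQ2: Does incorporating SA improve discriminability and reduce the manifold mismatch between modalities?
RQ3: How does distribution adaptation with a common encoder and the Wasserstein distance affect the seen/unseen class bias in GZSL?
RQ4: How do the SA and DA components affect CZSL and GZSL performance on standard benchmarks?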

Key Findings

Dataset	U (unseen)	S (seen)	H (harmonic mean)
AWA1	59.3	76.6	66.8
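
The H column can be checked against U and S. The short sketch below assumes the standard GZSL definition H = 2·U·S / (U + S), and the helper name `harmonic_mean` is hypothetical.

```python
def harmonic_mean(u, s):
    """Harmonic mean of unseen (U) and seen (S) accuracy; 0.0 if both are zero."""
    return 0.0 if u + s == 0 else 2.0 * u * s / (u + s)

# AWA1 row from the table: U = 59.3, S = 76.6
print(round(harmonic_mean(59.3, 76.6), 1))  # 66.8
```

The result agrees with the reported H of 66.8.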

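In CZSL, HSVA consistently improves over existing common-space methods on AWA1, CUB, and SUN.
In GZSL, HSVA achieves a higher harmonic mean than prior common-space methods on all four benchmarks, with a particularly notable gain on SUN.
Ablations show that both SA and DA matter, with DA bringing especially large gains on coarse-grained datasets.
iCORAL helps push the encodings of unseen classes away from the seen-class region, addressing the seen-unseen bias.
Qualitative t-SNE visualizations suggest that HSVA learns a more discriminative, intrinsic common space than one-step methods such as CADA-VAE.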


This review was generated by AI and checked by a human editor.