QUICK REVIEW

[Paper Review] HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning

Shiming Chen, Guo-Sen Xie|arXiv (Cornell University)|2021. 09. 30.

Domain Adaptation and Few-Shot Learning|References 57|Citations 84

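One-Line Summary

HSVA introduces a hierarchical two-step adaptation (structure, then distribution) that uses two partially-aligned VAEs to learn an intrinsic common space for visual and semantic features, improving ZSL and GZSL performance. It explicitly addresses both the structure variation and the distribution mismatch between the two heterogeneous modalities.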

ABSTRACT

Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a common (latent) space is adopted for associating the visual and semantic domains in ZSL. However, existing common space learning methods align the semantic and visual domains by merely mitigating distribution disagreement through one-step adaptation. This strategy is usually ineffective due to the heterogeneous nature of the feature representations in the two domains, which intrinsically contain both distribution and structure variations. To address this and advance ZSL, we propose a novel hierarchical semantic-visual adaptation (HSVA) framework. Specifically, HSVA aligns the semantic and visual domains by adopting a hierarchical two-step adaptation, i.e., structure adaptation and distribution adaptation. In the structure adaptation step, we take two task-specific encoders to encode the source data (visual domain) and the target data (semantic domain) into a structure-aligned common space. To this end, a supervised adversarial discrepancy (SAD) module is proposed to adversarially minimize the discrepancy between the predictions of two task-specific classifiers, thus making the visual and semantic feature manifolds more closely aligned. In the distribution adaptation step, we directly minimize the Wasserstein distance between the latent multivariate Gaussian distributions to align the visual and semantic distributions using a common encoder. Finally, the structure and distribution adaptation are derived in a unified framework under two partially-aligned variational autoencoders. Extensive experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL. The code is available at https://github.com/shiming-chen/HSVA.

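Research Motivation and Goals

Motivate robust knowledge transfer from seen classes to unseen classes, going beyond one-step distribution alignment.
Address the heterogeneity of visual and semantic features by handling structure and distribution variations together.
Propose a unified two-step framework that learns a discriminative, intrinsic common space for multi-modal data.
Demonstrate superior performance on CZSL and GZSL benchmarks across multiple datasets.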

Proposed Method

Hierarchical semantic-visual adaptation (HSVA), built from two partially-aligned variational autoencoders.
Structure adaptation (SA) via two task-specific encoders and supervised adversarial discrepancy to align manifolds.
Distribution adaptation (DA) by minimizing Wasserstein distance between latent Gaussian distributions with a common encoder.
Cross-reconstruction and VAE-based losses to maintain consistency across visual and semantic modalities.
Optimization combines VAE losses, cross-reconstruction, supervised classification, SAD, SWD-based discrepancies, and iCORAL for seen/unseen bias.
Classification in the learned distribution-aligned common space using reparameterized encodings.
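
The distribution adaptation and reparameterized-encoding steps listed above can be illustrated with a minimal sketch. It assumes each encoder outputs a diagonal Gaussian (a mean vector and a standard-deviation vector), in which case the squared 2-Wasserstein distance has a simple closed form. The function names and the toy latent statistics are hypothetical and are not taken from the HSVA code base; the paper's actual loss may differ in weighting and detail.

```python
import math
import random


def gaussian_w2_squared(mu_a, sigma_a, mu_b, sigma_b):
    """Squared 2-Wasserstein distance between two diagonal Gaussians.

    For N(mu_a, diag(sigma_a^2)) and N(mu_b, diag(sigma_b^2)) the closed
    form is ||mu_a - mu_b||^2 + ||sigma_a - sigma_b||^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_a, sigma_b))
    return mean_term + std_term


def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) in each dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Hypothetical 2-D latent statistics for one visual and one semantic input.
visual_mu, visual_sigma = [0.0, 1.0], [1.0, 0.5]
semantic_mu, semantic_sigma = [0.5, 1.0], [1.0, 1.0]

# Mean term: 0.5^2 + 0 = 0.25; std term: 0 + 0.5^2 = 0.25; total 0.5.
w2_sq = gaussian_w2_squared(visual_mu, visual_sigma, semantic_mu, semantic_sigma)
w2 = math.sqrt(w2_sq)

# A reparameterized encoding that a downstream classifier could consume.
z = reparameterize(visual_mu, visual_sigma, random.Random(0))
```

Minimizing `w2_sq` over paired visual and semantic inputs pulls the two latent distributions together, while `reparameterize` keeps the sampling step differentiable with respect to the mean and standard deviation when it is written in an autodiff framework.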

Experimental Results

Research Questions

RQ1: Can a hierarchical two-step adaptation (structure then distribution) better align visual and semantic domains than one-step approaches in ZSL?
RQ2: Does incorporating structure adaptation improve discriminativeness and reduce manifold misalignment between modalities?
RQ3: How does distribution adaptation with a common encoder and Wasserstein distance affect seen/unseen bias in GZSL?
RQ4: What is the impact of SA and DA components on CZSL and GZSL performance across standard benchmarks?

Key Results

Dataset	U (Unseen)	S (Seen)	H (Harmonic Mean)
AWA1	59.3	76.6	66.8
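
As a quick check on the row above, H is the harmonic mean of the unseen and seen accuracies, H = 2 * U * S / (U + S). The short sketch below recomputes it from the reported AWA1 values; the function name is illustrative only.

```python
def harmonic_mean(unseen_acc, seen_acc):
    """Harmonic mean of unseen and seen accuracy, as used in GZSL."""
    total = unseen_acc + seen_acc
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2.0 * unseen_acc * seen_acc / total


# Reported AWA1 values from the table: U = 59.3, S = 76.6.
h = harmonic_mean(59.3, 76.6)
print(round(h, 1))  # prints 66.8
```

Because the harmonic mean is dominated by the smaller of the two values, a model that scores well on seen classes but poorly on unseen ones is penalized, which is why H is the headline GZSL metric.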

HSVA achieves consistent improvements over existing common-space methods on CZSL across AWA1, CUB, and SUN datasets.
In GZSL, HSVA attains higher harmonic means than prior common-space methods on all four benchmarks, with notable gains on SUN.
Ablation shows SA and DA are both essential, with DA contributing large gains especially on coarser datasets.
iCORAL helps push unseen-class encodings away from seen-class regions, addressing seen-unseen bias.
Qualitative visualizations (t-SNE) indicate HSVA learns a more discriminative, intrinsic common space compared to one-step methods like CADA-VAE.


This review was generated by AI and reviewed by a human editor.