QUICK REVIEW

[Paper Review] Depthwise Convolution is All You Need for Learning Multiple Visual Domains

Yunhui Guo, Yandong Li | arXiv (Cornell University) | Feb 3, 2019
Domain Adaptation and Few-Shot Learning | 36 references | 32 citations
TL;DR

The paper proposes a multi-domain learning model using depthwise separable convolution with a shared pointwise branch and a domain-specific depthwise branch, achieving state-of-the-art Visual Decathlon results with about half the parameters of prior methods.

ABSTRACT

There is a growing interest in designing models that can deal with images from different visual domains. If there exists a universal structure in different visual domains that can be captured via a common parameterization, then we can use a single model for all domains rather than one model per domain. A model aware of the relationships between different domains can also be trained to work on new domains with less resources. However, to identify the reusable structure in a model is not easy. In this paper, we propose a multi-domain learning architecture based on depthwise separable convolution. The proposed approach is based on the assumption that images from different domains share cross-channel correlations but have domain-specific spatial correlations. The proposed model is compact and has minimal overhead when being applied to new domains. Additionally, we introduce a gating mechanism to promote soft sharing between different domains. We evaluate our approach on Visual Decathlon Challenge, a benchmark for testing the ability of multi-domain models. The experiments show that our approach can achieve the highest score while only requiring 50% of the parameters compared with the state-of-the-art approaches.

Motivation & Objective

  • Identify reusable structures across visual domains to enable a single model for multiple domains.
  • Propose a depthwise separable convolution-based architecture that separates cross-channel and spatial correlations.
  • Enable efficient learning of new domains with minimal additional parameters via shared components and gating mechanisms.
  • Investigate interpretability of learned features across depthwise and pointwise convolutions.
  • Evaluate performance on the Visual Decathlon Challenge and compare against strong baselines.

Proposed method

  • Replace standard 3x3 convolutions in a ResNet-26 backbone with depthwise separable convolutions (depthwise 3x3 followed by 1x1 pointwise) to reduce parameters.
  • Share the pointwise convolution across domains to model cross-channel correlations.
  • Maintain domain-specific depthwise filters and domain-specific batchnorm parameters for new domains.
  • Stack depthwise filters for all domains during inference to compute domain-specific outputs.
  • Introduce a soft-sharing gate for depthwise filters to softly combine domain-specific spatial correlations across layers.
  • Initialize from an ImageNet-pretrained model and, for each new domain, add a domain-specific output head while fine-tuning the depthwise filters.
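
The layer structure described in the list above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the class name MultiDomainSeparableConv, the domain labels "a" and "b", and the toy kernels are hypothetical, and batch normalization, the gate, and the residual connections are left out. It only shows one shared pointwise weight matrix being reused with a different set of depthwise kernels per domain.

```python
from typing import Dict, List

Plane = List[List[float]]   # one channel: [row][col]
Image = List[Plane]         # [channel][row][col]


def depthwise_conv3x3(x: Image, kernels: List[Plane]) -> Image:
    """Apply one 3x3 kernel per input channel (zero padding, stride 1)."""
    out = []
    for channel, kernel in zip(x, kernels):
        h, w = len(channel), len(channel[0])
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += channel[ii][jj] * kernel[di + 1][dj + 1]
                plane[i][j] = acc
        out.append(plane)
    return out


def pointwise_conv1x1(x: Image, weights: List[List[float]]) -> Image:
    """Mix channels at every pixel; weights[o][c] maps input c to output o."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(row[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


class MultiDomainSeparableConv:
    """Hypothetical layer: shared pointwise weights, per-domain depthwise kernels."""

    def __init__(self, pointwise: List[List[float]],
                 depthwise_by_domain: Dict[str, List[Plane]]) -> None:
        self.pointwise = pointwise                      # shared across domains
        self.depthwise_by_domain = depthwise_by_domain  # one kernel set per domain

    def forward(self, x: Image, domain: str) -> Image:
        spatial = depthwise_conv3x3(x, self.depthwise_by_domain[domain])
        return pointwise_conv1x1(spatial, self.pointwise)


# Toy usage: two input channels, one output channel, two made-up domains.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
doubling = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
layer = MultiDomainSeparableConv(
    pointwise=[[1.0, 1.0]],
    depthwise_by_domain={"a": [identity, identity], "b": [doubling, doubling]},
)
x = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
print(layer.forward(x, "a"))  # [[[11.0, 22.0], [33.0, 44.0]]]
print(layer.forward(x, "b"))  # [[[22.0, 44.0], [66.0, 88.0]]]
```

In this sketch, adding a domain means adding one more entry to depthwise_by_domain; the pointwise weights are left untouched.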

Experimental results

Research questions

  • RQ1: Can a single neural network capture universal cross-domain structure while allowing domain-specific spatial patterns?
  • RQ2: Does sharing pointwise (cross-channel) filters across domains yield better parameter efficiency and performance than sharing depthwise filters?
  • RQ3: How does a soft-sharing mechanism for depthwise filters affect performance across domains?
  • RQ4: What is the interpretability of features learned by depthwise versus pointwise convolutions in a multi-domain setting?
  • RQ5: How does the proposed approach perform on the Visual Decathlon Challenge compared to state-of-the-art baselines?

Key findings

  • The proposed depthwise separable architecture achieves the highest Visual Decathlon score among the compared methods while using only about half the parameters of the baselines.
  • Replacing standard convolutions with depthwise separable convolutions in ResNet-26 substantially improves ImageNet performance (63.99 vs. 60.32).
  • Sharing pointwise (cross-channel) filters across domains performs as well as or better than sharing depthwise filters, and is also more parameter-efficient.
  • Domain-specific depthwise filters plus shared pointwise filters enable effective adaptation to new domains with modest parameter overhead (approximately 0.3M per new domain in an extended setup).
  • Soft sharing of depthwise filters provides marginal gains on some domains but does not outperform the base approach overall; some gains observed when sharing early or late layers.
  • Network dissection reveals that depthwise convolutions capture higher-level concepts and more attributes than pointwise convolutions, which suggests that cross-domain sharing is more effective at the channel level than at the level of spatial filtering.
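
The parameter-efficiency findings above follow from simple counting, sketched below. The function names and the channel width of 256 are illustrative assumptions, not figures from the paper, and biases and batchnorm parameters are ignored. For a k×k kernel, a standard convolution has k·k·C_in·C_out weights, while a depthwise separable one has k·k·C_in depthwise weights plus C_in·C_out pointwise weights. If the pointwise weights are shared, only the depthwise part is added for each new domain.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Depthwise k x k weights plus pointwise 1 x 1 weights (no bias)."""
    return k * k * c_in + c_in * c_out


def per_domain_params(c_in: int, k: int = 3) -> int:
    """Extra weights per new domain when only depthwise filters are domain-specific."""
    return k * k * c_in


# Illustrative layer width (an assumption, not taken from the paper).
c = 256
print(standard_conv_params(c, c))   # 589824
print(separable_conv_params(c, c))  # 67840
print(per_domain_params(c))         # 2304
```

For this illustrative layer, the domain-specific part is about 3% of the separable layer's weights, which is why adding a domain is cheap when the pointwise filters are shared.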

This review was created by AI and reviewed by human editors.