QUICK REVIEW

[Paper Review] Depthwise Convolution is All You Need for Learning Multiple Visual Domains

Yunhui Guo, Yandong Li|arXiv (Cornell University)|Feb 3, 2019

Domain Adaptation and Few-Shot Learning|36 references|32 citations

TL;DR

The paper proposes a multi-domain learning model using depthwise separable convolution with a shared pointwise branch and a domain-specific depthwise branch, achieving state-of-the-art Visual Decathlon results with about half the parameters of prior methods.

ABSTRACT

There is a growing interest in designing models that can deal with images from different visual domains. If there exists a universal structure in different visual domains that can be captured via a common parameterization, then we can use a single model for all domains rather than one model per domain. A model aware of the relationships between different domains can also be trained to work on new domains with less resources. However, to identify the reusable structure in a model is not easy. In this paper, we propose a multi-domain learning architecture based on depthwise separable convolution. The proposed approach is based on the assumption that images from different domains share cross-channel correlations but have domain-specific spatial correlations. The proposed model is compact and has minimal overhead when being applied to new domains. Additionally, we introduce a gating mechanism to promote soft sharing between different domains. We evaluate our approach on Visual Decathlon Challenge, a benchmark for testing the ability of multi-domain models. The experiments show that our approach can achieve the highest score while only requiring 50% of the parameters compared with the state-of-the-art approaches.

Motivation & Objective

Identify reusable structures across visual domains to enable a single model for multiple domains.
Propose a depthwise separable convolution-based architecture that separates cross-channel and spatial correlations.
Enable efficient learning of new domains with minimal additional parameters via shared components and gating mechanisms.
Investigate interpretability of learned features across depthwise and pointwise convolutions.
Evaluate performance on the Visual Decathlon Challenge and compare against strong baselines.

Proposed method

Replace standard 3x3 convolutions in a ResNet-26 backbone with depthwise separable convolutions (depthwise 3x3 followed by 1x1 pointwise) to reduce parameters.
Share the pointwise convolution across domains to model cross-channel correlations.
Maintain domain-specific depthwise filters and domain-specific batchnorm parameters for new domains.
Stack depthwise filters for all domains during inference to compute domain-specific outputs.
Introduce a soft-sharing gate for depthwise filters to softly combine domain-specific spatial correlations across layers.
Initialize from a model trained on ImageNet, then add a domain-specific output head for each new domain while fine-tuning the depthwise filters.
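
The steps above can be illustrated with a minimal NumPy sketch of one layer. This is not the authors' implementation: the class name `MultiDomainSeparableConv`, the dictionary of per-domain filters, and the random initialization are all assumptions made for illustration, and batch normalization and nonlinearities are omitted. The optional `gate` argument mixes domain filters by a weighted sum, which is only one hypothetical form of soft sharing; the paper's actual gating mechanism may differ.

```python
import numpy as np


def depthwise_conv3x3(x, filters):
    """Filter each channel independently with its own 3x3 kernel.

    x: (C, H, W), filters: (C, 3, 3). Zero padding, stride 1.
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            tap = filters[:, dy, dx][:, None, None]
            out += tap * padded[:, dy:dy + h, dx:dx + w]
    return out


def pointwise_conv(x, weights):
    """1x1 convolution that mixes channels. x: (C_in, H, W), weights: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weights, x)


class MultiDomainSeparableConv:
    """Hypothetical layer: shared pointwise weights, per-domain depthwise filters."""

    def __init__(self, channels_in, channels_out, domains, seed=0):
        rng = np.random.default_rng(seed)
        # Shared across all domains (cross-channel correlations).
        self.pointwise = rng.normal(size=(channels_out, channels_in))
        # One set of spatial filters per domain (spatial correlations).
        self.depthwise = {
            d: rng.normal(size=(channels_in, 3, 3)) for d in domains
        }

    def forward(self, x, domain, gate=None):
        if gate is None:
            filters = self.depthwise[domain]
        else:
            # Assumed soft sharing: weighted sum of the domains' filters.
            filters = sum(wt * self.depthwise[d] for d, wt in gate.items())
        return pointwise_conv(depthwise_conv3x3(x, filters), self.pointwise)


layer = MultiDomainSeparableConv(4, 8, ["imagenet", "svhn"])
x = np.ones((4, 5, 5))
y_imagenet = layer.forward(x, "imagenet")
y_svhn = layer.forward(x, "svhn")
```

In this sketch, adding a domain only adds one entry to `depthwise`; the `pointwise` array is reused unchanged, which mirrors the sharing scheme described in the bullets.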

Experimental results

Research questions

RQ1: Can a single neural network capture universal cross-domain structure while allowing domain-specific spatial patterns?
RQ2: Does sharing pointwise (cross-channel) filters across domains yield better parameter efficiency and performance than sharing depthwise filters?
RQ3: How does a soft-sharing mechanism for depthwise filters affect performance across domains?
RQ4: What is the interpretability of features learned by depthwise versus pointwise convolutions in a multi-domain setting?
RQ5: How does the proposed approach perform on the Visual Decathlon Challenge compared to state-of-the-art baselines?

Key findings

The proposed depthwise separable architecture achieves the highest Visual Decathlon score among tested methods while using only about half the parameters of baselines.
Replacing standard convolutions with depthwise separable convolutions in ResNet-26 substantially improves ImageNet performance (63.99 vs 60.32 with standard convolutions).
Sharing pointwise filters (cross-channel) across domains yields competitive or superior performance to sharing depthwise filters, with overall gains and parameter efficiency.
Domain-specific depthwise filters plus shared pointwise filters enable effective adaptation to new domains with modest parameter overhead (approximately 0.3M per new domain in an extended setup).
Soft sharing of depthwise filters provides marginal gains on some domains but does not outperform the base approach overall; some gains observed when sharing early or late layers.
Network dissection reveals depthwise convolutions capture higher-level concepts and more attributes than pointwise convolutions, indicating cross-domain sharing is more effective at the channel level than spatial filtering.
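
To see why the per-domain overhead can stay small, a rough single-layer parameter count helps. The helper names and the 256-channel example below are assumptions chosen for illustration, and biases and batch-norm parameters are ignored. The numbers do not reproduce the paper's 0.3M or 50% figures, which depend on the full network and on the baselines being compared.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k plus pointwise 1x1 convolution (no bias)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def per_domain_overhead(c_in, k=3):
    """Extra weights for a new domain when only depthwise filters are added."""
    return k * k * c_in


# Example with an assumed layer width of 256 channels in and out.
standard = standard_conv_params(256, 256)    # 589824
separable = separable_conv_params(256, 256)  # 67840
new_domain = per_domain_overhead(256)        # 2304
```

In this example the shared pointwise weights account for 65,536 of the 67,840 separable-layer weights, so a new domain that brings only its own depthwise filters adds about 3% on top of the layer.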

This review was created by AI and reviewed by human editors.