[Paper Review] Depthwise Convolution is All You Need for Learning Multiple Visual Domains
The paper proposes a multi-domain learning model using depthwise separable convolution with a shared pointwise branch and a domain-specific depthwise branch, achieving state-of-the-art Visual Decathlon results with about half the parameters of prior methods.
There is a growing interest in designing models that can deal with images from different visual domains. If there exists a universal structure in different visual domains that can be captured via a common parameterization, then we can use a single model for all domains rather than one model per domain. A model aware of the relationships between different domains can also be trained to work on new domains with fewer resources. However, identifying the reusable structure in a model is not easy. In this paper, we propose a multi-domain learning architecture based on depthwise separable convolution. The proposed approach is based on the assumption that images from different domains share cross-channel correlations but have domain-specific spatial correlations. The proposed model is compact and has minimal overhead when being applied to new domains. Additionally, we introduce a gating mechanism to promote soft sharing between different domains. We evaluate our approach on the Visual Decathlon Challenge, a benchmark for testing the ability of multi-domain models. The experiments show that our approach can achieve the highest score while only requiring 50% of the parameters compared with the state-of-the-art approaches.
Motivation & Objective
- Identify reusable structures across visual domains to enable a single model for multiple domains.
- Propose a depthwise separable convolution-based architecture that separates cross-channel and spatial correlations.
- Enable efficient learning of new domains with minimal additional parameters via shared components and gating mechanisms.
- Investigate interpretability of learned features across depthwise and pointwise convolutions.
- Evaluate performance on the Visual Decathlon Challenge and compare against strong baselines.
Proposed method
- Replace standard 3x3 convolutions in a ResNet-26 backbone with depthwise separable convolutions (depthwise 3x3 followed by 1x1 pointwise) to reduce parameters.
- Share the pointwise convolution across domains to model cross-channel correlations.
- Maintain domain-specific depthwise filters and domain-specific batchnorm parameters for new domains.
- Stack depthwise filters for all domains during inference to compute domain-specific outputs.
- Introduce a soft-sharing gate for depthwise filters to softly combine domain-specific spatial correlations across layers.
- Initialize from ImageNet-pretrained weights and add domain-specific output heads for new domains while fine-tuning the depthwise filters.
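The split between shared and domain-specific weights described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `MultiDomainSeparableConv`, the domain labels `domain_a` and `domain_b`, and the toy kernels are all hypothetical, and batch normalization, the output heads, and the soft-sharing gate are left out.

```python
from typing import Dict, List

Kernel = List[List[float]]        # one 3x3 spatial filter
Image = List[List[List[float]]]   # indexed as [channel][row][col]


def depthwise_conv3x3(x: Image, kernels: List[Kernel]) -> Image:
    """Apply one 3x3 kernel per channel (zero padding, stride 1)."""
    out = []
    for channel, kernel in zip(x, kernels):
        h, w = len(channel), len(channel[0])
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        r, c = i + di, j + dj
                        if 0 <= r < h and 0 <= c < w:
                            acc += kernel[di + 1][dj + 1] * channel[r][c]
                result[i][j] = acc
        out.append(result)
    return out


def pointwise_conv1x1(x: Image, weights: List[List[float]]) -> Image:
    """Mix channels at every pixel; weights is indexed [out_channel][in_channel]."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


class MultiDomainSeparableConv:
    """Hypothetical block: per-domain depthwise kernels, one shared pointwise matrix."""

    def __init__(self, shared_pointwise: List[List[float]]) -> None:
        self.shared_pointwise = shared_pointwise      # shared by every domain
        self.depthwise: Dict[str, List[Kernel]] = {}  # one kernel set per domain

    def add_domain(self, name: str, kernels: List[Kernel]) -> None:
        # A new domain only contributes its own depthwise kernels.
        self.depthwise[name] = kernels

    def forward(self, x: Image, domain: str) -> Image:
        spatial = depthwise_conv3x3(x, self.depthwise[domain])
        return pointwise_conv1x1(spatial, self.shared_pointwise)


# Toy example: two input channels, one output channel that sums them.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

block = MultiDomainSeparableConv(shared_pointwise=[[1.0, 1.0]])
block.add_domain("domain_a", [identity, identity])
block.add_domain("domain_b", [box, box])

x = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
out_a = block.forward(x, "domain_a")  # [[[11.0, 22.0], [33.0, 44.0]]]
out_b = block.forward(x, "domain_b")  # [[[110.0, 110.0], [110.0, 110.0]]]
```

The same input and the same pointwise weights produce different outputs depending only on which depthwise kernel set is selected, which is the sense in which spatial filtering is domain-specific while channel mixing is shared.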
Experimental results
Research questions
- RQ1: Can a single neural network capture universal cross-domain structure while allowing domain-specific spatial patterns?
- RQ2: Does sharing pointwise (cross-channel) filters across domains yield better parameter efficiency and performance than sharing depthwise filters?
- RQ3: How does a soft-sharing mechanism for depthwise filters affect performance across domains?
- RQ4: What is the interpretability of features learned by depthwise versus pointwise convolutions in a multi-domain setting?
- RQ5: How does the proposed approach perform on the Visual Decathlon Challenge compared to state-of-the-art baselines?
Key findings
- The proposed depthwise separable architecture achieves the highest Visual Decathlon score among tested methods while using only about half the parameters of baselines.
- Replacing standard convolutions with depthwise separable convolutions in ResNet-26 substantially improves ImageNet performance (63.99 vs. 60.32).
- Sharing pointwise (cross-channel) filters across domains performs competitively with or better than sharing depthwise filters, with overall gains in both score and parameter efficiency.
- Domain-specific depthwise filters plus shared pointwise filters enable effective adaptation to new domains with modest parameter overhead (approximately 0.3M per new domain in an extended setup).
- Soft sharing of depthwise filters provides marginal gains on some domains but does not outperform the base approach overall; some gains are observed when sharing early or late layers.
- Network dissection reveals depthwise convolutions capture higher-level concepts and more attributes than pointwise convolutions, indicating cross-domain sharing is more effective at the channel level than spatial filtering.
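The parameter-efficiency findings above follow from how a separable layer divides its weights. The short calculation below counts the weights in a single layer, as a rough sketch only: the channel width of 256 is an illustrative assumption rather than a figure from the paper, biases and batch normalization parameters are ignored, and the result is not meant to reproduce the 0.3M per-domain figure.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_params(c_in: int, k: int = 3) -> int:
    """Weights in a depthwise k x k convolution (one filter per channel)."""
    return k * k * c_in


def pointwise_params(c_in: int, c_out: int) -> int:
    """Weights in a 1x1 convolution that mixes channels."""
    return c_in * c_out


# Illustrative layer width (an assumption, not taken from the paper).
c_in = c_out = 256

standard = standard_conv_params(c_in, c_out)  # 589824
dw = depthwise_params(c_in)                   # 2304, domain-specific
pw = pointwise_params(c_in, c_out)            # 65536, shared across domains
separable = dw + pw                           # 67840

# Share of the separable layer that has to be duplicated for each new domain.
per_domain_fraction = dw / separable

print(f"standard 3x3 conv: {standard} weights")
print(f"separable conv:    {separable} weights")
print(f"per-domain share:  {per_domain_fraction:.1%}")
```

At this width the depthwise filters make up about 3.4% of the separable layer, so keeping them domain-specific while sharing the pointwise weights adds little per new domain.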
This review was created by AI and reviewed by human editors.