QUICK REVIEW

[Paper Review] End-to-End Multi-Task Learning with Attention

Shikun Liu, Edward Johns, Andrew J. Davison | arXiv (Cornell University) | Mar 28, 2018

Advanced Neural Network Applications | 30 references | 20 citations

TL;DR

This paper proposes the Multi-Task Attention Network (MTAN), a parameter-efficient, end-to-end multi-task learning architecture that uses task-specific soft-attention modules to dynamically select features from a shared global feature pool. MTAN achieves state-of-the-art performance across image segmentation, depth estimation, and image classification tasks while showing robustness to loss weighting schemes and reduced parameter count compared to prior methods.

ABSTRACT

We propose a novel multi-task learning architecture, which allows learning of task-specific feature-level attention. Our design, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with a soft-attention module for each task. These modules allow for learning of task-specific features from the global features, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be trained end-to-end and can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. We evaluate our approach on a variety of datasets, across both image-to-image predictions and image classification tasks. We show that our architecture is state-of-the-art in multi-task learning compared to existing methods, and is also less sensitive to various weighting schemes in the multi-task loss function. Code is available at https://github.com/lorenmt/mtan.

Motivation & Objective

To address the dual challenges of effective feature sharing and loss balancing in multi-task learning.
To design a unified architecture that automatically learns both task-shared and task-specific features without manual intervention.
To improve parameter efficiency and scalability in multi-task networks, especially as the number of tasks increases.
To reduce sensitivity to hyperparameter tuning of loss weights, which often hinders training stability in multi-task setups.
To achieve state-of-the-art performance across diverse multi-task benchmarks, including dense prediction and image classification tasks.

Proposed method

The architecture uses a single shared backbone network to produce a global feature pool from input data.
For each task, a soft-attention module is applied at each convolutional block to reweight shared features based on task relevance.
Attention masks are differentiable and trained end-to-end, enabling automatic selection of task-specific features from the shared representation.
The method is compatible with any feed-forward neural network, such as SegNet or Wide ResNet, allowing flexible backbone integration.
A novel Dynamic Weight Average (DWA) loss weighting scheme is proposed, which adapts task weights based on the rate of change of each task's loss.
The network is trained end-to-end using standard optimization, with no need for task-specific head separation or complex regularization.
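
The attention step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it assumes each mask is produced by two 1x1 convolutions (per-pixel linear maps over channels) with a ReLU between them and a sigmoid at the end, and it leaves out batch normalization and the chaining of attended features between blocks. The names `linear`, `attention_mask`, `apply_task_attention`, and `init_params` are hypothetical, and the feature values are random.

```python
import math
import random


def linear(vec, weight, bias):
    """Per-pixel linear map over channels (what a 1x1 convolution computes)."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def attention_mask(pixel, params):
    """Soft mask with values in (0, 1) for one pixel's channel vector."""
    hidden = [max(0.0, h) for h in linear(pixel, params["w1"], params["b1"])]  # ReLU
    logits = linear(hidden, params["w2"], params["b2"])
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid


def apply_task_attention(shared, params):
    """Reweight shared features element-wise; `shared` is a list of pixels."""
    attended = []
    for pixel in shared:
        mask = attention_mask(pixel, params)
        attended.append([m * x for m, x in zip(mask, pixel)])
    return attended


def init_params(channels, rng):
    """Random weights for one task's attention module (illustrative only)."""
    def matrix():
        return [[rng.gauss(0.0, 0.1) for _ in range(channels)] for _ in range(channels)]
    return {"w1": matrix(), "b1": [0.0] * channels, "w2": matrix(), "b2": [0.0] * channels}


rng = random.Random(0)
channels, num_pixels = 4, 6

# One shared feature pool, computed once by the backbone (random stand-in here).
shared = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(num_pixels)]

# One attention module per task, all reading from the same shared features.
task_params = {name: init_params(channels, rng) for name in ("segmentation", "depth")}
task_features = {name: apply_task_attention(shared, p) for name, p in task_params.items()}
```

Because the sigmoid keeps every mask value strictly between 0 and 1, each task can only attenuate shared features, never amplify them, and the whole path stays differentiable so the masks can be trained jointly with the backbone.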

Experimental results

Research questions

RQ1: Can a multi-task learning architecture automatically learn both shared and task-specific features without explicit architectural separation?
RQ2: How does attention-based feature selection improve performance and robustness compared to fixed feature sharing in multi-task networks?
RQ3: To what extent does the proposed method reduce sensitivity to loss weighting hyperparameters in multi-task training?
RQ4: Can the architecture maintain high performance while being significantly more parameter-efficient than existing multi-task networks?
RQ5: Does the method generalize across diverse tasks, including dense prediction and image classification, on benchmark datasets?

Key findings

MTAN achieves state-of-the-art multi-task performance on CityScapes (semantic segmentation and depth estimation) and on NYUv2 (semantic segmentation, depth estimation, and surface normal prediction), while keeping the parameter count low relative to prior methods.
On the Visual Decathlon Challenge, where each of the 10 image classification tasks contributes up to 1000 points, MTAN achieves a cumulative score of 2941, surpassing most baselines and remaining competitive with the state of the art without complex regularization.
The method shows greater performance gain with increasing task complexity, outperforming single-task attention networks (STAN) on multi-task setups, especially for complex tasks.
Attention masks visually demonstrate task-specific feature selection, with depth tasks showing higher contrast masks, indicating stronger reliance on task-specific features.
MTAN is robust to various loss weighting schemes, including the proposed Dynamic Weight Average (DWA), which improves training stability and convergence.
The architecture is parameter-efficient: all 10 Visual Decathlon tasks share one network, and each task adds only a lightweight attention module, which keeps model size well below that of methods with explicit task-specific branches.
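
The Dynamic Weight Average scheme mentioned above can be illustrated with a short sketch. It assumes the formulation in which each task's weight is a softmax over the ratio of its two most recent epoch losses, scaled so the weights sum to the number of tasks, with a temperature of 2. The function names `dwa_weights` and `weighted_loss` and all loss values below are hypothetical and chosen only for illustration.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Task weights from the relative rate of change of each task's loss.

    prev_losses:      average loss per task at epoch t-1
    prev_prev_losses: average loss per task at epoch t-2
    A ratio near 1 means the loss is barely decreasing, so that task
    receives a larger weight; a small ratio means fast progress.
    """
    num_tasks = len(prev_losses)
    ratios = [lp / lpp for lp, lpp in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    # Scale the softmax so that the weights sum to the number of tasks.
    return [num_tasks * e / total for e in exps]


def weighted_loss(losses, weights):
    """Total multi-task loss as a weighted sum of per-task losses."""
    return sum(w * l for w, l in zip(weights, losses))


# Made-up losses: segmentation improved quickly, depth improved slowly.
losses_t2 = [1.0, 1.0]   # epoch t-2: [segmentation, depth]
losses_t1 = [0.5, 0.9]   # epoch t-1: [segmentation, depth]

weights = dwa_weights(losses_t1, losses_t2)
total = weighted_loss(losses_t1, weights)
```

In this example the depth task, whose loss fell more slowly, ends up with the larger weight. A higher temperature flattens the softmax and pushes all weights toward 1, which corresponds to plain equal weighting.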

This review was created by AI and reviewed by human editors.