[Paper Review] End-to-End Multi-Task Learning with Attention
This paper proposes the Multi-Task Attention Network (MTAN), a parameter-efficient, end-to-end multi-task learning architecture that uses task-specific soft-attention modules to dynamically select features from a shared global feature pool. MTAN reports state-of-the-art multi-task results on semantic segmentation, depth estimation, and image classification, while being less sensitive to the choice of loss weighting scheme and using fewer parameters than prior methods.
We propose a novel multi-task learning architecture, which allows learning of task-specific feature-level attention. Our design, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with a soft-attention module for each task. These modules allow for learning of task-specific features from the global features, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be trained end-to-end and can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. We evaluate our approach on a variety of datasets, across both image-to-image predictions and image classification tasks. We show that our architecture is state-of-the-art in multi-task learning compared to existing methods, and is also less sensitive to various weighting schemes in the multi-task loss function. Code is available at https://github.com/lorenmt/mtan.
Motivation & Objective
- To address the dual challenges of effective feature sharing and loss balancing in multi-task learning.
- To design a unified architecture that automatically learns both task-shared and task-specific features without manual intervention.
- To improve parameter efficiency and scalability in multi-task networks, especially as the number of tasks increases.
- To reduce sensitivity to hyperparameter tuning of loss weights, which often hinders training stability in multi-task setups.
- To achieve state-of-the-art performance across diverse multi-task benchmarks, including dense prediction and image classification tasks.
Proposed method
- The architecture uses a single shared backbone network to produce a global feature pool from input data.
- For each task, a soft-attention module is applied at each convolutional block to reweight shared features based on task relevance.
- Attention masks are differentiable and trained end-to-end, enabling automatic selection of task-specific features from the shared representation.
- The method is compatible with any feed-forward neural network, such as SegNet or Wide ResNet, allowing flexible backbone integration.
- A novel Dynamic Weight Average (DWA) loss weighting scheme is proposed, which adapts task weights based on the rate of change of each task's loss.
- The network is trained end-to-end using standard optimization, with no need for task-specific head separation or complex regularization.
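To make the soft-attention step above concrete, here is a minimal sketch of how a per-task mask can reweight shared features at a single pixel. It is a simplification under stated assumptions, not the authors' implementation: the function names, the random parameters, and the feature values are all hypothetical, and the batch normalization, ReLU layers, and cross-block connections that the real attention modules use are left out. A 1x1 convolution is modeled as a plain linear map across channels.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(weights, bias, features):
    """Apply a 1x1 convolution at one pixel: a linear map across channels."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def attention_mask(weights, bias, shared):
    """Soft attention mask with one value in (0, 1) per shared channel."""
    return [sigmoid(z) for z in conv1x1(weights, bias, shared)]


def task_features(mask, shared):
    """Element-wise reweighting of the shared features by the mask."""
    return [m * s for m, s in zip(mask, shared)]


def random_task_params(channels, rng):
    """Hypothetical stand-in for learned per-task attention parameters."""
    weights = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(channels)]
    bias = [rng.uniform(-1, 1) for _ in range(channels)]
    return weights, bias


rng = random.Random(0)
shared = [0.8, -1.2, 0.3, 2.0]  # shared features at one pixel (made-up values)
params = {
    task: random_task_params(len(shared), rng)
    for task in ("segmentation", "depth")
}

# Each task computes its own mask over the same shared features.
attended = {}
for task, (weights, bias) in params.items():
    mask = attention_mask(weights, bias, shared)
    attended[task] = task_features(mask, shared)
    print(task, [round(v, 3) for v in attended[task]])
```

Because the mask is a sigmoid output, it can only scale each shared feature towards zero, never amplify it. Every operation is differentiable, which is what allows the masks to be trained end-to-end together with the shared network.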
Experimental results
Research questions
- RQ1: Can a multi-task learning architecture automatically learn both shared and task-specific features without explicit architectural separation?
- RQ2: How does attention-based feature selection improve performance and robustness compared to fixed feature sharing in multi-task networks?
- RQ3: To what extent does the proposed method reduce sensitivity to loss weighting hyperparameters in multi-task training?
- RQ4: Can the architecture maintain high performance while being significantly more parameter-efficient than existing multi-task networks?
- RQ5: Does the method generalize across diverse tasks, including dense prediction and image classification, on benchmark datasets?
Key findings
- MTAN outperforms the compared multi-task baselines on dense prediction: semantic segmentation and depth estimation on CityScapes, with surface normal prediction added as a third task on NYUv2, while using fewer parameters than training one network per task.
- On the Visual Decathlon Challenge (10 image classification tasks, each scored out of 1000 points), MTAN reaches a cumulative score of 2941, surpassing most baselines and remaining competitive with the state of the art without complex regularization.
- The performance gain grows with task complexity: MTAN's margin over the Single-Task Attention Network (STAN) baseline is largest on the more complex tasks.
- Attention masks visually demonstrate task-specific feature selection, with depth tasks showing higher contrast masks, indicating stronger reliance on task-specific features.
- MTAN is robust to various loss weighting schemes, including the proposed Dynamic Weight Average (DWA), which improves training stability and convergence.
- The architecture is parameter-efficient: each additional task adds only a lightweight attention module on top of the shared backbone, so the model stays much smaller than methods with explicit task-specific branches, including on the 10-task Visual Decathlon.
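The Dynamic Weight Average scheme mentioned above can be sketched in a few lines. This is a hedged illustration of the idea as described in the paper, where each task's weight depends on the ratio of its losses over the two previous epochs: w_k = L_k(t-1) / L_k(t-2), and lambda_k = K * exp(w_k / T) / sum_i exp(w_i / T) for K tasks with temperature T. The function names, the loss values, and the default temperature of 2.0 are assumptions made for this example.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Average: task weights from the ratio of consecutive losses.

    prev_losses holds L_k(t-1) and prev_prev_losses holds L_k(t-2),
    one entry per task. The returned weights sum to the number of tasks.
    """
    num_tasks = len(prev_losses)
    # Relative descent rate: a ratio near 1 means the task is improving slowly.
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [num_tasks * e / total for e in exps]


def total_loss(weights, losses):
    """Weighted sum of per-task losses used as the training objective."""
    return sum(w * l for w, l in zip(weights, losses))


# Made-up average losses for two tasks over the two previous epochs.
loss_two_epochs_ago = [1.00, 1.00]
loss_last_epoch = [0.50, 0.90]  # task 0 is improving faster than task 1

weights = dwa_weights(loss_last_epoch, loss_two_epochs_ago)
print([round(w, 4) for w in weights])
print(round(total_loss(weights, loss_last_epoch), 4))
```

In this example the slower-improving task (task 1) receives the larger weight, and the weights always sum to the number of tasks. For the first two epochs there is no loss history yet, so a caller would typically use equal weights of 1 until two epochs of losses are available. A larger temperature flattens the weights towards uniform.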
This review was created by AI and reviewed by human editors.