[Paper Review] Multi-Head Attention with Disagreement Regularization
This paper proposes disagreement regularization, which enhances multi-head attention by explicitly encouraging diversity among attention heads. Three types of regularization (on subspaces, attended positions, and output representations) are applied to Transformer models and improve translation performance on English-German and Chinese-English tasks. With the regularization, Transformer-Base performs close to Transformer-Big while training nearly twice as fast.
Multi-head attention is appealing for the ability to jointly attend to information from different representation subspaces at different positions. In this work, we introduce a disagreement regularization to explicitly encourage the diversity among multiple attention heads. Specifically, we propose three types of disagreement regularization, which respectively encourage the subspace, the attended positions, and the output representation associated with each attention head to be different from other heads. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation tasks demonstrate the effectiveness and universality of the proposed approach.
Motivation & Objective
- To address the lack of explicit diversity enforcement among multi-head attention heads in Transformers.
- To improve neural machine translation performance by encouraging each attention head to learn distinct features.
- To investigate whether explicitly regularizing attention head disagreement enhances model generalization and efficiency.
- To evaluate the effectiveness of three distinct disagreement regularization types on different components of multi-head attention.
- To demonstrate that a smaller model (Transformer-Base) with disagreement regularization can match the performance of a larger model (Transformer-Big) while training significantly faster.
Proposed method
- Introduces an auxiliary training objective that adds a disagreement regularization term to the likelihood objective, weighted by a hyperparameter λ (set to 1.0).
- Proposes three types of disagreement regularization: on projected subspaces (V^i, V^j), on attended positions (via element-wise multiplication of attention matrices), and on output representations (O^i, O^j).
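- Uses negative cosine similarity as the disagreement measure for subspaces and output representations, so that maximizing it pushes head representations apart; disagreement on attended positions is instead computed from the element-wise product of the heads' attention matrices.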
- Applies regularization terms independently or in combination to the multi-head attention mechanism within the Transformer architecture.
- Employs the standard Transformer encoder-decoder framework with multi-head self-attention, and integrates the disagreement regularization during training without adding new parameters.
- Measures disagreement using exp(D) for interpretability, where higher values (up to 1.0) indicate greater orthogonality (diversity) between heads.
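The objective and the three terms described above can be written out as follows. This is a reconstruction from the description in this review, not a quotation from the paper: $H$ (number of heads), $L$ (the likelihood term), $A^i$ (attention weights of head $i$), and the averaging over all head pairs are notation assumed here.

```latex
\begin{align}
\hat{\theta} &= \arg\max_{\theta} \big\{ L(y \mid x; \theta) + \lambda \, D(x, y; \theta) \big\} \\
D_{\text{subspace}} &= -\frac{1}{H^2} \sum_{i=1}^{H} \sum_{j=1}^{H}
    \frac{V^i \cdot V^j}{\lVert V^i \rVert \, \lVert V^j \rVert} \\
D_{\text{position}} &= -\frac{1}{H^2} \sum_{i=1}^{H} \sum_{j=1}^{H}
    \lVert A^i \odot A^j \rVert_1 \\
D_{\text{output}} &= -\frac{1}{H^2} \sum_{i=1}^{H} \sum_{j=1}^{H}
    \frac{O^i \cdot O^j}{\lVert O^i \rVert \, \lVert O^j \rVert}
\end{align}
```

Each term is negated so that maximizing it pushes heads apart: the more similar two heads are, the lower $D$ becomes. Because $D$ is computed from existing activations, it adds no trainable parameters.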
Experimental results
Research questions
- RQ1: Does explicitly regularizing attention head disagreement improve neural machine translation performance?
- RQ2: Which component of the multi-head attention mechanism (subspace, attended positions, or output representation) is most effective to regularize for improved performance?
- RQ3: Can a smaller Transformer model (Base) achieve performance comparable to a larger model (Big) through disagreement regularization?
- RQ4: To what extent do standard multi-head attention heads attend to the same positions, and does this limit their representational diversity?
- RQ5: How does disagreement regularization affect the learned representations across different encoder layers?
Key findings
- Disagreement regularization consistently improves translation performance on both WMT14 English-to-German and WMT17 Chinese-to-English tasks.
- Transformer-Base with disagreement regularization achieves performance comparable to Transformer-Big, while training nearly twice as fast.
- The Output disagreement regularization achieves the highest disagreement score (exp(D) ≈ 0.997), indicating near-perpendicular output vectors across heads.
- Baseline multi-head attention shows minimal disagreement on attended positions (exp(D) = 0.007), indicating most heads attend to the same positions.
- Position-based regularization does not significantly increase disagreement on subspaces or outputs, explaining its limited effectiveness when combined with other terms.
- The results suggest that multi-head attention primarily encodes head differences in learned representations rather than in attended positions, challenging assumptions about positional diversity.
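To make the exp(D) scores in these findings easier to read, the following is a minimal sketch of how such a score behaves on toy inputs. It is not the paper's implementation: the function names are hypothetical, each head is reduced to a single vector, and D is averaged over distinct head pairs only (an assumption made here so that fully dissimilar heads score exactly 1.0).

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def position_overlap(a, b):
    """Sum of the element-wise product of two attention distributions."""
    return sum(x * y for x, y in zip(a, b))


def disagreement_score(heads, similarity):
    """Return exp(D), where D is the negative mean pairwise similarity.

    Assumption: only pairs of distinct heads (i != j) are averaged.
    """
    pairs = list(permutations(range(len(heads)), 2))
    mean_sim = sum(similarity(heads[i], heads[j]) for i, j in pairs) / len(pairs)
    return math.exp(-mean_sim)


# Toy output vectors: three identical heads vs. three mutually orthogonal heads.
identical_outputs = [[1.0, 0.0, 0.0]] * 3
orthogonal_outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Toy attention distributions over three positions for a single query.
same_position = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
different_positions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

score_identical = disagreement_score(identical_outputs, cosine)
score_orthogonal = disagreement_score(orthogonal_outputs, cosine)
score_same_position = disagreement_score(same_position, position_overlap)
score_different_positions = disagreement_score(different_positions, position_overlap)
```

In this toy setting, identical heads and heads attending to the same position both score exp(-1) ≈ 0.368, while orthogonal heads and heads attending to different positions score 1.0. The absolute values are not comparable to the scores reported in the paper, which are computed over full attention matrices and trained models; only the direction (higher means more diverse) carries over.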
This review was created by AI and reviewed by human editors.