[Paper Review] How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
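This paper introduces Robust Informative Fine-Tuning (RIFT), an information-theoretic adversarial fine-tuning method that mitigates forgetting and improves the robustness of pre-trained language models on sentiment analysis and natural language inference.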
The fine-tuning of pre-trained language models has achieved great success in many NLP fields. Yet, it is strikingly vulnerable to adversarial examples, e.g., word substitution attacks using only synonyms can easily fool a BERT-based sentiment analysis model. In this paper, we demonstrate that adversarial training, the prevalent defense technique, does not directly fit a conventional fine-tuning scenario, because it suffers severely from catastrophic forgetting: failing to retain the generic and robust linguistic features that have already been captured by the pre-trained model. In this light, we propose Robust Informative Fine-Tuning (RIFT), a novel adversarial fine-tuning method from an information-theoretical perspective. In particular, RIFT encourages an objective model to retain the features learned from the pre-trained model throughout the entire fine-tuning process, whereas a conventional one only uses the pre-trained weights for initialization. Experimental results show that RIFT consistently outperforms the state of the art on two popular NLP tasks, sentiment analysis and natural language inference, under different attacks across various pre-trained language models.
Motivation and Objectives
- Address catastrophic forgetting during fine-tuning of pre-trained language models under adversarial attacks.
- Propose an information-theoretic framework to retain robust, generic pre-trained features throughout fine-tuning.
- Improve robustness of downstream NLP tasks (sentiment analysis and natural language inference) against word-substitution adversarial attacks.
Proposed Method
- Introduce Robust Informative Fine-Tuning (RIFT) that maximizes mutual information between the objective model outputs and both the class label and the pre-trained model outputs.
- Decompose I(S;Y,T) into I(S;Y) + I(S;T|Y) to enable tractable optimization.
- Maximize I(S;Y) through a variational lower bound, which yields a cross-entropy loss plus a term that encourages consistent predictions on x and its adversarial counterpart.
- Apply a contrastive-like objective to maximize I(S;T|Y) via a noise-contrastive estimation lower bound with a class-conditioned score f_y.
- Generate adversarial examples x̂ by maximizing the KL divergence between the current objective model's predictions on x and on the perturbed input; this self-supervised attack does not use the label, which avoids label leakage.
- Combine L_r-task and L_r-info with a weighting alpha to control information absorption from the pre-trained model.
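The way the two loss terms in the list above fit together can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. The function names, the weight `beta` on the KL term, the additive form `task + alpha * info`, and the weighted dot product used as the class-conditioned score f_y are all assumptions made for illustration; the review only states that such a score exists and that alpha weights the two losses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q is assumed strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def task_loss(clean_logits, adv_logits, label, beta=1.0):
    """Sketch of L_r-task: cross-entropy on x plus a KL term that ties the
    predictions on x and on the adversarial input. beta is an assumed weight."""
    p_clean = softmax(clean_logits)
    p_adv = softmax(adv_logits)
    cross_entropy = -math.log(p_clean[label])
    return cross_entropy + beta * kl_divergence(p_clean, p_adv)


def class_score(s, t, w_y):
    """Hypothetical class-conditioned log-score log f_y(s, t): a dot product
    of s and t weighted per dimension by a class-specific vector w_y."""
    return sum(w * a * b for w, a, b in zip(w_y, s, t))


def info_loss(student_feats, teacher_feats, labels, class_weights, i):
    """Sketch of L_r-info for sample i as an InfoNCE loss. The positive pair is
    (student_i, teacher_i); the other teacher features in the batch serve as
    negatives. Minimizing this raises an NCE lower bound on I(S;T|Y)."""
    w_y = class_weights[labels[i]]
    scores = [class_score(student_feats[i], t, w_y) for t in teacher_feats]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(sc - m) for sc in scores))
    return log_norm - scores[i]


def rift_loss(task, info, alpha):
    """Assumed additive combination; alpha scales how strongly the objective
    model is pulled towards the pre-trained features."""
    return task + alpha * info


# Toy batch of two samples with 2-dimensional features and two classes.
student = [[1.0, 0.0], [0.0, 1.0]]   # objective-model features S
teacher = [[0.9, 0.1], [0.2, 0.8]]   # frozen pre-trained features T
labels = [0, 1]
weights = [[1.0, 1.0], [1.0, 1.0]]   # one weight vector per class

l_task = task_loss([2.0, 0.5], [1.0, 1.2], labels[0])
l_info = info_loss(student, teacher, labels, weights, 0)
print(rift_loss(l_task, l_info, alpha=0.1))
```

With `alpha = 0` the sketch reduces to the adversarial task loss alone, which corresponds to the conventional setting where the pre-trained model is used only for initialization. When every teacher feature in the batch is identical, `info_loss` returns log N for a batch of N samples, the value at which the contrastive bound carries no information.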
Experimental Results
Research Questions
- RQ1: Can adversarial fine-tuning cause intensified forgetting when fine-tuning pre-trained language models?
- RQ2: Does an information-theoretic approach that preserves pre-trained information improve robustness without sacrificing vanilla accuracy?
- RQ3: Is maximizing I(S;T|Y) more effective for downstream tasks than maximizing I(S;T) in the adversarial fine-tuning setting?
- RQ4: What is the impact of the hyper-parameter alpha that governs information absorption from the pre-trained model?
- RQ5: Do RIFT gains generalize across tasks (sentiment analysis and natural language inference) and architectures (BERT-base, RoBERTa-base) under word-substitution attacks?
Key Findings
- RIFT consistently outperforms state-of-the-art adversarial fine-tuning methods on IMDB and SNLI under genetic and PWWS attacks.
- Maximizing I(S;T|Y) yields better robustness than maximizing I(S;T) for downstream tasks.
- Increasing alpha up to a point improves both robust accuracy and vanilla accuracy, indicating beneficial memorization of robust pre-trained features.
- RIFT offers robustness gains across different pre-trained backbones (BERT-base-uncased and RoBERTa-base).
- The trade-off curve shows RIFT improves both robust and standard accuracy, not just robustness at the expense of clean performance.
This review was generated by AI and checked by a human editor.