QUICK REVIEW

[Paper Review] Linear Adversarial Concept Erasure

Shauli Ravfogel, Michael Twiton|arXiv (Cornell University)|Jan 28, 2022

Adversarial Robustness in Machine Learning|Cited by 23

One-Line Summary

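Introduces a linear minimax framework for identifying and erasing concept subspaces from pre-trained representations and presents R-LACE, which effectively mitigates bias in static and contextual models while preserving interpretability.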

ABSTRACT

Modern neural models trained on textual data rely on pre-trained representations that emerge without direct supervision. As these representations are increasingly being used in real-world applications, the inability to control their content becomes an increasingly important problem. We formulate the problem of identifying and erasing a linear subspace that corresponds to a given concept, in order to prevent linear predictors from recovering the concept. We model this problem as a constrained, linear maximin game, and show that existing solutions are generally not optimal for this task. We derive a closed-form solution for certain objectives, and propose a convex relaxation, R-LACE, that works well for others. When evaluated in the context of binary gender removal, the method recovers a low-dimensional subspace whose removal mitigates bias by intrinsic and extrinsic evaluation. We show that the method is highly expressive, effectively mitigating bias in deep nonlinear classifiers while maintaining tractability and interpretability.

Motivation and Objectives

Motivate and formalize post-hoc removal of a linear concept from fixed representations to prevent linear predictors from recovering the concept.
Define a constrained linear minimax game to identify a bias subspace and project representations onto its orthogonal complement.
Derive closed-form solutions for certain objectives and develop a convex relaxation (R-LACE) for classification tasks.
Evaluate gender bias removal in static (GloVe) and contextual (BERT) representations and analyze bias mitigation and task impact.

Proposed Method

Model the problem as an orthogonal projection based minimax game that neutralizes a rank-k subspace B via P = I_D − W^T W with WW^T = I_k.
Specialize to linear regression, partial least squares (Rayleigh quotient) and logistic regression, deriving closed-form solutions for regression and Rayleigh quotient cases.
Introduce R-LACE by convexifying the set of projection matrices to the Fantope, enabling gradient-based optimization for classification tasks.
Provide algorithms for alternating optimization between θ and P with projection onto the convex hull (Fantope) to solve the relaxed problem.
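
The two building blocks named above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: `nullspace_projection` builds P = I_D − W^T W from a W with orthonormal rows, and `project_onto_fantope` computes the Euclidean projection of a symmetric matrix onto the Fantope, taken here to be the set of symmetric matrices with eigenvalues in [0, 1] and trace k. Both function names are hypothetical, the bisection search for the eigenvalue shift is one common way to compute this projection, and the review does not say whether the paper relaxes the projector P itself or its complement, so the sketch stays agnostic on that point.

```python
import numpy as np


def nullspace_projection(W: np.ndarray) -> np.ndarray:
    """Return P = I_D - W^T W for a (k, D) matrix W with orthonormal rows."""
    k, D = W.shape
    if not np.allclose(W @ W.T, np.eye(k)):
        raise ValueError("rows of W must be orthonormal (W W^T = I_k)")
    return np.eye(D) - W.T @ W


def project_onto_fantope(A: np.ndarray, k: int, iters: int = 100) -> np.ndarray:
    """Project a symmetric matrix onto {P : 0 <= P <= I, trace(P) = k}.

    Assumes 0 < k < D. The projection keeps the eigenvectors of A and
    replaces each eigenvalue lam by clip(lam - t, 0, 1), where the shift t
    is chosen so that the clipped eigenvalues sum to k.
    """
    A = (A + A.T) / 2.0  # symmetrize to guard against round-off
    eigvals, eigvecs = np.linalg.eigh(A)
    # The clipped sum is non-increasing in t: it equals D at lo and 0 at hi.
    lo, hi = eigvals.min() - 1.0, eigvals.max()
    for _ in range(iters):
        t = (lo + hi) / 2.0
        if np.clip(eigvals - t, 0.0, 1.0).sum() > k:
            lo = t  # sum too large, shift further
        else:
            hi = t
    clipped = np.clip(eigvals - (lo + hi) / 2.0, 0.0, 1.0)
    return (eigvecs * clipped) @ eigvecs.T
```

In an alternating scheme of the kind described above, a gradient step on the relaxed matrix would be followed by a call to `project_onto_fantope` to return it to the feasible set; a final rank-k projection can then be read off from the top eigenvectors.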

Experimental Results

Research Questions

RQ1: Can we identify a linear subspace that, when projected out, prevents linear predictors from recovering a target concept from fixed representations?
RQ2: What is the best (smallest rank k) subspace to neutralize to maximize the loss on a given concept while preserving input information otherwise?
RQ3: How does the proposed R-LACE relaxation perform for classification tasks compared to exact minimax solutions and to INLP?
RQ4: Do linear concept erasure methods transfer effectiveness to deep nonlinear classifiers and real-world bias metrics?

Key Findings

Model	Gender prediction accuracy (%)	Profession prediction accuracy (%)	GAP^{TPR,RMS}_{Male,y}	σ_{(GAP^{TPR}, %Women)}
BERT-frozen	99.32	79.14	0.145	0.813
BERT-frozen + RLACE (rank 1)	52.48	78.86	0.109	0.680
BERT-frozen + RLACE (rank 100)	52.77	77.28	0.102	0.615
BERT-frozen + INLP (rank 1)	98.98	79.09	0.137	0.816
BERT-frozen + INLP (rank 100)	53.21	71.94	0.099	0.604
BERT-finetuned	96.89 ± 1.01	85.12 ± 0.08	0.123 ± 0.011	0.810 ± 0.023
BERT-finetuned + RLACE (rank 1)	54.59 ± 0.66	85.09 ± 0.07	0.117 ± 0.011	0.794 ± 0.025
BERT-finetuned + RLACE (rank 100)	54.33 ± 0.36	85.04 ± 0.09	0.115 ± 0.014	0.792 ± 0.025
BERT-finetuned + INLP (rank 1)	93.52 ± 1.42	85.12 ± 0.08	0.122 ± 0.011	0.808 ± 0.024
BERT-finetuned + INLP (rank 100)	53.04 ± 0.97	84.98 ± 0.06	0.113 ± 0.009	0.797 ± 0.027
BERT-adv (MLP adversary)	99.57 ± 0.05	84.87 ± 0.11	0.128 ± 0.004	0.840 ± 0.015
BERT-adv (Linear adversary)	99.23 ± 0.09	84.92 ± 0.12	0.124 ± 0.005	0.827 ± 0.012
Majority	53.52	30.0	-	-
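
The review does not define the GAP^{TPR,RMS} column. A minimal sketch, assuming the definition commonly used in the fairness literature: for each profession y, the gap is the true-positive rate for male examples minus that for female examples, and the reported number is the root mean square of these gaps over professions. The function name and arguments are hypothetical, and the paper's exact evaluation code may differ.

```python
import numpy as np


def tpr_gap_rms(y_true, y_pred, is_male):
    """RMS over classes of TPR(male, y) - TPR(female, y)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    is_male = np.asarray(is_male, dtype=bool)
    gaps = []
    for label in np.unique(y_true):
        in_class = y_true == label
        male = in_class & is_male
        female = in_class & ~is_male
        if not male.any() or not female.any():
            continue  # TPR is undefined for one group; skip this class
        tpr_male = np.mean(y_pred[male] == label)
        tpr_female = np.mean(y_pred[female] == label)
        gaps.append(tpr_male - tpr_female)
    if not gaps:
        raise ValueError("no class contains examples from both groups")
    return float(np.sqrt(np.mean(np.square(gaps))))
```

A value of 0 means every profession is recognized equally well for both groups; larger values indicate that the classifier's recall depends more strongly on gender.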

A linear minimax formulation can identify a low-dimensional bias subspace whose removal reduces linear predictability of the concept.
Closed-form equilibria exist for linear regression and Rayleigh quotient (e.g., PLS) settings, with the optimal θ and P characterized analytically.
R-LACE, a convex relaxation, effectively solves classification-based concept erasure by alternating optimization over θ and P with projection onto the Fantope.
In gender-bias experiments, a rank-1 projection often suffices to neutralize linearly recoverable gender information in GloVe while preserving semantic content (SimLex-999); nonlinear models can still predict gender after the projection.
R-LACE achieves substantial bias mitigation in both static and contextual embeddings, often reaching similar or better bias reduction than INLP in fewer iterations.
Linear erasure can improve fairness metrics with little impact on main-task performance in finetuned models: with rank-1 R-LACE on finetuned BERT, the TPR-GAP RMS drops from 0.123 to 0.117 while profession accuracy moves from 85.12 to 85.09.
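
To give some intuition for why a single direction can be enough in the regression setting, here is a small sketch under stated assumptions. The review does not reproduce the paper's closed-form solution, so this relies only on a standard least-squares fact: if the direction X^T y is projected out of the inputs, the target becomes orthogonal to every projected feature, and the best linear regressor does no better than always predicting zero. The helper names and the synthetic data are hypothetical, and this illustrates the idea rather than the paper's derivation.

```python
import numpy as np


def erase_covariance_direction(X, y):
    """Return P = I - u u^T with u proportional to X^T y (rank-1 erasure)."""
    u = X.T @ y
    u = u / np.linalg.norm(u)
    return np.eye(X.shape[1]) - np.outer(u, u)


def best_linear_mse(X, y):
    """Mean squared error of the least-squares fit of y on X (no intercept)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ theta) ** 2))


# Synthetic data: y is almost perfectly linear in X before erasure.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=500)

P = erase_covariance_direction(X, y)
mse_before = best_linear_mse(X, y)      # small: y is linearly recoverable
mse_after = best_linear_mse(X @ P, y)   # matches the trivial predictor
mse_zero = float(np.mean(y ** 2))       # loss of always predicting zero
```

After the projection, `mse_after` coincides with `mse_zero` up to floating-point error, because (XP)^T y = P X^T y = 0. Classification losses such as logistic regression lack this kind of shortcut, which is where the relaxed, iterative procedure comes in.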

This review was generated by AI and checked by a human editor.