QUICK REVIEW

[Paper Review] Policy Poisoning in Batch Reinforcement Learning and Control

Yuzhe Ma, Xuezhou Zhang|arXiv (Cornell University)|Oct 13, 2019

Adversarial Robustness in Machine Learning|Cited by 43

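ONE-SENTENCE SUMMARY

This paper proposes a unified convex-optimization framework for data-poisoning attacks that, by minimally perturbing the training rewards, force batch reinforcement learning and control learners to adopt a policy chosen by the attacker; it instantiates and analyzes the attack on tabular certainty equivalence (TCE) and linear quadratic regulator (LQR) victims and demonstrates its effectiveness experimentally.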

ABSTRACT

We study a security threat to batch reinforcement learning and control where the attacker aims to poison the learned policy. The victim is a reinforcement learner / controller which first estimates the dynamics and the rewards from a batch data set, and then solves for the optimal policy with respect to the estimates. The attacker can modify the data set slightly before learning happens, and wants to force the learner into learning a target policy chosen by the attacker. We present a unified framework for solving batch policy poisoning attacks, and instantiate the attack on two standard victims: tabular certainty equivalence learner in reinforcement learning and linear quadratic regulator in control. We show that both instantiations result in a convex optimization problem on which global optimality is guaranteed, and provide analysis on attack feasibility and attack cost. Experiments show the effectiveness of policy poisoning attacks.

MOTIVATION AND OBJECTIVES

Motivate and formalize a data-poisoning threat to batch RL and control learners that estimate dynamics and rewards from a batch dataset.
Develop a unified optimization framework that guarantees tractability and global optimality for poisoning attacks.
Instantiate and analyze the attack against two representative victims: tabular certainty equivalence (TCE) and linear quadratic regulator (LQR).
Provide theoretical insight into attack feasibility and cost, and validate effectiveness through experiments.

PROPOSED METHOD

Formulate a bi-level attack: modify the training rewards to force learning of a target policy while minimizing a chosen norm of reward changes.
Recast the attack as a convex optimization problem using an epsilon-robust target Q-polytope to ensure a unique target policy.
For TCE, express the estimated model by maximum likelihood for P and least-squares for R, then impose Bellman consistency with the target policy to obtain a convex program (with linear constraints).
Prove attack feasibility and derive bounds on the minimal attack cost as a function of Δ(ε), the target policy's suboptimality gap under the clean data plus the margin ε.
For LQR, model the batch identification as least-squares, enforce a surrogate Riccati-based structure, and derive a convex surrogate attack by relaxing SDP constraints to a tractable form.
Provide empirical demonstrations showing that small reward perturbations can compel the learner to adopt the attacker’s target policy.
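The TCE steps above can be illustrated with a small, standard-library-only sketch. It is not the paper's minimal-cost convex program (that would need a convex solver); instead it builds one feasible reward perturbation, under the assumption that every state-action pair appears in the data. The learner is modeled as maximum-likelihood transition estimates, mean rewards, and value iteration. All function names, the toy dataset, and the choice to shift every target action's Q-value up by half the gap (and every other action's down by half) are illustrative assumptions, not taken from the source.

```python
def estimate_model(data, n_states, n_actions):
    """TCE estimates: MLE transition frequencies and mean rewards.
    Assumes every (state, action) pair occurs at least once in data."""
    visits = [[0] * n_actions for _ in range(n_states)]
    reward_sum = [[0.0] * n_actions for _ in range(n_states)]
    counts = [[[0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    for s, a, r, s_next in data:
        visits[s][a] += 1
        reward_sum[s][a] += r
        counts[s][a][s_next] += 1
    P = [[[c / visits[s][a] for c in counts[s][a]] for a in range(n_actions)]
         for s in range(n_states)]
    R = [[reward_sum[s][a] / visits[s][a] for a in range(n_actions)]
         for s in range(n_states)]
    return P, R


def solve_q(P, R, gamma, iters=2000):
    """Value iteration on the estimated model; returns the Q table."""
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        V = [max(row) for row in Q]
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
    return Q


def greedy(Q):
    """Greedy policy: the best action index in each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


def poison_rewards(data, target, gamma, eps, n_states, n_actions):
    """Feasible (not necessarily minimal) reward poisoning, hypothetical helper.
    Returns the poisoned data and the gap used to build it."""
    P, R0 = estimate_model(data, n_states, n_actions)
    Q0 = solve_q(P, R0, gamma)
    # Largest amount by which a non-target action beats the target, plus eps.
    gap = max(
        max(0.0, max(Q0[s][a] for a in range(n_actions) if a != target[s])
            - Q0[s][target[s]] + eps)
        for s in range(n_states))
    # Target Q table: the target action leads every other action by >= eps.
    Qt = [[Q0[s][a] + (gap / 2 if a == target[s] else -gap / 2)
           for a in range(n_actions)] for s in range(n_states)]
    Vt = [Qt[s][target[s]] for s in range(n_states)]
    # Rewards that make Qt satisfy the Bellman optimality equation under P.
    Rt = [[Qt[s][a] - gamma * sum(p * v for p, v in zip(P[s][a], Vt))
           for a in range(n_actions)] for s in range(n_states)]
    # Shift each observed reward so the per-(s, a) mean becomes Rt.
    poisoned = [(s, a, r + Rt[s][a] - R0[s][a], s2) for s, a, r, s2 in data]
    return poisoned, gap


# Toy batch of (state, action, reward, next_state) tuples.
data = [(0, 0, 1.0, 0), (0, 1, 0.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
poisoned, gap = poison_rewards(data, target=[1, 0], gamma=0.9, eps=0.1,
                               n_states=2, n_actions=2)
P, R = estimate_model(poisoned, 2, 2)
print(greedy(solve_q(P, R, 0.9)))  # prints [1, 0]
```

On this toy batch the clean learner picks the policy [0, 1], while the learner trained on the poisoned rewards picks the target [1, 0]. Only rewards are changed; transitions are left untouched, matching the reward-only attack described above.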

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a batch RL or control learner be compelled to learn a target policy by minimally perturbing the training rewards?
RQ2: Is the policy-poisoning optimization tractable (convex) for common batch-learning victims such as TCE and LQR?
RQ3: What are the theoretical feasibility guarantees and cost bounds of policy poisoning in batch RL/control setups?
RQ4: Do practical experiments show that small reward changes suffice to steer learning toward attacker-specified policies?

KEY FINDINGS

Policy poisoning attacks are feasible and can be formulated as convex optimization problems with global optima.
For TCE, feasible attacks exist for any target policy and the attack cost scales with the suboptimality gap Δ(ε).
Attack cost bounds imply linear scaling in T for α=1, sqrt(T) for α=2, and constant for α=∞, where α is the order of the norm used to measure the reward change and T is the number of training items; sparse attacks are possible when α=1.
For LQR, attacker-chosen target policies consistent with Riccati equations can be induced by small reward perturbations; the attack cost remains small relative to the original data.
Experiments demonstrate that the attacker can force the learner to follow the target policy while only modestly perturbing rewards, and sparse attacks (α=1) are possible.
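The three cost regimes in the findings can be read against the attack objective. The block below is a sketch under two assumptions: that the attacker minimizes the α-norm of the reward change over the T training items, and that the three regimes correspond to the norm orders α = 1, 2, ∞. The symbols r^0 (clean rewards), r^* (optimal poisoned rewards), and π† (target policy) are notation chosen here for illustration, and the constants in the bound are not given in this summary.

```latex
\min_{r}\ \lVert r - r^{0} \rVert_{\alpha}
\quad \text{s.t. the learner trained on } r \text{ adopts } \pi^{\dagger},
\qquad
\lVert r^{*} - r^{0} \rVert_{\alpha} = O\!\left(\Delta(\varepsilon)\, T^{1/\alpha}\right)

\alpha = 1:\ O\!\left(\Delta(\varepsilon)\, T\right), \qquad
\alpha = 2:\ O\!\left(\Delta(\varepsilon)\, \sqrt{T}\right), \qquad
\alpha = \infty:\ O\!\left(\Delta(\varepsilon)\right)
```

Minimizing a 1-norm tends to yield solutions with many exact zeros, which is consistent with the observation that α=1 produces sparse attacks that change only a few rewards.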

This review was generated by AI and reviewed by human editors.