QUICK REVIEW

[Paper Review] Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action

Gong Gao, Yu Fu|arXiv (Cornell University)|Jan 27, 2026

Reinforcement Learning in Robotics | Citations: 0

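One-line summary

IRA combines Q-representation discrepancy, greedy action guidance, and instant policy updates to strengthen policy exploitation in online value-based RL, improving performance on MuJoCo tasks.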

ABSTRACT

Existing value-based online reinforcement learning (RL) algorithms suffer from slow policy exploitation due to ineffective exploration and delayed policy updates. To address these challenges, we propose an algorithm called Instant Retrospect Action (IRA). Specifically, we propose Q-Representation Discrepancy Evolution (RDE) to facilitate Q-network representation learning, enabling discriminative representations for neighboring state-action pairs. In addition, we adopt an explicit method to policy constraints by enabling Greedy Action Guidance (GAG). This is achieved through backtracking historical actions, which effectively enhances the policy update process. Our proposed method relies on providing the learning algorithm with accurate $k$-nearest-neighbor action value estimates and learning to design a fast-adaptable policy through policy constraints. We further propose the Instant Policy Update (IPU) mechanism, which enhances policy exploitation by systematically increasing the frequency of policy updates. We further discover that the early-stage training conservatism of the IRA method can alleviate the overestimation bias problem in value-based RL. Experimental results show that IRA can significantly improve the learning efficiency and final performance of online RL algorithms on eight MuJoCo continuous control tasks. The code is available at https://github.com/2706853499/IRA.

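Motivation and objectives

Address slow policy exploitation in online value-based RL, which stems from ineffective exploration and delayed policy updates.
Introduce IRA to improve Q-representation learning and the efficiency of policy updates.
Strengthen learning by combining representation discrepancy, explicit policy guidance, and instant updates.
Demonstrate improvements on eight MuJoCo continuous control tasks.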

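Proposed method

Retrieve the k nearest-neighbor actions, ranked by Chebyshev distance, and identify the optimal and suboptimal neighbors among them.
Introduce the RDE (Q-Representation Discrepancy Evolution) loss, which widens the representation gap between the current policy action and the suboptimal action.
Impose Greedy Action Guidance by constraining policy updates toward the optimal neighbor action (constraint strength mu).
Implement the Instant Policy Update mechanism, which raises the actor update frequency (d = 1).
Integrate RDE and Greedy Action Guidance into a TD3/DDPG-based framework, using double-Q target networks for stable value estimation.
Evaluate IRA against TD3, DDPG, PPO, ALH, PEER, and MBPO on eight MuJoCo tasks.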
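
The neighbor-retrieval step can be illustrated with a minimal pure-Python sketch. It assumes a plain list of past actions as the history buffer and a toy critic; the function names (`chebyshev`, `k_nearest_actions`, `best_and_worst`) are hypothetical and not taken from the authors' code. Picking the optimal and suboptimal neighbors by their critic value is also an assumption made here for illustration, since this review does not spell out the exact selection rule.

```python
from typing import Callable, List, Sequence, Tuple

Action = Sequence[float]


def chebyshev(a: Action, b: Action) -> float:
    """Chebyshev (L-infinity) distance: largest per-dimension absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def k_nearest_actions(query: Action, buffer: List[Action], k: int) -> List[Action]:
    """Return the k buffered actions closest to `query`, nearest first."""
    return sorted(buffer, key=lambda a: chebyshev(query, a))[:k]


def best_and_worst(
    neighbors: List[Action], q_value: Callable[[Action], float]
) -> Tuple[Action, Action]:
    """Assumed selection rule: highest-Q neighbor is 'optimal', lowest-Q is 'suboptimal'."""
    ranked = sorted(neighbors, key=q_value)
    return ranked[-1], ranked[0]


# Toy data: a 2-D action buffer and a stand-in critic that peaks at (0.5, 0.5).
history = [(0.0, 0.0), (0.5, 0.1), (0.2, 0.9), (1.0, 1.0)]
policy_action = (0.4, 0.2)


def toy_q(action: Action) -> float:
    return -sum((x - 0.5) ** 2 for x in action)


neighbors = k_nearest_actions(policy_action, history, k=2)
optimal, suboptimal = best_and_worst(neighbors, toy_q)
```

In a real agent the critic would also take the state as input, and the buffer search would typically be vectorized; both are left out to keep the sketch short.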

Figure 1: We introduce auxiliary signals to enhance learning capability and propose two core mechanisms: integrating representation-guided signals into Q-learning and introducing anchor points for policy updates.
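
The two mechanisms named in the caption can be sketched as scalar loss terms. This is a hedged illustration only: the review does not give the loss formulas, so the squared-distance anchor penalty and the cosine-similarity representation term below are assumed forms, and every function name is hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Sum of squared per-dimension differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; defined as 0.0 if either vector is zero."""
    norm_product = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm_product == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / norm_product


def gag_actor_loss(
    q_of_policy_action: float, policy_action: Vector, anchor_action: Vector, mu: float
) -> float:
    """Assumed form: maximize Q while staying close to the anchor (optimal neighbor).

    loss = -Q(s, pi(s)) + mu * ||pi(s) - a_opt||^2
    """
    return -q_of_policy_action + mu * squared_distance(policy_action, anchor_action)


def rde_penalty(phi_policy: Vector, phi_suboptimal: Vector) -> float:
    """Assumed form: similarity of critic features for the policy action and the
    suboptimal neighbor. Minimizing it pushes the two representations apart."""
    return cosine_similarity(phi_policy, phi_suboptimal)
```

Under these assumptions, a larger mu ties the policy more tightly to the anchor action, and the representation term would be added to the usual TD loss of the critic with some weight.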

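Experimental results

Research questions

RQ1: Does IRA improve the speed of policy exploitation and the final performance in online value-based RL?
RQ2: How does Q-representation discrepancy affect action discriminability and learning efficiency?
RQ3: Can greedy action guidance provide stable, constrained exploration that mitigates overestimation bias?
RQ4: What impact do instant policy updates have on learning speed and performance across continuous control tasks?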
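
To make the instant-update question concrete, the sketch below compares two actor schedules. It assumes the usual TD3 convention in which the actor is updated once every d critic updates (d = 2 is the common TD3 default) and contrasts it with d = 1. The function name `actor_update_steps` is hypothetical.

```python
from typing import List


def actor_update_steps(total_critic_updates: int, policy_delay: int) -> List[int]:
    """Return the critic-update indices (1-based) at which the actor is also updated."""
    if policy_delay < 1:
        raise ValueError("policy_delay must be at least 1")
    return [t for t in range(1, total_critic_updates + 1) if t % policy_delay == 0]


delayed = actor_update_steps(10, policy_delay=2)  # TD3-style delayed updates
instant = actor_update_steps(10, policy_delay=1)  # actor updated at every step
```

With d = 1 the actor receives twice as many gradient steps as with d = 2 over the same number of critic updates, which is the frequency increase the question refers to.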

Key findings

Task	IRA	ALH	PEER	TD3	DDPG	PPO	MBPO	Avg (normalized)
HalfCheetah-v3	9832 ± 517	7202 ± 527	7456 ± 375	7442 ± 477	6438 ± 282	334 ± 68	6782 ± 554	98.7
Hopper-v3	3412 ± 117	2993 ± 402	2722 ± 368	3079 ± 260	2114 ± 389	2068 ± 333	2671 ± 476	81.3
Walker2d-v3	3886 ± 193	4013 ± 177	3605 ± 494	3464 ± 478	2379 ± 420	791 ± 67	1389 ± 491	72.8
Ant-v3	5115 ± 213	4616 ± 505	4360 ± 450	4692 ± 377	1031 ± 262	13 ± 17	1739 ± 382	135.6
Humanoid-v3	4963 ± 166	4742 ± 229	4613 ± 205	4843 ± 203	258 ± 54	394 ± 15	406 ± 60	194.6
Reacher-v2	-4 ± 0	-6 ± 0	-7 ± 0	-6 ± 0	-10 ± 1	-5 ± 0	-30 ± 12	-
InvertedDouble-v2	9203 ± 178	8320 ± 1408	8319 ± 1407	7082 ± 1918	8749 ± 509	8506 ± 397	9359 ± 1	-
InvertedPendulum-v2	1000 ± 0	980 ± 24	983 ± 25	987 ± 20	982 ± 27	977 ± 23	965 ± 53	-

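IRA substantially outperforms the base TD3 across the eight MuJoCo tasks, with an average improvement of 36.9%.
On tasks such as HalfCheetah, Hopper, Ant, Humanoid, and InvertedPendulum, IRA achieves higher absolute returns than ALH, PEER, TD3, DDPG, PPO, and MBPO.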
IRA shows lower variance than TD3 on most tasks, which points to more stable and robust gains.
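RDE improves the representational discriminability of neighboring actions, which strengthens greedy action guidance.
IRA's early-stage conservatism helps mitigate Q-value overestimation bias.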
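
The per-task comparison can be checked directly from the mean returns in the table. The short sketch below copies those means (the ± spreads are ignored, so near-ties within noise are not accounted for) and reports which method has the highest mean on each task. The names `MEAN_RETURN` and `best_method` are introduced here for illustration.

```python
METHODS = ["IRA", "ALH", "PEER", "TD3", "DDPG", "PPO", "MBPO"]

# Mean returns copied from the table above, in the order given by METHODS.
MEAN_RETURN = {
    "HalfCheetah-v3": [9832, 7202, 7456, 7442, 6438, 334, 6782],
    "Hopper-v3": [3412, 2993, 2722, 3079, 2114, 2068, 2671],
    "Walker2d-v3": [3886, 4013, 3605, 3464, 2379, 791, 1389],
    "Ant-v3": [5115, 4616, 4360, 4692, 1031, 13, 1739],
    "Humanoid-v3": [4963, 4742, 4613, 4843, 258, 394, 406],
    "Reacher-v2": [-4, -6, -7, -6, -10, -5, -30],
    "InvertedDouble-v2": [9203, 8320, 8319, 7082, 8749, 8506, 9359],
    "InvertedPendulum-v2": [1000, 980, 983, 987, 982, 977, 965],
}


def best_method(task: str) -> str:
    """Return the method with the highest mean return on `task`."""
    scores = MEAN_RETURN[task]
    return METHODS[scores.index(max(scores))]


ira_wins = [task for task in MEAN_RETURN if best_method(task) == "IRA"]
for task in MEAN_RETURN:
    print(f"{task}: {best_method(task)}")
```

By mean return, IRA leads on six of the eight tasks; ALH has the highest mean on Walker2d-v3 and MBPO on InvertedDouble-v2.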

Figure 2: Images for eight MuJoCo environments used in our experiments.

This review was generated by AI and checked by a human editor.