QUICK REVIEW

[Paper Review] FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

Jun Xue, Junze Wang|arXiv (Cornell University)|Mar 13, 2026

Reinforcement Learning in Robotics | Citations: 0

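ONE-LINE SUMMARY

FastDSAC makes maximum entropy reinforcement learning scalable to high-dimensional humanoid control by introducing dimension-wise entropy modulation and a continuous distributional critic, and it achieves strong performance on complex tasks.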

ABSTRACT

Scaling Maximum Entropy Reinforcement Learning (RL) to high-dimensional humanoid control remains a formidable challenge, as the "curse of dimensionality" induces severe exploration inefficiency and training instability in expansive action spaces. Consequently, recent high-throughput paradigms have largely converged on deterministic policy gradients combined with massive parallel simulation. We challenge this compromise with FastDSAC, a framework that effectively unlocks the potential of maximum entropy stochastic policies for complex continuous control. We introduce Dimension-wise Entropy Modulation (DEM) to dynamically redistribute the exploration budget and enforce diversity, alongside a continuous distributional critic tailored to ensure value fidelity and mitigate high-dimensional value overestimation. Extensive evaluations on HumanoidBench and other continuous control tasks demonstrate that rigorously designed stochastic policies can consistently match or outperform deterministic baselines, achieving notable gains of 180% and 400% on the challenging Basketball and Balance Hard tasks.

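MOTIVATION AND OBJECTIVES

Motivate scaling maximum entropy RL to high-dimensional humanoid control despite exploration inefficiency and value overestimation.
Introduce mechanisms for managing exploration and improving value fidelity in large action spaces.
Show that stochastic policies match or outperform deterministic baselines on complex, high-dimensional tasks.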

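PROPOSED METHOD

Proposes Dimension-wise Entropy Modulation (DEM), which redistributes the exploration budget across action dimensions using per-dimension weights derived from a softmax.
Adopts a continuous distributional critic modeled as a Gaussian to avoid discretization error and mitigate value overestimation.
Uses a Distributional Soft Policy Iteration (DSPI) loop that combines DEM-based exploration, continuous distributional learning, and entropy-regularized policy improvement.
Leverages large batches and massively parallel environments to stabilize training and support critic updates.
Tunes the temperature parameter alpha to meet a target entropy, strengthening exploration while keeping it under control.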
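
As a rough illustration of two of the ingredients listed above, the sketch below shows (1) a temperature-scaled softmax that turns per-dimension scores into DEM-style weights and (2) one step of the standard SAC-style temperature update toward a target entropy. This review does not say how the paper computes the per-dimension scores or how the weights enter the objective, so `scores`, `dem_weights`, `alpha_step`, and the learning rate are all hypothetical names and values chosen for illustration, not the authors' implementation.

```python
import math


def dem_weights(scores, tau):
    """Softmax over per-dimension scores, sharpened or flattened by tau.

    `scores` is a hypothetical per-action-dimension relevance signal;
    the review does not describe how the paper derives it.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [s / tau for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def alpha_step(log_alpha, log_prob, target_entropy, lr=0.01):
    """One gradient step on the SAC-style temperature loss
    J(alpha) = -alpha * (log_prob + target_entropy), taken in log space.

    Assumption: a single-sample estimate; real training averages over a batch.
    """
    alpha = math.exp(log_alpha)
    grad = -alpha * (log_prob + target_entropy)  # dJ / d(log_alpha)
    return log_alpha - lr * grad


# Illustrative scores for a 4-dimensional action space (made-up values).
scores = [2.0, 1.0, 0.1, 0.1]
sparse = dem_weights(scores, tau=0.1)   # low tau: mass concentrates on one dimension
flat = dem_weights(scores, tau=10.0)    # high tau: close to uniform

# Policy entropy estimate (-log_prob) below the target raises log_alpha; above it lowers it.
raised = alpha_step(0.0, log_prob=-1.0, target_entropy=2.0)
lowered = alpha_step(0.0, log_prob=-3.0, target_entropy=2.0)
```

In this toy setting a small tau yields a nearly one-hot weight vector and a large tau yields nearly uniform weights, which is the sparsity trade-off that RQ4 below asks about.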

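EXPERIMENTAL RESULTS

Research Questions

RQ1: Does FastDSAC outperform state-of-the-art deterministic, stochastic, on-policy, and model-based baselines on high-dimensional humanoid tasks?
RQ2: Is DEM necessary for interpretable, task-aligned exploration in high-dimensional action spaces?
RQ3: Does a continuous Gaussian distributional critic offer stability advantages over discrete critics (such as C51) in this setting?
RQ4: How does the DEM temperature tau affect exploration sparsity and performance across tasks?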

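Key Findings

FastDSAC matches or outperforms SOTA baselines on 39 tasks across HumanoidBench, MuJoCo Playground, and IsaacLab.
On the Basketball and Balance Hard tasks, it achieves gains of roughly 180% and 400%, respectively, over FastTD3.
DEM enables autonomous subspace pruning, concentrating exploration on task-relevant dimensions and suppressing noise on redundant actuators.
The continuous Gaussian distributional critic reduces quantization artifacts and mitigates value overestimation compared with discrete critics.
FastDSAC performs well on complex coordination and manipulation tasks and remains robust across simulators.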
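
To make the contrast between a discrete critic and a continuous Gaussian critic more concrete, the following sketch compares a C51-style categorical target with a Gaussian negative log-likelihood. It is only one plausible reading of the "quantization artifacts" mentioned above: a categorical critic must project each target onto a fixed grid of atoms and clip it to the support, whereas a Gaussian head can place its mean anywhere. The support range, atom count, and function names are assumptions for illustration and are not taken from the paper.

```python
import math


def project_scalar(target, v_min, v_max, n_atoms):
    """C51-style projection of a scalar target onto a fixed atom grid.

    The target is clipped to [v_min, v_max] and its mass is split between
    the two neighbouring atoms by linear interpolation.
    """
    delta = (v_max - v_min) / (n_atoms - 1)
    clipped = min(max(target, v_min), v_max)
    pos = (clipped - v_min) / delta
    lower = int(math.floor(pos))
    upper = min(int(math.ceil(pos)), n_atoms - 1)
    probs = [0.0] * n_atoms
    if lower == upper:
        probs[lower] = 1.0
    else:
        probs[lower] = upper - pos
        probs[upper] = pos - lower
    atoms = [v_min + i * delta for i in range(n_atoms)]
    return atoms, probs


def categorical_mean(atoms, probs):
    """Expected value of a categorical distribution over atoms."""
    return sum(a * p for a, p in zip(atoms, probs))


def gaussian_nll(target, mean, std):
    """Negative log-likelihood of `target` under N(mean, std^2)."""
    var = std * std
    return 0.5 * math.log(2.0 * math.pi * var) + (target - mean) ** 2 / (2.0 * var)


# Hypothetical support of 51 atoms on [-10, 10].
atoms, probs_in = project_scalar(3.3, -10.0, 10.0, 51)
_, probs_out = project_scalar(12.3, -10.0, 10.0, 51)

inside = categorical_mean(atoms, probs_in)    # target inside the support
outside = categorical_mean(atoms, probs_out)  # target above v_max gets clipped

# A Gaussian head has no grid: its loss is lowest when the mean sits on the target.
nll_at_target = gaussian_nll(12.3, mean=12.3, std=1.0)
nll_at_clip = gaussian_nll(12.3, mean=10.0, std=1.0)
```

A target inside the support keeps its mean but is smeared across two atoms, while a target outside the support is clipped to the boundary. The Gaussian loss has neither constraint, at the cost of assuming a unimodal return distribution.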

This review was generated by AI and checked by a human editor.