QUICK REVIEW

[Paper Review] Learning to select computations

Falk Lieder, Frederick Callaway|arXiv (Cornell University)|Jan 1, 2017

Reinforcement Learning in Robotics | References: 20 | Cited by: 2

One-Sentence Summary

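Building on the insight that the value of computation lies between the myopic value of computation and the value of perfect information, this paper proposes a sample-efficient reinforcement learning algorithm that learns to select computations by approximating rational metareasoning. On three metareasoning tasks (termination, action selection, and planning) it achieves near-optimal performance and significantly outperforms the meta-greedy heuristic. It performs on par with the blinkered policy in decision-making and outperforms a generalization of the blinkered policy in planning.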

ABSTRACT

Efficient use of limited computational resources is essential to intelligence. Selecting computations optimally according to rational metareasoning would achieve this, but rational metareasoning is computationally intractable. Inspired by psychology and neuroscience, we propose the first learning algorithm for approximating the optimal selection of computations. We derive a general, sample-efficient reinforcement learning algorithm for learning to select computations from the insight that the value of computation lies between the myopic value of computation and the value of perfect information. We evaluate the performance of our method against two state-of-the-art methods for approximate metareasoning--the meta-greedy heuristic and the blinkered policy--on three increasingly difficult metareasoning problems: metareasoning about when to terminate computation, metareasoning about how to choose between multiple actions, and metareasoning about planning. Across all three domains, our method achieved near-optimal performance and significantly outperformed the meta-greedy heuristic. The blinkered policy performed on par with our method in metareasoning about decision-making, but it is not directly applicable to metareasoning about planning where our method outperformed both the meta-greedy heuristic and a generalization of the blinkered policy. Our results are a step towards building self-improving AI systems that can learn to make optimal use of their limited computational resources to efficiently solve complex problems in real-time.

Motivation and Objectives

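Address the efficient allocation of limited computational resources in intelligent systems.
Overcome the computational intractability of rational metareasoning by learning an approximation to it.
Develop a general method that applies to diverse metareasoning problems: terminating computation, action selection, and planning.
Improve on the performance and range of applicability of existing approximate metareasoning methods, namely the meta-greedy heuristic and the blinkered policy.
Work toward self-improving AI systems that learn to make optimal use of their limited computational resources in real time.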

Proposed Method

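Uses reinforcement learning to learn a computation-selection policy from experience, with a value function that interpolates between the myopic value of computation and the value of perfect information.
Bounds the value of computation between the myopic value of computation (the immediate benefit of a single computation) and the value of perfect information, which is intended to keep learning stable.
Trains the sample-efficient reinforcement learning algorithm on simulated metareasoning tasks to approximate the optimal policy for selecting computations.
Applies the learned policy to three domains: when to terminate computation, which action to choose, and how to plan.
Uses function approximation to generalize across states and actions in complex decision spaces.
Trains end-to-end on experience gathered through interaction with the environment rather than relying on hand-crafted heuristics.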
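
The bounding idea in the list above can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: beliefs about action values are taken to be independent Gaussians, a computation is one noisy observation of a single action, and the function names (`myopic_voi`, `vpi`, `approx_voc`) and the single interpolation weight `w` are hypothetical. In a learned version, `w` would be fitted from experience; here it is set by hand.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for pdf/cdf


def myopic_voi(mu, sigma, i, obs_sigma):
    """Expected gain in the best posterior mean from ONE noisy observation
    of action i (closed form for independent Gaussian beliefs)."""
    best_other = max(m for j, m in enumerate(mu) if j != i)
    # Standard deviation of the posterior mean of action i after one observation.
    s = sigma[i] ** 2 / (sigma[i] ** 2 + obs_sigma ** 2) ** 0.5
    z = (mu[i] - best_other) / s
    # E[max(X, best_other)] for X ~ N(mu[i], s^2).
    expected_max = best_other + s * (_STD.pdf(z) + z * _STD.cdf(z))
    return expected_max - max(mu)


def vpi(mu, sigma, n_samples=20000, seed=0):
    """Value of perfect information about all actions, by Monte Carlo:
    E[max_i theta_i] - max_i mu_i, with theta_i ~ N(mu_i, sigma_i^2)."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    total = 0.0
    for _ in range(n_samples):
        total += max(rng.gauss(m, s) for m, s in zip(mu, sigma))
    return total / n_samples - max(mu)


def approx_voc(mu, sigma, i, obs_sigma, cost, w):
    """Hypothetical approximation: interpolate between the myopic value
    (lower bound) and the VPI (upper bound), then subtract the cost.
    The weight w in [0, 1] is an assumption, not the paper's parameterization."""
    lower = myopic_voi(mu, sigma, i, obs_sigma)
    upper = vpi(mu, sigma)
    return w * lower + (1.0 - w) * upper - cost
```

With `w = 1` the estimate reduces to a purely myopic (meta-greedy style) value, and with `w = 0` it assumes that one computation reveals everything. A computation would be selected only while the largest `approx_voc` across actions stays positive.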

Experimental Results

Research Questions

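RQ1: Can a policy learned by reinforcement learning approximate the optimal selection of computations better than existing heuristic methods?
RQ2: How does the proposed method perform across different metareasoning problems: termination, action selection, and planning?
RQ3: Does the method generalize to domains such as planning, where the blinkered policy is not directly applicable?
RQ4: How close to optimal is the algorithm in terms of computational efficiency and solution quality?
RQ5: Does a value-function formulation that spans the range between the myopic value of computation and the value of perfect information enable stable and effective learning?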

Key Findings

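The proposed method achieved near-optimal performance on all three metareasoning tasks: terminating computation, action selection, and planning.
It significantly outperformed the meta-greedy heuristic in all three domains.
In metareasoning about decision-making, it performed on par with the blinkered policy. The blinkered policy is not directly applicable to planning, whereas the proposed method is.
In the planning task, the proposed method outperformed both the meta-greedy heuristic and a generalization of the blinkered policy, indicating broader applicability.
The sample-efficient reinforcement learning framework learned stably from limited experience, a step toward systems that allocate their computation in real time.
The value-function formulation balanced the myopic value of computation against the value of perfect information, supporting robust policy learning.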


This review was generated by AI and checked by a human editor.