QUICK REVIEW

[論文レビュー] ThermoAct:Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

Young-Chae Son, Dae-Kwan Ko|arXiv (Cornell University)|Mar 26, 2026

Multimodal Machine Learning Applications被引用数 0

ひとこと要約

ThermoActは熱カメラをVision-Language-Actionフレームワークに統合し、温度を考慮したタスクを分解する高度なVision-Language Modelプランナーと、行動を実行するVLAエグゼクターを用いて、熱対応シナリオでの安全性とタスク成功率を向上させる。

ABSTRACT

In recent human-robot collaboration environments, there is a growing focus on integrating diverse sensor data beyond visual information to enable safer and more intelligent task execution. Although thermal data can be crucial for enhancing robot safety and operational efficiency, its integration has been relatively overlooked in prior research. This paper proposes a novel Vision-Language-Action (VLA) framework that incorporates thermal information for robot task execution. The proposed system leverages a Vision-Language Model (VLM) as a high-level planner to interpret complex natural language commands and decompose them into simpler sub-tasks. This approach facilitates efficient data collection and robust reasoning for complex operations. Unlike conventional methods that rely solely on visual data, our approach integrates thermal information, enabling the robot to perceive physical properties and proactively ensure environmental safety. Experimental results from real-world task scenarios validate the feasibility of our proposed framework, suggesting its potential to enhance task success rates and safety compared to existing vision-based systems.

研究の動機と目的

熱透視データをVLAシステムに組み込むことで、より安全で温度を考慮したロボットタスク実行を動機づける。
Vision-Language Model (VLM)プランナーが熱入力で推論し、タスクを分解する階層的フレームワークを開発する。
限られたデモンストレーションから学習するVLAエグゼクターを通じて、熱認識的操作をデータ効率良く強化する。
現実世界のシナリオで熱情報がタスク成功、安定性、安全性に与える影響を評価する。

提案手法

VLMプランナー（高レベル推論）とVLAエグゼクター（低レベル制御）を備えたThermoActアーキテクチャを提案する。
VLMプランナーとVLAエグゼクターへの入力としてRGBと熱画像を10 Hz制御で融合する。
熱データ(256x192)を8ビットグレースケールに変換し、INFERNOパレットにマップして学習の知覚符号化を向上させる。
限られたデータで対応できるよう、各タスク50デモンストレーションでLoRAベースのファインチューニングを用いてVLAエグゼクターを訓練する。
実世界実験には2つのRGB-Dカメラと1つの熱カメラを備えた7自由度Kinova Gen3 Liteロボットを使用する。
階層的VLMベースプランニング（ThermoAct）と平坦なエンドツーエンドVLAベースラインを対比してデータ効率と性能を評価する。

Figure 1: We propose ThermoAct . (a) illustrates a VLM Planner that decomposes a high-level user instruction into specific sub-task descriptions . (b) depicts a VLA Executor that receives these descriptions as input prompts to predict low-level actions. By leveraging temperature cues from thermal im

実験結果

リサーチクエスチョン

RQ1熱に気づくVLMプランナーは、実世界の設定でVLAエグゼクターのための効果的なサブタスクへタスクを分解できるか。
RQ2熱情報を取り入れることで、RGBのみのベースラインと比較してデータが制約された条件下でのタスク成功と安全性が向上するか。
RQ3階層的プランニングアプローチは、熱対応操作のデータ効率とロバスト性においてエンドツーエンド学習より優れているか。
RQ4熱の手掛かりだけを用いる場合の限界は何か、モダリティ融合が性能と深度知覚にどう影響するか。

主な発見

Task Scenario	Sub-task	RGB-RGB	Ours (RGB-T)
Task 1: Bring warm water and an apple	pick up warm water from floor	-	40
Task 1: Bring warm water and an apple	place warm water to right side of empty plate	-	100
Task 1: Bring warm water and an apple	pick up an apple from fruit plate	-	70
Task 1: Bring warm water and an apple	place an apple on the empty plate	-	80
Task 2: Give me a cold Coke	pick up coke from floor	-	70
Task 2: Give me a cold Coke	place ice cup to right side of empty plate	-	90
Task 2: Give me a cold Coke	press button on ice maker	-	60
Task 2: Give me a cold Coke	pick up ice cup from ice maker	-	60
Task 2: Give me a cold Coke	place ice cup to right side of empty plate	-	60
Task 3: Select the Appropriate Cup for Each Object	pick up the scoop from floor	-	100
Task 3: Select the Appropriate Cup for Each Object	pour scoop into the coke/hot water	-	40
Task 4: pick up overheated battery from conveyor belt	pick up overheated battery	80	30
Task 5: Organize space near power strip	turn off hair straightener	-	30
Task 5: Organize space near power strip	pick up unplugged wire from floor	-	70
Task 5: Organize space near power strip	place unplugged wire to power strip	-	70

RGB-T入力を用いるThermoActは、熱関連のサブタスクでRGBのみのベースラインより性能を向上させ、限られた熱データでもデータ効率の利得を示す。
Task 1–5において、RGB-TはRGB-RGBよりサブタスク特異的な成功率が高く、温度依存タスク（例：温水、過熱したバッテリー、ヘアアイロンの電源を切る）で顕著な利得を示す。
ファインチューニングエピソードが30、50、70の場合、ThermoActの全体精度は熱タスクで約50–86%に安定し、データが増えるほど学習が改善される一方でRGBのみモデルと競合する。
階層的なVLMプランナーとVLAエグゼクターはロングホライズンのタスク実行を堅牢に可能にし、エンドツーエンド学習が苦戦する平坦なVLAアプローチを上回る（多くのケースで平坦なVLAはほぼゼロの成功率）。
熱情報は安全志向の意思決定を高め（例：熱い物体や危険な状態の認識）、移動するバッテリーのような動的シナリオにも一般化するが、深度知覚と視野の制約は依然課題である。

Figure 2: Hierarchical Collaboration between VLM Planner and VLA Executor. (a) The VLM Planner receives RGB-Thermal images and a structured guideline prompt containing role definitions and output examples. (b) Based on the thermal information, the VLM analyzes the environment context and decomposes

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。