QUICK REVIEW

[Paper Review] ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

Young-Chae Son, Dae-Kwan Ko | arXiv (Cornell University) | 2026-03-26

Multimodal Machine Learning Applications | Citations: 0

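One-Line Summary

ThermoAct integrates a thermal camera into a Vision-Language-Action (VLA) framework. A high-level vision-language model (VLM) planner decomposes tasks with temperature awareness and a VLA executor carries them out, improving safety and task success in thermally driven scenarios.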

ABSTRACT

In recent human-robot collaboration environments, there is a growing focus on integrating diverse sensor data beyond visual information to enable safer and more intelligent task execution. Although thermal data can be crucial for enhancing robot safety and operational efficiency, its integration has been relatively overlooked in prior research. This paper proposes a novel Vision-Language-Action (VLA) framework that incorporates thermal information for robot task execution. The proposed system leverages a Vision-Language Model (VLM) as a high-level planner to interpret complex natural language commands and decompose them into simpler sub-tasks. This approach facilitates efficient data collection and robust reasoning for complex operations. Unlike conventional methods that rely solely on visual data, our approach integrates thermal information, enabling the robot to perceive physical properties and proactively ensure environmental safety. Experimental results from real-world task scenarios validate the feasibility of our proposed framework, suggesting its potential to enhance task success rates and safety compared to existing vision-based systems.

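Research Motivation and Goals

Integrate thermal sensing data into VLA systems to make robot task execution safer and temperature-aware.
Develop a framework in which a Vision-Language Model (VLM) planner uses thermal input to decompose tasks hierarchically.
Enable robust, data-efficient learning of thermal-aware manipulation with a VLA executor trained on a limited number of demonstrations.
Evaluate how thermal information affects task success, stability, and safety in real-world scenarios.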

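Proposed Method

Propose the ThermoAct architecture: a VLM planner for high-level reasoning and a VLA executor for low-level control.
Feed fused RGB and thermal images to both the VLM planner and the VLA executor, with control running at 10 Hz.
Convert the thermal data (256x192) to 8-bit grayscale and map it to the INFERNO palette to improve the perceptual encoding used for training.
Train the VLA executor with LoRA-based fine-tuning on 50 demonstrations, reflecting the limited-data setting.
Run real-robot experiments on a 7-DoF Kinova Gen3 Lite robot equipped with two RGB-D cameras and one thermal camera.
Compare hierarchical VLM-based planning (ThermoAct) against a flat end-to-end VLA baseline to evaluate data efficiency and performance.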
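
The thermal preprocessing step listed above (a 256x192 frame converted to 8-bit grayscale, then mapped to the INFERNO palette) might look roughly like the sketch below. It is not the authors' code. The per-frame min-max normalization, the function names, and the ten-anchor approximation of the Inferno palette are all assumptions made for illustration, and the sketch sticks to the standard library so it runs anywhere. A real pipeline would more likely use a library colormap such as Matplotlib's `inferno` or OpenCV's `COLORMAP_INFERNO`.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]

# Ten anchor colors sampled along the Inferno palette. This is a coarse,
# hand-copied approximation (assumption), not the full 256-entry colormap.
INFERNO_ANCHORS: List[RGB] = [
    (0, 0, 4), (27, 12, 65), (74, 12, 107), (120, 28, 109), (165, 44, 96),
    (207, 68, 70), (237, 105, 37), (251, 155, 6), (247, 209, 61), (252, 255, 164),
]


def to_gray8(frame: Sequence[Sequence[float]]) -> List[List[int]]:
    """Min-max normalize raw thermal readings to 8-bit grayscale (0..255).

    Per-frame min-max scaling is an assumption; the source does not say
    how the 8-bit conversion is performed.
    """
    lo = min(min(row) for row in frame)
    hi = max(max(row) for row in frame)
    span = (hi - lo) or 1.0  # avoid division by zero on a uniform frame
    return [[round(255 * (v - lo) / span) for v in row] for row in frame]


def build_lut(anchors: Sequence[RGB]) -> List[RGB]:
    """Linearly interpolate the anchor colors into a 256-entry lookup table."""
    segments = len(anchors) - 1
    lut: List[RGB] = []
    for g in range(256):
        pos = g / 255 * segments          # position along the anchor list
        i = min(int(pos), segments - 1)   # index of the left anchor
        t = pos - i                       # blend factor within the segment
        a, b = anchors[i], anchors[i + 1]
        lut.append(tuple(round(a[c] + t * (b[c] - a[c])) for c in range(3)))
    return lut


def colorize(gray: Sequence[Sequence[int]], lut: Sequence[RGB]) -> List[List[RGB]]:
    """Replace every 8-bit gray level with its palette color."""
    return [[lut[g] for g in row] for row in gray]


if __name__ == "__main__":
    # Synthetic frame: 192 rows x 256 columns (assuming 256x192 means width x height).
    frame = [[20.0 + 0.1 * x + 0.05 * y for x in range(256)] for y in range(192)]
    image = colorize(to_gray8(frame), build_lut(INFERNO_ANCHORS))
    print(len(image), len(image[0]), image[0][0], image[-1][-1])
```

Mapping through a lookup table keeps the color step a simple per-pixel index, so the palette can be swapped without touching the normalization.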

Figure 1: We propose ThermoAct. (a) illustrates a VLM Planner that decomposes a high-level user instruction into specific sub-task descriptions. (b) depicts a VLA Executor that receives these descriptions as input prompts to predict low-level actions. By leveraging temperature cues from thermal im
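
The planner/executor split in Figure 1 can be pictured as a simple control loop: the planner turns one instruction into sub-task prompts, and the executor is queried at a fixed rate for each prompt. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation. `Observation`, `run_task`, the `is_done` completion check, the step budget, and the abort-on-timeout policy are all hypothetical. Only the 10 Hz default is taken from the method description.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Observation:
    """One synchronized camera reading (placeholder types, hypothetical)."""
    rgb: Any
    thermal: Any


def run_task(
    instruction: str,
    planner: Callable[[str, Observation], List[str]],
    executor: Callable[[str, Observation], Sequence[float]],
    get_obs: Callable[[], Observation],
    send_action: Callable[[Sequence[float]], None],
    is_done: Callable[[str, Observation], bool],
    control_hz: float = 10.0,
    max_steps: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Plan once, then run each sub-task in a fixed-rate control loop.

    Returns the sub-tasks that finished; stops at the first one that
    exhausts its step budget (an assumed policy).
    """
    completed: List[str] = []
    for subtask in planner(instruction, get_obs()):
        for _ in range(max_steps):
            obs = get_obs()
            if is_done(subtask, obs):
                completed.append(subtask)
                break
            # The executor sees the sub-task prompt plus the latest images.
            send_action(executor(subtask, obs))
            sleep(1.0 / control_hz)
        else:
            # Step budget exhausted without completion: abandon the rest.
            break
    return completed
```

Passing the planner, executor, and robot I/O as callables keeps the loop testable with stubs. In a real system those callables would wrap the VLM, the fine-tuned VLA policy, and the robot driver.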

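Experimental Results

Research Questions

RQ1: Can a thermal-aware VLM planner decompose complex tasks into sub-tasks that a VLA executor can carry out effectively in real environments?
RQ2: Does adding thermal information improve task success and safety over an RGB-only baseline, even under data constraints?
RQ3: Is hierarchical planning more data-efficient and robust than end-to-end learning for thermal-based manipulation?
RQ4: What are the limitations of relying on thermal signals alone, and how does modality fusion affect performance and depth perception?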

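Key Results

RGB-T input improves performance on thermal-related sub-tasks relative to the RGB-only baseline and yields data-efficiency gains even with limited thermal data.
Across Tasks 1–5, RGB-T achieves higher sub-task success rates than RGB-RGB, with clear gains on thermally dependent tasks (e.g., warm water, an overheated battery, turning off a hair straightener).
With 30, 50, and 70 fine-tuning episodes, ThermoAct's overall accuracy on thermal tasks settles at roughly 50–86%, improving with more data while staying competitive with the RGB-only model.
The hierarchical VLM planner and VLA executor enable robust long-horizon task execution and outperform the flat VLA approach where end-to-end learning is difficult (in many cases the flat VLA almost never succeeds).
Thermal information strengthens safety-oriented decision-making (e.g., recognizing hot objects and hazardous states) and generalizes to dynamic scenarios such as a moving battery, but limits in depth perception and field of view remain open challenges.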

Figure 2: Hierarchical Collaboration between VLM Planner and VLA Executor. (a) The VLM Planner receives RGB-Thermal images and a structured guideline prompt containing role definitions and output examples. (b) Based on the thermal information, the VLM analyzes the environment context and decomposes

This review was generated by AI and reviewed by a human editor.