QUICK REVIEW

[Paper Review] HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task

Xiaoya Lu, Yijin Zhou|arXiv (Cornell University)|Mar 15, 2026

Multimodal Machine Learning Applications | Citations: 0

One-line summary

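HomeGuard strengthens CG-CoT grounding with visual anchors and two-stage training (SFT followed by RFT), improving the safety of VLMs in household tasks and enabling risk identification with evidence-grounded hazard localization.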

ABSTRACT

Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate. Rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception that sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy utilizing Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model HomeGuard significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safety trajectory generation. Code and data are released under https://github.com/AI45Lab/HomeGuard

Motivation and objectives

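Motivate the need for robust safety in embodied VLMs operating in cluttered homes, where contextual risks arise from subtle scene states.
Propose Context-Guided Chain-of-Thought (CG-CoT), which grounds risk assessment in interaction targets and background constraints.
Build the HomeSafe dataset, which includes fine-grained grounding annotations and counterfactual safe pairs, to train and evaluate grounding-based safeguards.
Develop a two-stage training pipeline (SFT and RFT) with process rewards that enforce accurate intermediate grounding.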

Proposed method

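Introduce CG-CoT, which separates perception (grounding targets and constraints) from semantic risk judgment.
Define bounding boxes for interaction targets and constraint regions to ground the safety reasoning.
Follow a four-step reasoning process: instruction intent screening, interaction target inspection, environmental constraint analysis, and integrated risk assessment.
Train with LoRA-based SFT to encode the safety reasoning structure, then apply GRPO-based RFT with process and outcome rewards to improve grounding accuracy.
Use supervised and reinforcement learning on a dataset of 10,257 unsafe and 5,710 safe samples with dense annotations and counterfactual safe pairs.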
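
The combination of process and outcome rewards described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `(x1, y1, x2, y2)` box format, the function names `box_iou` and `total_reward`, the use of the mean of target and constraint IoU as the process reward, and the equal weighting. The review does not specify the paper's actual reward formulation.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def total_reward(
    pred_target: Box,
    gt_target: Box,
    pred_constraint: Box,
    gt_constraint: Box,
    pred_unsafe: bool,
    gt_unsafe: bool,
    process_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Blend a grounding (process) reward with a verdict (outcome) reward."""
    # Process reward: how well the intermediate boxes match the annotations.
    process = 0.5 * (
        box_iou(pred_target, gt_target)
        + box_iou(pred_constraint, gt_constraint)
    )
    # Outcome reward: whether the final safe/unsafe verdict is correct.
    outcome = 1.0 if pred_unsafe == gt_unsafe else 0.0
    return process_weight * process + (1.0 - process_weight) * outcome
```

Under this sketch, a rollout that reaches the right verdict while grounding the wrong objects earns only the outcome share of the reward, which is the intuition behind rewarding intermediate grounding rather than the final answer alone.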

Figure 1 : Identifying implicit contextual risks via Context-Guided Chain-of-Thought.

Experimental results

Research questions

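RQ1: Does combining CG-CoT with explicit visual grounding improve risk identification accuracy in VLM-based embodied agents?
RQ2: Does two-stage SFT+RFT training with process rewards improve intermediate grounding and curb oversafety?
RQ3: Does HomeGuard generalize to external safety benchmarks and public risk identification datasets?
RQ4: Do visual anchors provide actionable spatial constraints that improve downstream planning and safe trajectory generation?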

Key findings

Model	RIR (%)	RMR (%)	T-IoU	C-IoU	OR (%)
HomeGuard-4B	90.00	63.72	0.6709	0.4873	21.96
HomeGuard-8B	90.98	74.90	0.7206	0.5562	13.14
Qwen3-VL-4B-Thinking	66.67	33.14	0.5902	0.2212	29.31
Qwen3-VL-8B-Thinking	67.58	33.20	0.5732	0.2694	32.62
RoboBrain2.5-8B	56.15	24.00	0.4961	0.1245	41.18
Qwen3-VL-235B-Thinking	77.17	45.08	0.6964	0.4124	34.69

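HomeGuard-8B achieves an RIR of 90.98% and an RMR of 74.90% on HomeSafe-Bench, outperforming the baselines.
HomeGuard-8B reaches a T-IoU of 0.7206 and a C-IoU of 0.5562 with an OR of 13.14%, indicating strong hazard grounding and low oversafety.
Two-stage training with visual process rewards is important: removing the IoU reward lowers RIR and RMR and increases oversafety.
HomeGuard generalizes strongly to four public benchmarks, achieving the top results among non-commercial models (e.g., RIR 94.73% and RMR 72.12% on EARBench).
Integrating the visual anchors enables safe planning, raising the safety success rate on IS-Bench from 61.36% to 73.91%.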
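
As a quick check on the size of the gains, the short script below computes percentage-point differences from the table above. It assumes that the same-size Qwen3-VL-Thinking rows are the relevant comparison for each HomeGuard model; the review does not state which base model each HomeGuard variant was trained from, so that pairing is an assumption, as are the names `rows`, `pairs`, and `gains`.

```python
# Values copied from the results table (RIR, RMR, and OR are percentages).
rows = {
    "HomeGuard-4B": {"RIR": 90.00, "RMR": 63.72, "OR": 21.96},
    "HomeGuard-8B": {"RIR": 90.98, "RMR": 74.90, "OR": 13.14},
    "Qwen3-VL-4B-Thinking": {"RIR": 66.67, "RMR": 33.14, "OR": 29.31},
    "Qwen3-VL-8B-Thinking": {"RIR": 67.58, "RMR": 33.20, "OR": 32.62},
}

# Assumed pairing: each HomeGuard model against the same-size baseline row.
pairs = [
    ("HomeGuard-4B", "Qwen3-VL-4B-Thinking"),
    ("HomeGuard-8B", "Qwen3-VL-8B-Thinking"),
]


def gains(model: str, baseline: str) -> dict:
    """Percentage-point difference (model minus baseline) per metric."""
    return {
        metric: round(rows[model][metric] - rows[baseline][metric], 2)
        for metric in ("RIR", "RMR", "OR")
    }


for model, baseline in pairs:
    print(f"{model} vs {baseline}: {gains(model, baseline)}")
```

Under this pairing, RMR rises by 30.58 points at 4B and 41.70 points at 8B, consistent with the abstract's claim of an improvement of over 30%, while OR falls by 7.35 and 19.48 points respectively.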

Figure 2 : The two-stage training pipeline and a visualization of the sequential reasoning process for identifying risks in household tasks.

This review was generated by AI and checked by a human editor.