QUICK REVIEW

[Paper Review] InspecSafe-V1: A Multimodal Benchmark for Safety Assessment in Industrial Inspection Scenarios

Zeyi Liu, Shuang Liu|arXiv (Cornell University)|Jan 29, 2026

Infrastructure Maintenance and Monitoring|Citations: 0

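One-Line Summary

InspecSafe-V1 is the first multimodal benchmark for safety assessment in real-world industrial inspection. It provides seven synchronized sensing modalities, pixel-level object annotations, scene descriptions, and safety level labels across five industrial scenarios.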

ABSTRACT

With the rapid development of industrial intelligence and unmanned inspection, reliable perception and safety assessment for AI systems in complex and dynamic industrial sites has become a key bottleneck for deploying predictive maintenance and autonomous inspection. Most public datasets remain limited by simulated data sources, single-modality sensing, or the absence of fine-grained object-level annotations, which prevents robust scene understanding and multimodal safety reasoning for industrial foundation models. To address these limitations, InspecSafe-V1 is released as the first multimodal benchmark dataset for industrial inspection safety assessment that is collected from routine operations of real inspection robots in real-world environments. InspecSafe-V1 covers five representative industrial scenarios, including tunnels, power facilities, sintering equipment, oil and gas petrochemical plants, and coal conveyor trestles. The dataset is constructed from 41 wheeled and rail-mounted inspection robots operating at 2,239 valid inspection sites, yielding 5,013 inspection instances. For each instance, pixel-level segmentation annotations are provided for key objects in visible-spectrum images. In addition, a semantic scene description and a corresponding safety level label are provided according to practical inspection tasks. Seven synchronized sensing modalities are further included, including infrared video, audio, depth point clouds, radar point clouds, gas measurements, temperature, and humidity, to support multimodal anomaly recognition, cross-modal fusion, and comprehensive safety assessment in industrial environments.

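Motivation and Objectives

Address the need for high-quality multimodal industrial safety data that goes beyond RGB-only and simulated datasets.
Provide a real-world multimodal benchmark that captures operational disturbances and promotes robust safety reasoning.
Enable evaluation of vision-language models for scene understanding and safety assessment at industrial sites.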

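Proposed Method

Collect real-world multimodal data from 41 wheeled and rail-mounted inspection robots across five industrial scenarios.
Provide pixel-level instance segmentation on RGB frames, plus scene-level language annotations that include a safety level.
Synchronize and align RGB with the seven additional modalities (thermal/infrared, depth, radar, audio, gas, temperature, humidity) at the inspection-site level.
Annotate 234 object categories under a structured taxonomy and assign a safety level to each scene using standardized criteria.
Evaluate vision-language models by having them generate a scene description and a discrete safety level from RGB frames, using a fixed prompt and rule-based parsing of the output.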
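
The fixed-prompt, rule-based parsing step in the last item could look roughly like the sketch below. This is only an illustration under stated assumptions: the prompt wording, the "Safety level: N" output format, the integer level scale, and the names FIXED_PROMPT and parse_response are all hypothetical and are not taken from the paper.

```python
import re
from typing import Optional, Tuple

# Hypothetical fixed prompt; the paper's actual wording is not given in this review.
FIXED_PROMPT = (
    "Describe the scene in one paragraph, then end with a line of the form "
    "'Safety level: <integer>'."
)

# Assumed output convention: the model states its verdict as "Safety level: N".
_LEVEL_RE = re.compile(r"safety\s+level\s*[:=]\s*(\d+)", re.IGNORECASE)


def parse_response(text: str) -> Tuple[str, Optional[int]]:
    """Split a model response into (scene description, safety level).

    The level is None when no "Safety level: N" pattern is found, so the
    caller can decide how to score unparseable outputs.
    """
    matches = list(_LEVEL_RE.finditer(text))
    if not matches:
        return text.strip(), None
    last = matches[-1]  # use the final verdict if the model repeats itself
    description = (text[:last.start()] + text[last.end():]).strip()
    return description, int(last.group(1))
```

Returning None instead of a default level keeps parsing failures visible, so they are not silently counted as a "safe" or "unsafe" prediction.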

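Experimental Results

Research Questions

RQ1: How well can general-purpose vision-language models predict safety levels in complex industrial scenes when given multimodal input?
RQ2: How do perceptual robustness, reasoning ability, and safety-judgment accuracy relate to one another in industrial environments?
RQ3: How do model size and reasoning-oriented architectures affect safety assessment performance and false positive rate on InspecSafe-V1?
RQ4: What are the common failure modes of VLM-based safety assessment in noisy industrial environments (e.g., scene misclassification, missed hazards)?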

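Key Findings

The benchmark enables evaluation of scene descriptions and safety levels generated from RGB frames, quantified with accuracy and semantic-similarity metrics.
Reasoning-oriented models outperform instruction-only variants, with notably higher safety accuracy and fewer false positives.
Model performance does not depend strictly on parameter count; perceptual robustness and reasoning consistency drive the results.
False positives tend to arise from lighting, reflections, and occlusion, and scene misclassification can cascade into errors in the safety judgment.
Rail-mounted platforms account for most of the data volume, which exposes cross-platform generalization challenges and a long-tail distribution.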
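
The two headline quantities in these findings, safety accuracy and false positive rate, can be computed along the lines of the sketch below. It assumes integer safety levels in which 0 means "safe" and any other level counts as a raised alarm. That convention, the SAFE_LEVEL constant, and the safety_metrics function are illustrative assumptions, not the paper's definitions.

```python
from typing import Dict, Optional, Sequence

SAFE_LEVEL = 0  # assumption: level 0 denotes a scene with no hazard


def safety_metrics(
    y_true: Sequence[int], y_pred: Sequence[Optional[int]]
) -> Dict[str, float]:
    """Exact-match accuracy and false positive rate for safety levels.

    A false positive is a truly safe scene predicted as any non-safe level.
    Unparseable predictions (None) count as incorrect but not as alarms.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    safe_total = sum(t == SAFE_LEVEL for t, _ in pairs)
    false_pos = sum(
        t == SAFE_LEVEL and p is not None and p != SAFE_LEVEL for t, p in pairs
    )
    # The rate is undefined when the evaluation set has no safe scenes.
    fpr = false_pos / safe_total if safe_total else float("nan")
    return {"accuracy": correct / len(pairs), "false_positive_rate": fpr}
```

For example, safety_metrics([0, 0, 1, 2], [0, 1, 1, None]) scores two of four predictions as correct and one of the two safe scenes as a false alarm.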

This review was generated by AI and checked by a human editor.