QUICK REVIEW

[Paper Review] Pedestrian Crossing Intent Prediction via Psychological Features and Transformer Fusion

Sima Ashayer, Hoang Dai Nghia Nguyen|arXiv (Cornell University)|Mar 20, 2026

Autonomous Vehicle Technology and Safety | Citations: 0

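ONE-SENTENCE SUMMARY

The paper proposes a lightweight, socially informed multi-stream Transformer that predicts pedestrian crossing intent by fusing psychological, positional, situational, and interaction cues, with uncertainty heads (KL divergence and Mahalanobis distance) for calibrated risk assessment. It achieves strong results on PSI 1.0 and establishes a baseline on PSI 2.0.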

ABSTRACT

Pedestrian intention prediction needs to be accurate for autonomous vehicles to navigate safely in urban environments. We present a lightweight, socially informed architecture for pedestrian intention prediction. It fuses four behavioral streams (attention, position, situation, and interaction) using highway encoders, a compact 4-token Transformer, and global self-attention pooling. To quantify uncertainty, we incorporate two complementary heads: a variational bottleneck whose KL divergence captures epistemic uncertainty, and a Mahalanobis distance detector that identifies distributional shift. Together, these components yield calibrated probabilities and actionable risk scores without compromising efficiency. On the PSI 1.0 benchmark, our model outperforms recent vision language models by achieving 0.9 F1, 0.94 AUC-ROC, and 0.78 MCC by using only structured, interpretable features. On the more diverse PSI 2.0 dataset, where, to the best of our knowledge, no prior results exist, we establish a strong initial baseline of 0.78 F1 and 0.79 AUC-ROC. Selective prediction based on Mahalanobis scores increases test accuracy by up to 0.4 percentage points at 80% coverage. Qualitative attention heatmaps further show how the model shifts its cross-stream focus under ambiguity. The proposed approach is modality-agnostic, easy to integrate with vision language pipelines, and suitable for risk-aware intent prediction on resource-constrained platforms.

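MOTIVATION AND OBJECTIVES

Enable autonomous vehicles to drive safely around pedestrians who intend to cross.
Use psychologically grounded cross-modal cues to improve prediction accuracy and interpretability.
Develop a lightweight, real-time architecture suited to resource-constrained platforms.
Provide calibrated probabilities and uncertainty measures that support risk-aware decision-making.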

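PROPOSED METHOD

Fuses four feature streams (psychological attention, position, situation, and interaction), each passed through a highway encoder.
A compact 4-token Transformer with global self-attention pooling produces a unified cross-stream representation.
A residual MLP classifier outputs the crossing probability, alongside two uncertainty heads: a variational head with a KL term against a prior, and a Mahalanobis distance detector.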
Calibrates probabilities with MixUp and label smoothing during training, together with a learnable temperature.
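Uses no video backbone at inference, keeping the architecture small (roughly 1–2M parameters) and interpretable.
Includes qualitative attention heatmaps that visualize cross-stream reasoning under uncertainty.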
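
The fusion steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the embedding size `D = 8`, the random weights, the single attention head, and every function name are hypothetical, and the sketch omits the layer normalization, feed-forward sublayers, residual MLP classifier, and training procedure that a full model would have. It shows one highway encoder per stream, scaled dot-product self-attention over the four stream tokens, attention pooling, and a temperature-scaled sigmoid.

```python
import math
import random

STREAMS = ["attention", "position", "situation", "interaction"]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def rand_matrix(rng, rows, cols, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def highway(x, W_h, W_g):
    """Highway layer: gate * transform(x) + (1 - gate) * x."""
    h = [max(0.0, v) for v in matvec(W_h, x)]               # ReLU transform
    g = [1.0 / (1.0 + math.exp(-v)) for v in matvec(W_g, x)]  # sigmoid gate
    return [gi * hi + (1.0 - gi) * xi for gi, hi, xi in zip(g, h, x)]

def self_attention(tokens, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over the stream tokens."""
    d = len(tokens[0])
    Q = [matvec(W_q, t) for t in tokens]
    K = [matvec(W_k, t) for t in tokens]
    V = [matvec(W_v, t) for t in tokens]
    weights = [softmax([dot(q, k) / math.sqrt(d) for k in K]) for q in Q]
    out = [[sum(w * v[j] for w, v in zip(row, V)) for j in range(d)] for row in weights]
    return out, weights  # weights is the 4x4 cross-stream attention map

def attention_pool(tokens, w_pool):
    """Collapse the tokens into one vector with learned softmax weights."""
    scores = softmax([dot(t, w_pool) for t in tokens])
    d = len(tokens[0])
    pooled = [sum(s * t[j] for s, t in zip(scores, tokens)) for j in range(d)]
    return pooled, scores

def crossing_probability(pooled, w_out, temperature=1.0):
    """Sigmoid of the logit divided by a (learnable) temperature."""
    return 1.0 / (1.0 + math.exp(-dot(pooled, w_out) / temperature))

rng = random.Random(0)
D = 8  # hypothetical embedding size
features = [[rng.uniform(-1, 1) for _ in range(D)] for _ in STREAMS]  # stand-in inputs
tokens = [highway(x, rand_matrix(rng, D, D), rand_matrix(rng, D, D)) for x in features]
fused, cross_stream = self_attention(
    tokens, rand_matrix(rng, D, D), rand_matrix(rng, D, D), rand_matrix(rng, D, D)
)
pooled, pool_weights = attention_pool(fused, [rng.uniform(-1, 1) for _ in range(D)])
p_cross = crossing_probability(pooled, [rng.uniform(-1, 1) for _ in range(D)], temperature=1.5)
```

Each row of `cross_stream` sums to 1 and indicates how strongly one stream attends to the others. A matrix of this kind is what an attention heatmap would display, although the weights here are random.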

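EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a lightweight multi-stream Transformer effectively fuse psychological, spatial, situational, and interaction cues to predict pedestrian crossing intent?
RQ2: How do the uncertainty signals (KL divergence and Mahalanobis distance) connect prediction reliability to risk-aware abstention?
RQ3: How much does each feature stream contribute to cross-modal fusion and predictive accuracy?
RQ4: Is the method robust and easy to calibrate on the PSI 1.0 and PSI 2.0 datasets?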

Key Findings

Method	Accuracy	F1	AUC-ROC	MCC
SF-GRU	0.788	0.719	0.752	0.452
SingleRNN	0.782	0.714	0.734	0.440
MultiRNN	0.658	0.611	0.666	0.229
PCPA	0.682	0.584	0.611	0.176
ARN	0.688	0.618	0.671	0.237
PSI	0.759		-	0.374
TrEP	0.830	0.771		-
ClipCross	0.830	0.795	0.855	0.591
Ours (PSI 1.0)	0.88 ± 0.01	0.90 ± 0.01	0.94 ± 0.01	0.78 ± 0.02
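
Three of the four metrics in the table (accuracy, F1, and MCC) follow directly from a binary confusion matrix, and a short sketch may help when comparing the columns. The counts below are hypothetical and are not taken from the paper. AUC-ROC is left out because it depends on ranked scores rather than a single set of thresholded predictions.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1, and Matthews correlation coefficient from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=50, fp=10, fn=5, tn=35)
```

Because F1 ignores true negatives while MCC does not, a model can score a high F1 and a noticeably lower MCC on the same predictions.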

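On PSI 1.0, the proposed model achieves 0.88 accuracy, 0.90 F1, 0.94 AUC-ROC, and 0.78 MCC (mean across seeds).
On PSI 2.0, it establishes a baseline of 0.78 F1 and 0.79 AUC-ROC (mean across seeds).
Ablations show that the situation stream is the most influential, with a marked drop when it is removed; the attention and position streams add complementary gains, while the interaction stream has a small or neutral effect.
Selective prediction with class-conditional Mahalanobis scores raises accuracy at fixed coverage: retaining roughly 80% of samples improves results on both PSI 1.0 and PSI 2.0 (the abstract reports gains of up to 0.4 percentage points at 80% coverage).
The two uncertainty heads (KL and Mahalanobis) provide complementary signals for anomaly detection and calibrated probabilities, improving reliability and interpretability.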
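
The two uncertainty signals and the selective-prediction step described above can be illustrated with a small standard-library sketch. Several simplifications are assumptions rather than details from the paper: the KL term uses the closed form for a diagonal Gaussian against a standard normal prior, the Mahalanobis distance uses a per-class diagonal covariance instead of a full covariance matrix, and all function names and data points are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def fit_class_stats(embeddings, labels, eps=1e-6):
    """Per-class mean and per-dimension variance (diagonal covariance assumption)."""
    stats = {}
    for c in set(labels):
        rows = [e for e, y in zip(embeddings, labels) if y == c]
        dim = len(rows[0])
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
        var = [sum((r[j] - mean[j]) ** 2 for r in rows) / len(rows) + eps for j in range(dim)]
        stats[c] = (mean, var)
    return stats

def mahalanobis_score(x, stats):
    """Distance to the nearest class centroid; larger means more atypical."""
    return min(
        math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))
        for mean, var in stats.values()
    )

def select_by_coverage(scores, coverage=0.8):
    """Keep the fraction `coverage` of samples with the lowest scores."""
    keep = max(1, int(round(coverage * len(scores))))
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return sorted(order[:keep])

# Hypothetical 2-D training embeddings for two classes (not-crossing = 0, crossing = 1).
train_x = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [6, 5], [5, 6], [6, 6]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
stats = fit_class_stats(train_x, train_y)

test_x = [[0.5, 0.5], [5.5, 5.5], [3, 3], [0.4, 0.6], [10, 10]]
scores = [mahalanobis_score(x, stats) for x in test_x]
kept = select_by_coverage(scores, coverage=0.8)  # abstain on the most atypical 20%
```

With these toy points, the sample at `[10, 10]` lies farthest from both class centroids, so it is the one dropped at 80% coverage. Accuracy would then be computed only over the indices in `kept`.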

This review was generated by AI and checked by a human editor.