QUICK REVIEW

[Paper Review] M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving

Dongyang Xu, Haokun Li|arXiv (Cornell University)|Mar 19, 2024

Autonomous Vehicle Technology and Safety | Citations: 6

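One-Line Summary

M2DA introduces LVAFusion, which fuses the heterogeneous camera and LiDAR modalities together with a driver (gaze) attention mechanism. Evaluated in CARLA, it achieves state-of-the-art driving performance while using less training data.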

ABSTRACT

End-to-end autonomous driving has witnessed remarkable progress. However, the extensive deployment of autonomous vehicles has yet to be realized, primarily due to 1) inefficient multi-modal environment perception: how to integrate data from multi-modal sensors more efficiently; 2) non-human-like scene understanding: how to effectively locate and predict critical risky agents in traffic scenarios like an experienced driver. To overcome these challenges, in this paper, we propose a Multi-Modal fusion transformer incorporating Driver Attention (M2DA) for autonomous driving. To better fuse multi-modal data and achieve higher alignment between different modalities, a novel Lidar-Vision-Attention-based Fusion (LVAFusion) module is proposed. By incorporating driver attention, we empower the human-like scene understanding ability to autonomous vehicles to identify crucial areas within complex scenarios precisely and ensure safety. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance with less data in closed-loop benchmarks. Source codes are available at https://anonymous.4open.science/r/M2DA-4772.

Research Motivation and Objectives

Address inefficient multi-modal environment perception in end-to-end autonomous driving.
Incorporate driver attention to enable human-like scene understanding.
Develop LVAFusion to improve cross-modal alignment between images and LiDAR.
Predict ego-vehicle waypoints and auxiliary perception states with a transformer.
Validate performance on CARLA Town05 Long and Longest6 benchmarks.

Proposed Method

Propose LVAFusion, a cross-attention based fusion module that uses global and local features with positional encoding and view/sensor embeddings.
Incorporate a driver attention prediction module to generate a gaze-based mask that modulates image features.
Use two cross-attention stages (point-cloud then image) to fuse LiDAR and multi-view images into a unified token sequence.
Process fused features with a transformer encoder and a decoder that uses waypoint, perception, and traffic state queries.
Autoregressively predict ego-waypoints with GRU-based increments, and auxiliary perception maps and traffic states.
Train end-to-end via imitation learning on a rule-based expert dataset, with L1 losses for waypoints and auxiliary losses for perception and traffic states.
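The two-stage fusion and the gaze mask described in the list above can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it assumes single-head scaled dot-product attention without learned projections, and it assumes the gaze mask acts as a per-token weight in [0, 1] that scales image features. The names `modulate_by_gaze`, `cross_attention`, and all toy values are hypothetical.

```python
import math

def modulate_by_gaze(image_tokens, gaze_weights):
    """Scale each image token by its gaze weight.

    Assumption: the predicted driver-attention mask is reduced to one
    weight in [0, 1] per token; the paper's exact modulation may differ.
    """
    return [[w * x for x in token] for token, w in zip(image_tokens, gaze_weights)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy inputs (hypothetical values, 2-dimensional tokens).
queries = [[1.0, 1.0]]                                   # one fusion query
lidar_tokens = [[1.0, 0.0], [0.0, 1.0]]                  # point-cloud features
image_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]     # image features
gaze_weights = [1.0, 0.2, 0.8]                           # per-token gaze mask

# Stage 1: queries attend to the point-cloud tokens.
stage1 = cross_attention(queries, lidar_tokens, lidar_tokens)

# Stage 2: the stage-1 output attends to gaze-modulated image tokens.
attended_image = modulate_by_gaze(image_tokens, gaze_weights)
fused = cross_attention(stage1, attended_image, attended_image)

print(fused)
```

In this sketch a low gaze weight shrinks a token's features, so that token contributes less to the fused output. A full model would add learned query, key, and value projections, multiple heads, residual connections, and the positional and view/sensor embeddings mentioned above.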

Research Questions

Can LVAFusion improve alignment and interaction modeling between LiDAR and camera modalities compared to prior fusion approaches?
Does incorporating driver attention improve end-to-end autonomous driving performance in complex urban/adversarial scenarios?
How does M2DA perform on CARLA Town05 Long and Longest6 benchmarks relative to state-of-the-art methods?
What is the data efficiency of M2DA when trained on smaller datasets?

Key Findings

M2DA achieves state-of-the-art driving performance on Town05 Long with DS 72.6±5.7 and IS 0.80±0.05, using 200K training frames.
LVAFusion with prior-informed cross-attention improves multi-modal alignment over random-query baselines.
Incorporating driver attention reduces infractions and improves overall driving score compared to camera-only or non-attentive baselines.
M2DA outperforms Transfuser and Roach across key metrics on Town05 Long, and surpasses several larger data-driven models.
Ablation studies show that adding three-camera inputs and LiDAR with driver attention yields the best results (3C1A1L).
M2DA demonstrates data efficiency by outperforming several 2–3M-frame baselines with only 200K frames.

Benchmark Comparison

DS: driving score; RC: route completion; IS: infraction score. Higher is better for all three.

Method       | Fusion                  | Modality | Extra supervision  | Dataset size | DS ↑     | RC ↑     | IS ↑
CILRS        | ResNet + Flatten        | C1       | None               | -            | 7.8±0.3  | 10.3±0.0 | 0.75±0.05
LBC          | ResNet + Flatten        | C3       | Expert             | 157K         | 12.3±2.0 | 31.9±2.2 | 0.66±0.02
Transfuser   | Fusion via Transformer  | C3L1     | Dep+Seg+Map+Box    | 228K         | 31.0±3.6 | 47.5±5.3 | 0.77±0.04
Roach        | ResNet + Flatten        | C1       | Expert             | -            | 41.6±1.8 | 96.4±2.1 | 0.43±0.03
LAV          | PointPainting           | C4L1     | Expert+Seg+Map+Box | 189K         | 46.5±2.3 | 69.8±2.3 | 0.73±0.02
TCP          | ResNet + Flatten        | C1       | Expert             | 189K         | 57.2±1.5 | 80.4±1.5 | 0.73±0.02
MILE         | ResNet + Flatten        | C1       | Map+Box            | 2.9M         | 61.1±3.2 | 97.4±0.8 | 0.63±0.03
Interfuser   | Fusion via Transformer  | C3L1     | Box                | 3M           | 68.3±1.9 | 95.0±2.9 | -
ThinkTwice   | Geometric Fusion in BEV | C4L1     | Expert+Dep+Seg+Map | 2M           | 70.9±3.4 | 95.5±2.6 | 0.75±0.05
DriveAdapter | Geometric Fusion in BEV | C4L1     | Expert+Seg+Map     | 2M           | 71.9     | 97.3     | 0.74
M2DA (ours)  | LVAFusion               | C3L1     | Box                | 200K         | 72.6±5.7 | 89.7±7.8 | 0.80±0.05
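
The data-efficiency claim can be checked directly against the comparison rows above. The following is a small illustrative sketch that copies the reported mean driving scores and training-set sizes (in frames) from those rows. The tuple layout and variable names are hypothetical, and the reported standard deviations are ignored here.

```python
# (method, training frames or None if not reported, mean driving score)
rows = [
    ("CILRS", None, 7.8),
    ("LBC", 157_000, 12.3),
    ("Transfuser", 228_000, 31.0),
    ("Roach", None, 41.6),
    ("LAV", 189_000, 46.5),
    ("TCP", 189_000, 57.2),
    ("MILE", 2_900_000, 61.1),
    ("Interfuser", 3_000_000, 68.3),
    ("ThinkTwice", 2_000_000, 70.9),
    ("DriveAdapter", 2_000_000, 71.9),
    ("M2DA", 200_000, 72.6),
]

# Method with the highest mean driving score.
best = max(rows, key=lambda r: r[2])

# Baselines trained on at least 2M frames.
m2da_frames = next(frames for name, frames, _ in rows if name == "M2DA")
large = [(n, f, ds) for n, f, ds in rows if f is not None and f >= 2_000_000]

print(f"Highest mean DS: {best[0]} ({best[2]})")
for name, frames, ds in large:
    ratio = m2da_frames / frames
    print(f"{name}: DS {ds}; M2DA uses {ratio:.1%} as many frames")
```

Comparing means only is a simplification: M2DA's reported spread (±5.7) is larger than its margin over DriveAdapter and ThinkTwice, so the ranking at the top of the list should be read with that variance in mind.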


This review was generated by AI and checked by a human editor.