QUICK REVIEW

[Paper Review] When the City Teaches the Car: Label-Free 3D Perception from Infrastructure

Zhen Xu, Jinsu Yoo|arXiv (Cornell University)|Mar 17, 2026
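Advanced Neural Network Applications | Citations: 0
One-Line Summary

This paper proposes infrastructure-taught 3D perception, in which stationary roadside units (RSUs) learn from unlabeled data and supply pseudo-labels for training an ego vehicle detector offline. In a CARLA-based study on the CIVET dataset, the pipeline reaches 82.3% AP on vehicles, against a fully supervised upper bound of 94.4%.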

ABSTRACT

Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.

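Motivation and Objectives

  • Reduce the annotation cost of 3D perception across multiple cities by using stationary RSUs as unsupervised teachers.
  • Develop a fully label-free three-stage pipeline in which RSUs learn from unlabeled data and broadcast their predictions as pseudo-labels, which are then used to train the ego detector offline.
  • Systematically examine feasibility, scalability, and complementarity with ego-centric label-free methods in a simulated multi-town environment.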

Proposed Method

  • Stage 1: Unsupervised RSU training where each RSU learns a location-specialized detector using temporal consistency and persistence-based pseudo-labels.
  • Stage 2: RSUs broadcast predictions to passing ego vehicles; ego aggregates these into pseudo-labels with distance-weighted NMS and simple class matching.
  • Stage 3: Ego detector training offline using aggregated infrastructure-derived pseudo-labels to yield a standalone ego model at test time.
  • Evaluation: CenterPoint and PointPillars detectors are scored with BEV AP, and the impact of communication noise and pseudo-label refinement is analyzed.
  • Dataset: CIVET is built on CARLA and V2XVerse, with 4 towns and 12 RSUs per town, to study geo-specific supervision and scalability.
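The Stage 2 aggregation step can be illustrated with a minimal sketch. The review names distance-weighted NMS and simple class matching but gives no formula, so everything below is an assumption made for illustration: the `Box` fields, the exponential distance decay with a hypothetical `scale` parameter, the IoU threshold, and the use of axis-aligned BEV IoU (yaw is ignored). It is not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    x: float         # BEV center x in a shared world frame (m)
    y: float         # BEV center y (m)
    w: float         # extent along x (m)
    l: float         # extent along y (m)
    score: float     # RSU detector confidence in [0, 1]
    cls: str         # predicted class name
    rsu_dist: float  # distance from the reporting RSU to the box (m)


def bev_iou(a: Box, b: Box) -> float:
    """Axis-aligned BEV IoU; yaw is ignored to keep the sketch short."""
    ix = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    iy = min(a.y + a.l / 2, b.y + b.l / 2) - max(a.y - a.l / 2, b.y - b.l / 2)
    if ix <= 0 or iy <= 0:
        return 0.0
    inter = ix * iy
    return inter / (a.w * a.l + b.w * b.l - inter)


def weighted_score(box: Box, scale: float = 50.0) -> float:
    """Assumed weighting: confidence decays with distance from the RSU."""
    return box.score * math.exp(-box.rsu_dist / scale)


def aggregate(boxes, iou_thr: float = 0.5, scale: float = 50.0):
    """Greedy NMS over distance-weighted scores, suppressing within a class only."""
    ranked = sorted(boxes, key=lambda b: weighted_score(b, scale), reverse=True)
    kept = []
    for cand in ranked:
        # Keep the candidate unless a same-class kept box overlaps it too much.
        if all(k.cls != cand.cls or bev_iou(k, cand) < iou_thr for k in kept):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    # Two RSUs report the same car; a pedestrian stands next to it.
    reports = [
        Box(10.0, 5.0, 2.0, 4.5, 0.9, "vehicle", rsu_dist=60.0),
        Box(10.2, 5.1, 2.0, 4.5, 0.8, "vehicle", rsu_dist=10.0),
        Box(10.0, 5.0, 0.6, 0.6, 0.7, "pedestrian", rsu_dist=20.0),
    ]
    for box in aggregate(reports):
        print(box.cls, box.rsu_dist)
```

In this toy case the report from the nearer RSU wins over the more confident but more distant one, and the pedestrian survives because suppression is applied only between boxes of the same class.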
Figure 1 : Can city infrastructure teach vehicles to perceive? We explore a new paradigm where roadside infrastructure acts as distributed teachers, providing supervision to train ego perception models without manual annotations.

Experimental Results

Research Questions

  • RQ1: Can stationary RSUs learn reliable, label-free detectors from unlabeled observations?
  • RQ2: Can RSU-generated pseudo-labels train a competitive ego detector that operates without infrastructure at test time?
  • RQ3: How do factors like RSU quantity, placement, and communication noise affect downstream ego performance?
  • RQ4: Do infrastructure-generated pseudo-labels complement ego-centric unsupervised methods and enable cross-town generalization?

Key Findings

  • Fully label-free pipeline yields 82.3% AP for vehicles within a town, approaching a 94.4% supervised ego upper bound.
  • Across four towns, training with aggregated RSU supervision reaches 82.7% AP, against a supervised upper bound of 91.0%.
  • Tracking and unsupervised RSU training improve pseudo-label quality and ego performance; communication noise degrades localization, especially for pedestrians.
  • Auxiliary refinement (box refinement) improves pseudo-label quality and ego AP under noisy conditions.
  • Combining infrastructure pseudo-labels with ego-centric methods (e.g., Oyster) yields additional performance gains.
  • RSU detectors are location-specific and do not generalize directly to other RSU viewpoints, motivating a distributed teacher ensemble.
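To make the remaining gap to the supervised upper bound concrete, the short calculation below uses only the four AP figures reported in the findings above. The `gap_summary` helper and the setting labels are hypothetical names introduced here for illustration.

```python
# AP figures as reported in the review: (label-free pipeline, supervised upper bound).
results = {
    "single town": (82.3, 94.4),
    "four towns": (82.7, 91.0),
}


def gap_summary(label_free_ap: float, supervised_ap: float):
    """Return (absolute gap in AP points, label-free AP as a % of the bound)."""
    gap = supervised_ap - label_free_ap
    ratio = label_free_ap / supervised_ap
    return round(gap, 1), round(100 * ratio, 1)


for setting, (label_free, supervised) in results.items():
    gap, pct = gap_summary(label_free, supervised)
    print(f"{setting}: gap {gap} AP points, {pct}% of the supervised bound")
# single town: gap 12.1 AP points, 87.2% of the supervised bound
# four towns: gap 8.3 AP points, 90.9% of the supervised bound
```

By this reading, the label-free pipeline recovers roughly 87% to 91% of the supervised vehicle AP in the reported settings.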
Figure 2 : Overview of infrastructure-taught, label-free 3D perception. Stage 1: each RSU learns a location-specialized detector in an unsupervised manner by exploiting temporal consistency from its stationary viewpoint. Stage 2: trained RSUs broadcast their predicted 3D bounding boxes to nearby ego

This review was generated by AI and checked by a human editor.