QUICK REVIEW

[Paper Review] CoFL: Continuous Flow Fields for Language-Conditioned Navigation

Haokun Liu, Zhaoqi MA|arXiv (Cornell University)|2026. 03. 03.

Multimodal Machine Learning Applications|Citations: 0

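One-Line Summary

CoFL presents an end-to-end policy that maps a BEV image and a language instruction to a continuous flow field; integrating the field into a trajectory yields smooth, obstacle-aware navigation in real time. It generalizes strongly to unseen scenes and transfers zero-shot to the real world.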

ABSTRACT

Language-conditioned navigation pipelines often rely on brittle modular components or costly action-sequence generation. To address these limitations, we present CoFL, an end-to-end policy that directly maps a bird's-eye view (BEV) observation and a language instruction to a continuous flow field for navigation. Instead of predicting discrete action tokens or sampling action chunks via iterative denoising, CoFL outputs instantaneous velocities that can be queried at arbitrary 2D projected locations. Trajectories are obtained by numerical integration of the predicted field, producing smooth motion that remains reactive under closed-loop execution. To enable large-scale training, we build a dataset of over 500k BEV image-instruction pairs, each procedurally annotated with a flow field and a trajectory derived from BEV semantic maps built on Matterport3D and ScanNet. By training on a mixed distribution, CoFL significantly outperforms modular Vision-Language Model (VLM)-based planners and generative policy baselines on strictly unseen scenes. Finally, we deploy CoFL zero-shot in real-world experiments with overhead BEV observations across multiple layouts, maintaining reliable closed-loop control and a high success rate.

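Motivation and Goals

Motivates language-conditioned navigation that avoids brittle modular pipelines and discrete action tokens.
Proposes a vision-language transformer that predicts a language-conditioned 2D flow field over BEV space.
Trains on a large-scale set of BEV image-instruction pairs annotated procedurally with flow fields and trajectories.
Numerically integrates the predicted field to enable real-time closed-loop navigation.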

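Proposed Method

Predicts a language-conditioned flow field v(x|I, l) over BEV space from the BEV observation I and the language instruction l, using a transformer-based encoder-decoder.
Queries the flow field at multiple 2D coordinates to produce velocity vectors, which together form a continuous field.
Represents velocity as a non-negative magnitude M(x) and a unit direction D(x), computed as V(x)=M(x)·D(x).
Trains on sampled query points, supervised against the target field V* with a direction cosine loss and a magnitude loss.
Infers trajectories by forward Euler integration over a dense grid, using inverse time rescaling to maintain a nearly constant speed.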
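
The magnitude-direction decomposition and the two loss terms listed above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the softplus activation, the L1 magnitude loss, the weight `w_mag`, and the function names `compose_velocity` and `flow_loss` are all assumptions introduced here.

```python
import numpy as np

def compose_velocity(mag_raw, dir_raw, eps=1e-8):
    """Turn raw head outputs into M(x), D(x), and V(x) = M(x) * D(x).

    mag_raw: (N,) unconstrained scalars, dir_raw: (N, 2) unconstrained vectors.
    """
    # Softplus keeps the magnitude non-negative (assumed activation).
    mag = np.logaddexp(0.0, mag_raw)
    # Normalize so that each direction has unit length.
    norm = np.linalg.norm(dir_raw, axis=-1, keepdims=True)
    direction = dir_raw / np.maximum(norm, eps)
    return mag, direction, mag[..., None] * direction

def flow_loss(mag, direction, v_target, w_mag=1.0, eps=1e-8):
    """Direction cosine loss plus magnitude loss against a target field V*."""
    t_mag = np.linalg.norm(v_target, axis=-1)
    t_dir = v_target / np.maximum(t_mag[..., None], eps)
    # 1 - cos(angle) is 0 when directions agree and 2 when they are opposite.
    cos = np.sum(direction * t_dir, axis=-1)
    dir_loss = np.mean(1.0 - cos)
    # L1 distance between magnitudes (assumed form of the magnitude loss).
    mag_loss = np.mean(np.abs(mag - t_mag))
    return dir_loss + w_mag * mag_loss
```

Query points whose target velocity is exactly zero have no defined direction, so a practical version would probably mask them out of the direction term; the sketch omits that step.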

Figure 2: Overview of CoFL's network architecture. Given an RGB BEV observation $I$ and a language instruction $\ell$, a SigLIP 2-based [39, 37] vision–language encoder produces language-conditioned context tokens over the BEV image. The decoder then queries this context with 2D normalized s

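Experimental Results

Research Questions

RQ1: Can a single end-to-end model learn a geometry-aware continuous flow field for language-conditioned navigation from BEV observations?
RQ2: Does explicit flow field supervision improve safety (collision avoidance) and path quality on unseen scenes compared with modular or generative baselines?
RQ3: Can the model transfer to real-world closed-loop navigation without fine-tuning?
RQ4: How do grid resolution and inference budget affect navigation performance and safety?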

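Key Results

CoFL substantially reduces the collision rate (CR) relative to modular VLM planners and diffusion policy baselines while keeping the final goal error (FGE) comparable.
On Matterport3D, CoFL outperforms the baselines with FGE ~0.13–0.15, CR ~0.17–0.22, and Curv ~0.08–0.14.
On ScanNet, CoFL improves FGE to ~0.07–0.09 and achieves CR ~0.35–0.40, with a much smaller head (~15M parameters).
Flow field predictions are locally accurate (AE/ME) and enable obstacle-aware rollouts from any location, closing the geometric gap that arises in trajectory-only methods.
Without fine-tuning, real-world deployment shows robust closed-loop control across multiple layouts with static and dynamic obstacles, at a per-step latency of about 28 ms.
Zero-shot real-world navigation performs reliably, with an on-target rate of 85% without obstacles and 100% with obstacles, while maintaining safe clearance.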

Figure 3: Overview of the trajectory inference. The predicted flow field over the workspace guides agents from different starts toward the same goal while smoothly avoiding obstacles.
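
Trajectory inference by forward Euler integration can be sketched as follows. This is a toy under stated assumptions, not CoFL itself: `toy_field` is a hypothetical hand-written field (goal attraction plus repulsion from one disc) standing in for the learned network, the field is queried directly rather than on a dense grid, and dividing each step by the local speed is one reading of the review's "inverse time rescaling".

```python
import numpy as np

def toy_field(x, goal=np.array([0.8, 0.8]),
              obstacle=np.array([0.5, 0.45]), radius=0.15):
    """Hypothetical stand-in for the learned field V(x)."""
    v = (goal - x).copy()                      # attraction toward the goal
    away = x - obstacle
    dist = np.linalg.norm(away)
    if dist < 2 * radius:                      # repulsion inside an influence zone
        gain = 1.5 * (2 * radius - dist) / radius
        v += gain * away / max(dist, 1e-8)
    return v

def integrate(field, start, goal, speed=1.0, dt=0.01, max_steps=2000, tol=0.02):
    """Forward Euler rollout with every step rescaled to length speed * dt."""
    x = np.asarray(start, dtype=float)
    path = [x.copy()]
    for _ in range(max_steps):
        if np.linalg.norm(goal - x) < tol:     # close enough to the goal
            break
        v = field(x)
        n = np.linalg.norm(v)
        if n < 1e-8:                           # stationary point: stop
            break
        # Dividing by |v| rescales time so the agent moves at constant speed.
        x = x + dt * speed * v / n
        path.append(x.copy())
    return np.array(path)

goal = np.array([0.8, 0.8])
path = integrate(toy_field, [0.1, 0.1], goal)
```

Because every step has the same length, the rollout speed does not depend on how strong the field is locally; under closed-loop execution the same loop would simply be restarted from the agent's latest observed position.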


This review was generated by AI and reviewed by a human editor.