QUICK REVIEW

[Paper Review] Attention-based Context Aggregation Network for Monocular Depth Estimation

Yuru Chen, Haitao Zhao|arXiv (Cornell University)|2019. 01. 29.

Advanced Vision and Imaging|References: 50|Citations: 24

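One-Line Summary

This paper proposes the attention-based context aggregation network (ACAN), which uses self-attention to adaptively model long-range pixel-level context and image-level context, addressing the limitations of atrous spatial pyramid pooling (ASPP) with fixed dilation rates. The approach reduces the grid artifacts that fixed dilation rates cause, and it introduces soft ordinal inference to minimize discretization error. ACAN achieves state-of-the-art performance on the NYU Depth V2 and KITTI benchmarks, with an RMSE of 3.599 on KITTI when using ResNet-101.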

ABSTRACT

Depth estimation is a traditional computer vision task, which plays a crucial role in understanding 3D scene geometry. Recently, deep-convolutional-neural-networks based methods have achieved promising results in the monocular depth estimation field. Specifically, the framework that combines the multi-scale features extracted by the dilated convolution based block (atrous spatial pyramid pooling, ASPP) has gained the significant improvement in the dense labeling task. However, the discretized and predefined dilation rates cannot capture the continuous context information that differs in diverse scenes and easily introduce the grid artifacts in depth estimation. In this paper, we propose an attention-based context aggregation network (ACAN) to tackle these difficulties. Based on the self-attention model, ACAN adaptively learns the task-specific similarities between pixels to model the context information. First, we recast the monocular depth estimation as a dense labeling multi-class classification problem. Then we propose a soft ordinal inference to transform the predicted probabilities to continuous depth values, which can reduce the discretization error (about 1% decrease in RMSE). Second, the proposed ACAN aggregates both the image-level and pixel-level context information for depth estimation, where the former expresses the statistical characteristic of the whole image and the latter extracts the long-range spatial dependencies for each pixel. Third, for further reducing the inconsistency between the RGB image and depth map, we construct an attention loss to minimize their information entropy. We evaluate on public monocular depth-estimation benchmark datasets (including NYU Depth V2, KITTI). The experiments demonstrate the superiority of our proposed ACAN and achieve the competitive results with the state of the arts.

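Research Motivation and Goals

Address the limitations of atrous spatial pyramid pooling (ASPP) with fixed dilation rates, which introduces grid artifacts and cannot capture continuous context that varies from scene to scene.
Improve depth estimation by using a self-attention mechanism to model both long-range pixel-level dependencies and image-level statistical context.
Reduce the discretization error of depth prediction by recasting the task as a classification problem with soft ordinal inference.
Propose an attention-based entropy-minimization loss to improve the consistency between the RGB image and the predicted depth map.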

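Proposed Method

Recast depth estimation as a dense-labeling multi-class classification problem, which makes it possible to learn ordinal probabilities.
Introduce soft ordinal inference to convert the predicted probabilities into continuous depth values, reducing discretization error (about a 1% decrease in RMSE).
Use a self-attention module in the decoder to learn task-specific similarities between pixels and long-range spatial dependencies.
Add an image-level pooling module that complements pixel-level attention by extracting the global statistical context of the whole image.
Use a residual encoder (ResNet) with dilated convolutions to preserve spatial resolution and avoid excessive downsampling.
Propose an attention-based loss that minimizes the information entropy between the RGB features and the predicted depth map, improving their consistency.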
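
The review does not give the paper's exact soft ordinal inference formula. As a minimal sketch of the general idea, the Python example below compares a hard decision (the center of the most probable depth bin) with a soft one (the probability-weighted average of the bin centers). The log-spaced discretization, the expectation-based formula, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bin_centers(d_min, d_max, num_bins):
    """Centers of log-spaced depth bins (hypothetical discretization)."""
    edges = np.exp(np.linspace(np.log(d_min), np.log(d_max), num_bins + 1))
    return 0.5 * (edges[:-1] + edges[1:])

def softmax(logits, axis=-1):
    """Numerically stable softmax over the bin axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def hard_depth(probs, centers):
    """Hard inference: depth is the center of the most probable bin."""
    return centers[np.argmax(probs, axis=-1)]

def soft_depth(probs, centers):
    """Soft inference (assumed form): probability-weighted mean of bin centers."""
    return (probs * centers).sum(axis=-1)

# One pixel whose probability mass is split between two neighboring bins.
centers = bin_centers(1.0, 80.0, 8)
logits = np.array([[0.0, 0.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0]])
probs = softmax(logits)

hard = hard_depth(probs, centers)  # snaps to a single bin center
soft = soft_depth(probs, centers)  # can take values between bin centers
print("hard:", hard, "soft:", soft)
```

The hard estimate can only take one of the bin-center values, whereas the soft estimate varies continuously with the predicted probabilities, which is the intuition behind reducing discretization error.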

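Experimental Results

Research Questions

RQ1: Can a self-attention mechanism effectively model continuous, scene-dependent context in monocular depth estimation and outperform methods based on fixed dilation rates, such as ASPP?
RQ2: How much does soft ordinal inference reduce the discretization error of depth prediction compared with standard regression or hard classification?
RQ3: To what extent does integrating pixel-level and image-level context improve depth estimation accuracy?
RQ4: Does an attention-based loss that minimizes the entropy between RGB and depth features contribute to better feature consistency and prediction quality?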

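Key Findings

With ResNet-101, ACAN achieves an RMSE of 3.599 on the KITTI dataset, outperforming all compared state-of-the-art methods.
Soft ordinal inference reduces discretization error, lowering RMSE by about 1% and yielding more continuous depth values.
Qualitatively, ACAN produces sharper boundaries and finer depth maps than methods such as ASPP, which suffer from grid artifacts.
The attention-based loss substantially improves the consistency between RGB and depth features, reducing noise and inconsistencies in the predictions.
On NYU Depth V2, ACAN performs strongly, showing improved generalization and detail preservation even in complex scenes.
The ablation study confirms that both pixel-level and image-level context aggregation contribute to the final performance gain.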
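
To make the two kinds of context concrete, here is a minimal NumPy sketch. It assumes scaled dot-product self-attention for the pixel-level branch and global average pooling for the image-level branch, fused by channel concatenation. The review does not specify the paper's similarity function, projection sizes, or fusion step, so these choices and all names below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def pixel_context(features, w_q, w_k, w_v):
    """Pixel-level context: every pixel attends to every other pixel.

    features: (H, W, C) feature map. Returns the (H, W, C_v) context map
    and the (H*W, H*W) attention matrix.
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c)               # flatten the spatial grid
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # linear projections
    scores = q @ k.T / np.sqrt(q.shape[-1])      # pairwise similarities (assumed form)
    attn = softmax(scores, axis=-1)              # each row sums to 1
    return (attn @ v).reshape(h, w, -1), attn

def image_context(features):
    """Image-level context: global average pooling, copied to every pixel."""
    pooled = features.mean(axis=(0, 1), keepdims=True)
    return np.broadcast_to(pooled, features.shape)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 5, 8))        # toy 4x5 map with 8 channels
w_q = rng.standard_normal((8, 4))
w_k = rng.standard_normal((8, 4))
w_v = rng.standard_normal((8, 8))

pixel_ctx, attn = pixel_context(features, w_q, w_k, w_v)
image_ctx = image_context(features)

# Fuse the original features with both context branches (assumed fusion).
aggregated = np.concatenate([features, pixel_ctx, image_ctx], axis=-1)
print(aggregated.shape)
```

Unlike a dilated convolution, whose sampling pattern is fixed in advance, the attention weights here depend on the input features, so the spatial support of each pixel's context can differ from image to image.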


This review was generated by AI and reviewed by a human editor.