QUICK REVIEW

[Paper Review] HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation

Xiaoyang Lyu, Liang Liu | arXiv (Cornell University) | 2020-12-14

Advanced Vision and Imaging | References: 26 | Citations: 24

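One-Line Summary

HR-Depth proposes a new self-supervised monocular depth estimation network that uses redesigned skip connections to narrow the semantic-spatial gap and introduces a parameter-efficient feature fusion Squeeze-and-Excitation (fSE) module to improve high-resolution depth prediction. It achieves state-of-the-art performance on the KITTI dataset with markedly fewer parameters, and a lightweight version with only 3.1M parameters matches the accuracy of Monodepth2 at high resolution.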

ABSTRACT

Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people try to use the high-resolution image for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from the inaccurate depth estimation in large gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) re-design the skip-connection in DepthNet to get better high-resolution features and (2) propose feature fusion Squeeze-and-Excitation (fSE) module to fuse feature more efficiently. Using Resnet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the least parameters at both high and low resolution. Moreover, previous state-of-the-art methods are based on fairly complex and deep networks with a mass of parameters which limits their real applications. Thus we also construct a lightweight network which uses MobileNetV3 as encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high-resolution with only 20% parameters. All codes and models will be available at https://github.com/shawLyu/HR-Depth.

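Motivation and Objectives

To address the poor performance of high-resolution monocular depth estimation, particularly at object boundaries.
To narrow the semantic-spatial gap between encoder and decoder features in U-Net-based networks.
To improve the efficiency and accuracy of feature fusion without increasing model complexity.
To design a lightweight network that keeps high performance with a minimal parameter count, making it suitable for real-world deployment.
To demonstrate that accurate boundary prediction is key to improving high-resolution depth estimation.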

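Proposed Method

Redesigned the skip connections in DepthNet to enable multi-scale feature fusion between encoder and decoder, reducing the semantic gap.
Proposed the feature fusion Squeeze-and-Excitation (fSE) block, which improves feature integration while reducing the parameter count.
Used ResNet-18 as the backbone encoder for high-resolution (1024×320) inference, improving edge sharpness.
Built a lightweight version with MobileNetV3 as the encoder that achieves strong performance with only 3.1M parameters.
Applied knowledge distillation with a teacher network (Monodepth2) to guide training of the lightweight model.
Trained the network in a self-supervised manner using geometric constraints from monocular image sequences, with no ground-truth depth required.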
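
The fusion step named above can be illustrated with a small NumPy sketch. It assumes the fSE block follows the usual Squeeze-and-Excitation recipe (global average pooling, two fully connected layers, sigmoid gating) applied to concatenated skip features, followed by a 1×1 convolution that reduces the channel count. The function name fse_fuse, the weight shapes, the reduction ratio r, and the omission of bias terms are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def fse_fuse(features, w1, w2, w_out):
    """Fuse same-resolution feature maps with SE-style channel gating.

    features: list of arrays shaped (C_i, H, W)  (hypothetical layout)
    w1:       (C // r, C) weights of the first fully connected layer
    w2:       (C, C // r) weights of the second fully connected layer
    w_out:    (C_out, C) weights of the 1x1 convolution
    """
    x = np.concatenate(features, axis=0)            # (C, H, W)
    squeezed = x.mean(axis=(1, 2))                  # squeeze: one value per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)         # FC + ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # FC + sigmoid, values in (0, 1)
    reweighted = x * gates[:, None, None]           # excite: rescale each channel
    # A 1x1 convolution is a linear map over the channel axis at every pixel.
    fused = np.einsum("oc,chw->ohw", w_out, reweighted)
    return fused, gates


# Toy inputs: one encoder map and one upsampled decoder map (assumed sizes).
rng = np.random.default_rng(0)
enc = rng.standard_normal((4, 8, 8))
dec = rng.standard_normal((4, 8, 8))
C, r, C_out = 8, 4, 4
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
w_out = rng.standard_normal((C_out, C)) * 0.1

fused, gates = fse_fuse([enc, dec], w1, w2, w_out)
print(fused.shape, gates.shape)
```

In this sketch the gating layers cost 2·C²/r weights and the 1×1 convolution costs C_out·C, which is far cheaper than a 3×3 convolution over the same concatenated input (9·C_out·C). That is one plausible reading of why such a block helps contain the parameter growth from dense skip connections.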

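Experimental Results

Research Questions

RQ1: Why does increasing the resolution fail to improve depth estimation accuracy in existing self-supervised methods?
RQ2: How can semantic and spatial information be fused more effectively in a high-resolution depth estimation network?
RQ3: Can a lightweight network perform comparably to large models in high-resolution depth estimation?
RQ4: Which architectural components contribute most to better boundary prediction in depth maps?
RQ5: Does reducing the semantic gap in skip connections lead to sharper and more accurate depth predictions?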

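Key Results

HR-Depth achieved state-of-the-art performance on the KITTI dataset at high resolution (1024×320), with an absolute relative error (Abs Rel) of 0.104, surpassing previous self-supervised methods.
The lightweight version, Lite-HR-Depth, reached Abs Rel 0.104 at 1280×384 resolution with only 3.1M parameters, matching the performance of the 14.84M-parameter Monodepth2 model.
Ablation experiments confirmed that combining dense skip connections with fSE blocks reduces Abs Rel by 0.006 relative to the Monodepth2 baseline.
Compared with a standard SE block, the fSE block cut the parameter increase caused by dense skip connections by 15% while also improving performance.
Feature visualizations showed that dense skip connections substantially narrow the semantic gap between encoder and decoder features, yielding richer, higher-resolution semantic representations.
Knowledge distillation with a teacher network improved the lightweight model, which achieved Abs Rel 0.105 at 1024×320 resolution.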
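
For readers unfamiliar with the metric quoted above, the following sketch shows how Abs Rel is commonly computed: the mean of |predicted − ground truth| / ground truth over pixels with valid ground truth. The function name abs_rel and the toy depth values are illustrative assumptions; actual KITTI evaluation protocols also apply depth caps, crops, and scale alignment that are not shown here. The last lines check the parameter ratio implied by the figures reported in this review (3.1M versus 14.84M).

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Toy depths in metres; a ground-truth value of 0.0 marks a pixel with no measurement.
gt = [10.0, 20.0, 0.0, 40.0]
pred = [11.0, 18.0, 5.0, 40.0]
score = abs_rel(pred, gt)
print(f"Abs Rel = {score:.4f}")

# Parameter ratio from the figures reported in this review.
ratio = 3.1e6 / 14.84e6
print(f"Lite-HR-Depth / Monodepth2 parameters = {ratio:.1%}")
```

The ratio comes out at roughly one fifth, consistent with the abstract's statement that the lightweight network uses only about 20% of the parameters.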


This review was generated by AI and reviewed by a human editor.