QUICK REVIEW

[Paper Review] Learning from a tiny dataset of manual annotations: a teacher/student approach for surgical phase recognition

Tong Yu, Didier Mutter | arXiv (Cornell University) | 2018-11-30

Surgical Simulation and Training | References: 19 | Citations: 44

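One-line summary

The paper proposes a teacher/student semi-supervised learning framework in which a CNN-biLSTM-CRF teacher generates synthetic labels for unannotated videos and a real-time CNN-LSTM student is trained on them, and it shows that performance improves even with very few manual annotations.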

ABSTRACT

Vision algorithms capable of interpreting scenes from a real-time video stream are necessary for computer-assisted surgery systems to achieve context-aware behavior. In laparoscopic procedures one particular algorithm needed for such systems is the identification of surgical phases, for which the current state of the art is a model based on a CNN-LSTM. A number of previous works using models of this kind have trained them in a fully supervised manner, requiring a fully annotated dataset. Instead, our work confronts the problem of learning surgical phase recognition in scenarios presenting scarce amounts of annotated data (under 25% of all available video recordings). We propose a teacher/student type of approach, where a strong predictor called the teacher, trained beforehand on a small dataset of ground truth-annotated videos, generates synthetic annotations for a larger dataset, which another model - the student - learns from. In our case, the teacher features a novel CNN-biLSTM-CRF architecture, designed for offline inference only. The student, on the other hand, is a CNN-LSTM capable of making real-time predictions. Results for various amounts of manually annotated videos demonstrate the superiority of the new CNN-biLSTM-CRF predictor as well as improved performance from the CNN-LSTM trained using synthetic labels generated for unannotated videos. For both offline and online surgical phase recognition with very few annotated recordings available, this new teacher/student strategy provides a valuable performance improvement by efficiently leveraging the unannotated data.

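Motivation and objectives

Solve surgical phase recognition using only a very small set of manually annotated videos.
Propose a teacher/student framework in which a strong offline predictor generates synthetic labels for unannotated videos.
Show that synthetic labels improve the real-time CNN-LSTM student and bring it close to the fully supervised baseline.
Compare offline and online inference within the same framework.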

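Proposed method

Introduces a CNN-biLSTM-CRF teacher, used for offline inference, to generate synthetic annotations.
Uses a ResNet-50 v2 CNN to extract 2048-d visual features from each frame.
Adds a bidirectional LSTM to capture future context and a linear-chain CRF to model phase transitions.
Trains the CNN-LSTM student, used for real-time prediction, on a mixed dataset of ground-truth labels and teacher-generated labels (G_{i,j}).
Trains the teacher end to end with backpropagation through time and applies data augmentation.
Evaluates several small training sets, ranging in size from 1 to 80, on the cholec120 dataset, which has 7 phase labels.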
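
The teacher-to-student flow listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame scores that a CNN-biLSTM would output are stubbed as precomputed numbers, the transition scores are set by hand rather than learned, and `viterbi_decode` and `build_student_training_set` are hypothetical names. Viterbi decoding is the standard way to read the best label sequence out of a linear-chain CRF; this summary does not state which decoding procedure the paper uses.

```python
from typing import Dict, List

Scores = List[List[float]]  # scores[t][k]: higher means phase k is more likely


def viterbi_decode(emissions: Scores, transitions: Scores) -> List[int]:
    """Return the highest-scoring phase sequence of a linear-chain model.

    emissions[t][k]   score of phase k at frame t (stub for biLSTM output)
    transitions[j][k] score of moving from phase j to phase k (hand-set here)
    """
    if not emissions:
        return []
    num_phases = len(emissions[0])
    best = list(emissions[0])  # best path score ending in each phase so far
    backpointers: List[List[int]] = []
    for frame_scores in emissions[1:]:
        new_best, pointers = [], []
        for k in range(num_phases):
            # Best previous phase to arrive from when landing in phase k.
            prev = max(range(num_phases), key=lambda j: best[j] + transitions[j][k])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][k] + frame_scores[k])
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final phase to recover the full path.
    phase = max(range(num_phases), key=lambda k: best[k])
    path = [phase]
    for pointers in reversed(backpointers):
        phase = pointers[phase]
        path.append(phase)
    path.reverse()
    return path


def build_student_training_set(
    annotated: Dict[str, List[int]],
    unannotated: Dict[str, Scores],
    transitions: Scores,
) -> Dict[str, List[int]]:
    """Merge ground-truth labels with teacher-decoded synthetic labels."""
    labels = dict(annotated)  # manually annotated videos keep their labels
    for video_id, emissions in unannotated.items():
        labels[video_id] = viterbi_decode(emissions, transitions)
    return labels


# Toy example: two phases, with a strong penalty for going back from 1 to 0.
toy_transitions = [[0.0, -1.0], [-5.0, 0.0]]
toy_emissions = [[2.0, 0.0], [0.0, 2.0], [0.6, 0.0], [0.0, 2.0]]
print(viterbi_decode(toy_emissions, toy_transitions))  # [0, 1, 1, 1]
```

In the toy example a frame-by-frame argmax would give `[0, 1, 0, 1]`; the transition penalty removes the one-frame flicker back to phase 0, which is the kind of temporal consistency a CRF layer is meant to add. The student would then be trained on the merged label dictionary exactly as if every label were ground truth.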

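Experimental results

Research questions

RQ1: Can a teacher model trained on sparse annotations generate useful synthetic labels for unannotated videos in surgical phase recognition?
RQ2: Does training on teacher-generated labels improve the real-time CNN-LSTM student compared with training on the sparse ground-truth data alone?
RQ3: As the amount of annotated data grows, how close can semi-supervised learning get to fully supervised performance?
RQ4: How does the difference between the teacher architecture (CNN-biLSTM-CRF) and simpler models affect offline and online prediction performance?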

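Key findings

The CNN-biLSTM-CRF teacher outperforms its ablated variants and is the strongest predictor among the offline models.
The CNN-LSTM improves substantially when trained with teacher-generated synthetic labels rather than ground truth alone, narrowing the data gap.
Even with only 20 manually annotated videos, the CNN-biLSTM-CRF reaches 84.1% accuracy and 75.8% F1 on the test set, approaching the fully supervised 89.5% accuracy and 82.5% F1.
The quality of the teacher-generated annotations improves as more manually annotated videos become available, which makes the G_{i,j} sets increasingly reliable for training the student.
With synthetic labels, the gap between 20 and 80 ground-truth videos for the online CNN-LSTM predictor is halved, and replacing the student with an offline predictor can close the gap entirely.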
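
To make "approaching" concrete, the remaining gap can be worked out from the four test-set figures quoted above. The snippet below only does that subtraction; the dictionary names are illustrative, and no numbers beyond those in this review are used.

```python
# Test-set figures quoted in this review, in percent.
teacher_20_videos = {"accuracy": 84.1, "f1": 75.8}  # CNN-biLSTM-CRF, 20 annotated videos
fully_supervised = {"accuracy": 89.5, "f1": 82.5}

# Remaining gap in percentage points, rounded to hide float noise.
gaps = {
    metric: round(fully_supervised[metric] - teacher_20_videos[metric], 1)
    for metric in teacher_20_videos
}
print(gaps)  # {'accuracy': 5.4, 'f1': 6.7}
```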

This review was generated by AI and reviewed by a human editor.