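[Paper Summary] Unsupervised temporal context learning using convolutional neural networks for laparoscopic workflow analysis
This paper proposes an unsupervised pretraining method for convolutional neural networks (CNNs) that learns temporal context from unlabeled laparoscopic video frames, with no manual annotation required. By training a CNN to predict the temporal order of image pairs extracted from unlabeled videos, the model learns discriminative features that are useful for surgical workflow segmentation, achieving state-of-the-art performance on cholecystectomy and colorectal surgery datasets while requiring very little labeled data.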
Computer-assisted surgery (CAS) aims to provide the surgeon with the right type of assistance at the right moment. Such assistance systems are especially relevant in laparoscopic surgery, where CAS can alleviate some of the drawbacks that surgeons incur. For many assistance functions, e.g. displaying the location of a tumor at the appropriate time or suggesting what instruments to prepare next, analyzing the surgical workflow is a prerequisite. Since laparoscopic interventions are performed via endoscope, the video signal is an obvious sensor modality to rely on for workflow analysis. Image-based workflow analysis tasks in laparoscopy, such as phase recognition, skill assessment, video indexing or automatic annotation, require a temporal distinction between video frames. Generally computer vision based methods that generalize from previously seen data are used. For training such methods, large amounts of annotated data are necessary. Annotating surgical data requires expert knowledge, therefore collecting a sufficient amount of data is difficult, time-consuming and not always feasible. In this paper, we address this problem by presenting an unsupervised method for training a convolutional neural network (CNN) to differentiate between laparoscopic video frames on a temporal basis. We extract video frames at regular intervals from 324 unlabeled laparoscopic interventions, resulting in a dataset of approximately 2.2 million images. From this dataset, we extract image pairs from the same video and train a CNN to determine their temporal order. To solve this problem, the CNN has to extract features that are relevant for comprehending laparoscopic workflow. Furthermore, we demonstrate that such a CNN can be adapted for surgical workflow segmentation. We performed image-based workflow segmentation on a publicly available dataset of 7 cholecystectomies and 9 colorectal interventions.
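Research Motivation and Objectives
- Address the shortage of annotated laparoscopic video data available for training surgical workflow analysis models.
- Develop a method that learns temporal representations from unlabeled laparoscopic videos without expert annotation.
- Enable transfer learning to downstream tasks, such as surgical phase detection, through self-supervised pretraining.
- Demonstrate the effectiveness of this pretraining approach on complex, long procedures such as colorectal surgery.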
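Proposed Method
- Extract approximately 2.2 million video frames at regular intervals from 324 unlabeled laparoscopic interventions.
- Form image pairs from the same video sequence and pose a binary classification task: predict which image occurs earlier in time.
- Train the CNN end to end to classify the temporal order of these image pairs, which forces the model to learn discriminative spatio-temporal features.
- Fine-tune the pretrained CNN with a GRU-based architecture that models sequential dependencies for surgical phase segmentation.
- Combine the pretrained features with recurrent modeling to improve performance on the phase detection task.
- Evaluate the method on two public datasets, 7 cholecystectomies and 9 colorectal interventions, and report phase-level performance.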
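The pair-construction step listed above can be illustrated with a minimal sketch. It only shows how same-video frame pairs and their binary order labels might be generated. The function name `sample_temporal_pairs`, the `(video_id, frame_index)` frame identifier, and the uniform sampling strategy are assumptions made for illustration; the summary does not state how the authors actually sampled pairs.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical representation: a frame is identified by (video_id, frame_index).
Frame = Tuple[str, int]
Pair = Tuple[Frame, Frame, int]


def sample_temporal_pairs(
    frames_per_video: Dict[str, int],
    pairs_per_video: int,
    rng: random.Random,
) -> List[Pair]:
    """Sample frame pairs from the same video with a binary order label.

    Label 1 means the first frame of the pair occurs earlier in the video,
    label 0 means the second frame occurs earlier.
    """
    pairs: List[Pair] = []
    for video_id, n_frames in frames_per_video.items():
        if n_frames < 2:
            continue  # a pair needs two distinct frames
        for _ in range(pairs_per_video):
            # Two distinct frame indices in random order, so both labels occur.
            i, j = rng.sample(range(n_frames), 2)
            label = 1 if i < j else 0
            pairs.append(((video_id, i), (video_id, j), label))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    demo = sample_temporal_pairs({"video_a": 100, "video_b": 50}, 3, rng)
    for first, second, label in demo:
        print(first, second, label)
```

Because the label comes directly from the frame indices, no human annotation is involved. In a full pipeline, each pair would be loaded as two images and passed to the CNN, which is outside the scope of this sketch.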
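Experimental Results
Research Questions
- RQ1: Can a CNN learn meaningful temporal representations from laparoscopic videos without any manual annotation?
- RQ2: Does unsupervised pretraining on temporal order prediction improve performance on surgical phase segmentation?
- RQ3: How does the method compare with supervised baselines when only a small amount of labeled data is available?
- RQ4: Does the pretrained model generalize to complex, non-standardized procedures such as colorectal surgery?
- RQ5: What is the effect of adding recurrent modeling (e.g. a GRU) on top of the pretrained features for sequential surgical workflow analysis?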
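Key Findings
- The unsupervised pretraining method achieved state-of-the-art performance on the public cholecystectomy dataset, outperforming the method of Dergachyova et al. and the CNN-only EndoNet.
- On the colorectal dataset, the pretrained model clearly outperformed a randomly initialized CNN, demonstrating transferability under high variability between surgeons.
- The GRU architecture combined with pretrained features achieved the highest performance, with a mean F1 score of 80.8% on the cholecystectomy dataset and 88.2% on phase P6 of the colorectal dataset.
- Phases 4 and 7 of the colorectal dataset had the lowest performance (F1 of 57.7% and 55.7%, respectively), mainly because they are short and easily confused with adjacent phases.
- By using temporal order as the only supervisory signal, the method substantially reduces the dependence on costly manual annotation and enables effective pretraining on large-scale unlabeled data.
- The output of the final fully connected layer (fc6) can serve as a compact and meaningful representation for video indexing and retrieval in surgical video databases.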
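The findings above are reported as F1 scores per phase and as a mean over phases. As a reminder of how such numbers are typically obtained from frame-level predictions, here is a minimal sketch. The function names and the unweighted (macro) averaging are assumptions for illustration; this summary does not specify the exact evaluation protocol used in the paper.

```python
from typing import Dict, Sequence


def per_phase_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, float]:
    """Compute the frame-level F1 score for each phase label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores: Dict[int, float] = {}
    for phase in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == phase and p == phase)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != phase and p == phase)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == phase and p != phase)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the phase never occurs.
        scores[phase] = (2 * tp / denom) if denom else 0.0
    return scores


def mean_f1(scores: Dict[int, float]) -> float:
    """Unweighted (macro) average of the per-phase F1 scores."""
    return sum(scores.values()) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy frame-level labels for three phases (not data from the paper).
    truth = [0, 0, 1, 1, 2, 2]
    preds = [0, 0, 1, 2, 2, 2]
    scores = per_phase_f1(truth, preds)
    print(scores, mean_f1(scores))
```

Short phases contribute few frames, so a handful of misclassified frames near a phase boundary can lower their F1 considerably, which is consistent with the explanation given for phases 4 and 7.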
This summary was generated by AI and reviewed by a human editor.