[Paper Review] Self-Supervised Surgical Tool Segmentation using Kinematic Information
The paper presents SSTS, a self-supervised method that uses a robot's kinematic model to generate training labels for FCN-based surgical tool segmentation, achieving near fully-supervised performance without manual annotations.
Surgical tool segmentation in endoscopic images is the first step towards pose estimation and (sub-)task automation in challenging minimally invasive surgical operations. While many approaches in the literature have shown great results using modern machine learning methods such as convolutional neural networks, the main bottleneck lies in the acquisition of a large number of manually-annotated images for efficient learning. This is especially true in surgical context, where patient-to-patient differences impede the overall generalizability. In order to cope with this lack of annotated data, we propose a self-supervised approach in a robot-assisted context. To our knowledge, the proposed approach is the first to make use of the kinematic model of the robot in order to generate training labels. The core contribution of the paper is to propose an optimization method to obtain good labels for training despite an unknown hand-eye calibration and an imprecise kinematic model. The labels can subsequently be used for fine-tuning a fully-convolutional neural network for pixel-wise classification. As a result, the tool can be segmented in the endoscopic images without needing a single manually-annotated image. Experimental results on phantom and in vivo datasets obtained using a flexible robotized endoscopy system are very promising.
Motivation & Objective
- Address the lack of annotated data in surgical tool segmentation by leveraging robot kinematics as a labeling signal.
- Develop a method to estimate a useful hand-eye transformation despite kinematic/model errors.
- Fine-tune a lightweight FCN online to perform pixel-wise segmentation using self-generated labels.
- Validate the approach on phantom and in vivo endoscopic datasets with flexible continuum robots.
Proposed method
- Model-based label generation: project the robot, with its estimated shape, into the image using a transformation T and the kinematic model to obtain projected labels y(q, T).
- Grabcut-based optimization: maximize an F'1 score between the Grabcut output and the projected labels by optimizing T over SE(3) with a stochastic branch-and-bound search.
- Two-step workflow: (i) compute T* to align model projection with image observations, (ii) use resulting projections to train a Fully Convolutional Network (FCN) for pixel-wise segmentation.
- FCN architecture: ResNet18-based backbone with two upsampling paths to produce per-pixel scores, trained with a weighted cross-entropy loss and L2 regularization.
- Online fine-tuning: perform data augmentation and end-to-end training to adapt the FCN to the specific surgery and imaging conditions.
- Post-processing: apply Conditional Random Fields to refine FCN segmentation outputs.
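Two ingredients of the method above can be sketched in plain Python: an overlap score between a projected label mask and a Grabcut mask, and a class-weighted cross-entropy. This is a minimal sketch under stated assumptions, not the paper's implementation. It uses the standard F1 (Dice) score because the review does not spell out the paper's F'1 variant, it replaces the stochastic branch-and-bound search over SE(3) with a pick-the-best over two hand-made candidate masks, and the function names, class weights, and weight-sum normalization are all hypothetical.

```python
import math


def f1_agreement(projected, grabcut):
    """Standard F1 (Dice) overlap between two binary masks given as lists of rows.

    Assumption: the paper's F'1 is a variant of this score; its exact form
    is not given in the review, so the plain F1 is used here.
    """
    tp = fp = fn = 0
    for row_p, row_g in zip(projected, grabcut):
        for p, g in zip(row_p, row_g):
            if p and g:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_cross_entropy(tool_probs, labels, w_tool=5.0, w_bg=1.0):
    """Class-weighted cross-entropy averaged over pixels.

    tool_probs are predicted P(tool) per pixel; labels are 0/1.
    The weights and the normalization by the weight sum are illustrative choices.
    """
    eps = 1e-12
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(tool_probs, labels):
        w = w_tool if y else w_bg
        p_true = p if y else 1.0 - p
        total += -w * math.log(max(p_true, eps))
        weight_sum += w
    return total / weight_sum


# Toy Grabcut output and two hypothetical projections y(q, T) for two candidate T.
grabcut_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
candidates = {
    "shifted": [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]],
    "aligned": [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]],
}

# Stand-in for the search over T: keep the candidate with the highest score.
scores = {name: f1_agreement(mask, grabcut_mask) for name, mask in candidates.items()}
best = max(scores, key=scores.get)

# Loss for one tool pixel predicted at 0.9 and one background pixel predicted at 0.2.
loss = weighted_cross_entropy([0.9, 0.2], [1, 0])
```

Here the shifted projection scores 0.5 and the aligned one 1.0, so `best` is `"aligned"`. In the method as described, the search would range over transformations T rather than over a fixed set of masks.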
Experimental results
Research questions
- RQ1: Can a self-supervised approach using the robot's kinematic model generate reliable labels for surgical tool segmentation without manual annotations?
- RQ2: How effectively can a hand-eye transformation be optimized in the presence of kinematic and calibration errors using a Grabcut-based cost function?
- RQ3: Does FCN fine-tuning with self-generated labels approach the performance of fully supervised learning on phantom and in vivo data?
- RQ4: What is the impact of endoscopic-domain pre-training on segmentation performance for challenging in vivo scenarios?
Key findings
| Dataset | Approach | Accuracy | IoU | Recall | Precision |
|---|---|---|---|---|---|
| Phantom 1 | SSTS | 0.99 | 0.86 | 0.90 | 0.92 |
| Phantom 1 | FSL | 0.99 | 0.87 | 0.92 | 0.93 |
| Phantom 1 | Grabcut | 0.97 | 0.56 | 0.86 | 0.61 |
| Phantom 2 | SSTS | 0.98 | 0.78 | 0.88 | 0.87 |
| Phantom 2 | FSL | 0.98 | 0.84 | 0.88 | 0.94 |
| Phantom 2 | Grabcut | 0.95 | 0.49 | 0.66 | 0.66 |
| In Vivo | SSTS | 0.97 | 0.62 | 0.66 | 0.91 |
| In Vivo | FSL | 0.98 | 0.72 | 0.73 | 0.98 |
| In Vivo | Grabcut | 0.96 | 0.55 | 0.73 | 0.69 |
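The four metric columns can be read with the usual pixel-wise definitions in mind. The sketch below assumes those standard definitions (the review does not state how the paper computes or averages them), and the function name and toy masks are hypothetical. It also illustrates why accuracy stays high in every row of the table: when the tool covers a small part of the image, background pixels dominate accuracy, while IoU only looks at tool pixels.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise accuracy, IoU, recall, and precision for binary masks (lists of rows)."""
    tp = fp = fn = tn = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "iou": tp / (tp + fp + fn) if (tp + fp + fn) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Toy 10x10 image: the tool covers 4 pixels, the prediction finds 2 of them.
truth = [[0] * 10 for _ in range(10)]
pred = [[0] * 10 for _ in range(10)]
for col in range(4):
    truth[0][col] = 1
for col in range(2):
    pred[0][col] = 1

metrics = segmentation_metrics(pred, truth)
# accuracy is 0.98 even though IoU and recall are only 0.5
```

In this toy case half of the tool is missed, yet accuracy is 0.98, which is why IoU, recall, and precision are the more informative columns for comparing approaches.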
- The Grabcut-based cost used to optimize T* correlates with IoU against the ground truth (GT) on both phantom and in vivo datasets, so it can select meaningful labels without access to ground truth.
- SSTS comes close to fully supervised learning (FSL) on the phantom datasets and trails it more clearly in vivo: the IoU gap is 0.01 on phantom 1, 0.06 on phantom 2, and 0.10 in vivo.
- On phantom 1, SSTS achieves 0.99 accuracy and 0.86 IoU, close to FSL which achieves 0.99 accuracy and 0.87 IoU.
- On phantom 2, SSTS achieves 0.98 accuracy and 0.78 IoU, close to FSL which achieves 0.98 accuracy and 0.84 IoU.
- On in vivo data, SSTS achieves 0.97 accuracy and 0.62 IoU, with FSL at 0.98 accuracy and 0.72 IoU; the Grabcut baseline is lower still at 0.55 IoU.
- Pre-training on endoscopic data improves ROC performance over ImageNet pre-training, highlighting the benefit of domain-specific pre-training.
This review was created by AI and reviewed by human editors.