QUICK REVIEW

[Paper Review] SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning

Jiayi Wang, Hadrien Reynaud|arXiv (Cornell University)|2026. 02. 25.

COVID-19 diagnosis using AI|Citations: 0

One-line summary

SigVLP trains a 3D CT vision-language model on chunked volumes with Rotary Position Embedding, enabling fine-grained text-volume alignment and strong downstream performance without resampling.

ABSTRACT

Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimension. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input size. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.

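Motivation and objectives

Learn robust, general-purpose embeddings for volumetric medical data that varies across devices and protocols, without losing z-axis information.
Develop a chunk-based, organ-aware pre-training strategy that aligns partial-volume observations with organ-wise radiology findings.
Remove the fixed-length z-axis constraint by using Rotary Position Embedding to handle variable input sizes.
Scale volumetric VLP by pre-training on the CT-RATE 3D CT dataset, and release the organ-wise observations extracted from the reports.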

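Proposed method

Treat each 3D CT volume as a sequence of 3D chunks to avoid z-axis resampling.
Apply RoPE (Rotary Position Embedding) directly inside the attention operation to allow dynamic input lengths.
Build organ-wise text supervision on the fly by decomposing reports into organ-wise findings with a lightweight LLM-assisted pipeline.
Train with the Muon optimizer for stability under variable-length inputs and chunk-based supervision.
Assess vision-language alignment through zero-shot abnormality classification, organ segmentation, linear probing, and text-image retrieval.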
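
The chunking and RoPE steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the chunk depth of 8, the frequency base of 10000, the use of the chunk index along z as the position, and the mean-pooling stand-in for a learned chunk encoder are all assumptions made for illustration. It only shows why rotating queries and keys inside attention lets volumes with different slice counts share one code path.

```python
import numpy as np

def split_into_chunks(volume, chunk_depth):
    """Split a (Z, H, W) volume into consecutive chunks along z.

    No resampling: the final chunk is simply shorter when Z is not a
    multiple of chunk_depth.
    """
    return [volume[z:z + chunk_depth] for z in range(0, volume.shape[0], chunk_depth)]

def rope_rotate(x, positions, base=10000.0):
    """Rotate feature pairs of x (seq_len, dim) by position-dependent angles.

    dim must be even; feature i is paired with feature i + dim // 2.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)      # one frequency per feature pair
    angles = positions[:, None] * freqs[None, :]   # (seq_len, half), built on the fly
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_attention_scores(q, k):
    """Scaled dot-product scores with RoPE applied to queries and keys."""
    positions = np.arange(q.shape[0], dtype=np.float64)  # assumed: chunk index along z
    q_rot, k_rot = rope_rotate(q, positions), rope_rotate(k, positions)
    return q_rot @ k_rot.T / np.sqrt(q.shape[-1])

# Two hypothetical scans with different slice counts use the same functions.
rng = np.random.default_rng(0)
for depth in (37, 120):
    volume = rng.standard_normal((depth, 16, 16))
    chunks = split_into_chunks(volume, chunk_depth=8)
    # Stand-in for a learned chunk encoder: mean over z, then flatten to one token.
    tokens = np.stack([c.mean(axis=0).ravel() for c in chunks])
    scores = rope_attention_scores(tokens, tokens)
    print(depth, len(chunks), scores.shape)
```

Because the rotation angle depends only on position, the score between a query and a key depends on their offset along z rather than on absolute indices, so no fixed-length position table has to be learned or interpolated.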

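Experimental results

Research questions

RQ1: Can chunk-wise, organ-aware supervision improve text-volume alignment on 3D CT data compared with conditioning on the whole volume?
RQ2: Does RoPE help the model handle variable-length CT volumes robustly without resampling during preprocessing?
RQ3: How do CT-volume VLP representations perform against CT-specific baselines on zero-shot abnormality classification, segmentation, and retrieval?
RQ4: Do organ-wise observations generated on the fly improve fine-grained multimodal understanding of anatomy?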

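Key results

SigVLP substantially improves on the baseline for CT-volume-to-radiology-report retrieval (mean rank 8.23 vs. 26.01 for CT-Clip).
Linear-probing classification and segmentation are competitive with or better than the baselines across organs and tasks, with the largest gains on small to medium-sized structures.
The embeddings evolve from an initial color-based separation into a smooth, structured space after extended training (234,930 steps), indicating rich compositional representations.
Chunk-based, organ-aligned supervision yields higher precision on localization tasks than global volume conditioning.
RoPE-based modeling supports variable-length inputs with almost no overhead and improves volumetric consistency in the representations.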
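
For readers unfamiliar with the retrieval metric, the sketch below shows one common way to compute a mean rank from a volume-to-report similarity matrix. The review does not give the paper's evaluation protocol, so the function name `mean_rank`, the 1-based ranking, and the assumption that the matching report for volume i sits at index i are illustrative choices. Lower is better under this definition.

```python
import numpy as np

def mean_rank(similarity):
    """Mean 1-based rank of the matching report for each query volume.

    similarity[i, j] is the score between volume i and report j; the
    ground-truth report for volume i is assumed to sit at index i.
    """
    similarity = np.asarray(similarity, dtype=np.float64)
    order = np.argsort(-similarity, axis=1)             # best-scoring report first
    targets = np.arange(similarity.shape[0])[:, None]
    ranks = np.argmax(order == targets, axis=1) + 1     # position of the true report
    return float(ranks.mean())

# Toy example with made-up scores: the true reports land at ranks 1, 3 and 1.
toy_scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
print(mean_rank(toy_scores))
```

A perfect retriever would score 1.0 on this metric, and a random one would sit near half the number of candidate reports.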

This review was generated by AI and reviewed by a human editor.