QUICK REVIEW

[Paper Review] SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch

Ron Shapira Weber, Oren Freifeld|arXiv (Cornell University)|2026. 02. 19.

Time Series Analysis and Forecasting | Citations: 0

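One-line summary

A memory-efficient PyTorch CUDA implementation of SoftDTW that removes the 1024 sequence-length cap and uses a log-space backward pass. It offers fused and unfused modes for trading speed against memory use, and includes PyTorch autograd support and Soft-DTW barycenter computation.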

ABSTRACT

We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the O(BNM) intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft-DTW Barycenter computation. Code is available at https://github.com/BGU-CS-VIL/sdtw-cuda-torch.

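Motivation and goals

Implement GPU-accelerated SoftDTW without the 1024 sequence-length cap.
Reformulate the backward pass in log space to improve its numerical stability for small gamma.
Reduce GPU memory use by avoiding materialization of the full distance tensor.
Keep full PyTorch autograd compatibility and support Soft-DTW barycenter computation.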

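Proposed method

Tiled anti-diagonal forward pass that launches a separate kernel for each anti-diagonal, removing the sequence-length constraint.
Log-space backward pass that uses logsumexp to prevent overflow, with the final exp applied after the backward DP.
Fused distance-computation mode that recomputes distances on the fly, reducing memory use from O(BNM) to O(B(N+M)).
Unfused mode that precomputes and stores the full distance tensor for fast lookup during the DP.
Soft-DTW barycenter computation via gradient-based optimization (Adam).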
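
The anti-diagonal scheduling and the fused/unfused choice described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's CUDA code: the function names (`softmin`, `sq_dist`, `soft_dtw_wavefront`) are hypothetical, the squared Euclidean cost is assumed because the review does not name the metric, and the inner loop stands in for the work a GPU would run in parallel across one anti-diagonal.

```python
import math

def softmin(a, b, c, gamma):
    """Smoothed minimum -gamma * log(sum(exp(-x / gamma))), computed stably."""
    m = min(a, b, c)
    if math.isinf(m):
        return m
    s = sum(math.exp(-(x - m) / gamma) for x in (a, b, c))
    return m - gamma * math.log(s)

def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors (assumed cost)."""
    return sum((p - q) ** 2 for p, q in zip(u, v))

def soft_dtw_wavefront(x, y, gamma=1.0, fused=True):
    """Soft-DTW value for sequences x (length n) and y (length m).

    fused=True  : each cost is recomputed when its cell is visited.
    fused=False : the full n x m cost matrix is built up front.
    """
    n, m = len(x), len(y)
    D = None if fused else [[sq_dist(xi, yj) for yj in y] for xi in x]
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    # Cells with i + j == d depend only on diagonals d - 1 and d - 2,
    # so every cell of one anti-diagonal could be updated independently.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            cost = sq_dist(x[i - 1], y[j - 1]) if fused else D[i - 1][j - 1]
            R[i][j] = cost + softmin(R[i - 1][j], R[i][j - 1], R[i - 1][j - 1], gamma)
    return R[n][m]
```

Both modes return the same value; they differ only in whether the cost matrix is ever stored. Note that this sketch still keeps the full DP table `R`, so it illustrates the scheduling and the recomputation idea rather than the memory bound quoted above.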

Figure 1: Benchmark results for batch size $B=32$. Top row: Peak GPU memory (MB) as a function of sequence length $L$ (left, $D=128$) and feature dimension $D$ (right, $L=256$). Bottom row: Wall-clock runtime (ms) for the corresponding configurations. Maghoumi’s implementation is unavailable for

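Experimental results

Research questions

RQ1: Can SoftDTW be computed on the GPU for arbitrarily long sequences, without a hard length cap?
RQ2: Does reformulating the backward pass in log space improve numerical stability for small smoothing parameters?
RQ3: Can fusing the distance computation, so that distances are recomputed on the fly, substantially reduce memory use, and what is the runtime trade-off?
RQ4: Is PyTorch autograd-compatible SoftDTW practical for barycenter computation on large datasets?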

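Key results

The proposed tiled anti-diagonal execution removes the 1024 length constraint, enabling N, M > 1024 on the GPU.
The log-space backward pass prevents overflow and NaNs at small gamma values, improving numerical stability.
Fused mode cuts memory use by 40–98% relative to unfused mode, at the cost of a 10–15x slower runtime.
Unfused mode remains the faster option but stores the full distance tensor, so fused mode is preferred when GPU memory is the bottleneck.
The implementation supports full PyTorch autograd integration and Soft-DTW barycenter computation.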
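
To make the log-space backward pass concrete, here is a small pure-Python sketch. It assumes the standard soft-DTW backward recursion over a cost matrix `D` and simply carries every intermediate quantity as a logarithm, taking a single `exp` at the very end. The function names are hypothetical and this is not the library's kernel; it only shows the shape of the technique.

```python
import math

NEG_INF = -math.inf

def logsumexp(values):
    """Stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))

def soft_dtw_forward(D, gamma):
    """Padded (n+2) x (m+2) DP table R for an n x m cost matrix D."""
    n, m = len(D), len(D[0])
    R = [[math.inf] * (m + 2) for _ in range(n + 2)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # softmin(a, b, c) = -gamma * logsumexp(-a/gamma, -b/gamma, -c/gamma)
            prev = [-R[i - 1][j] / gamma, -R[i][j - 1] / gamma, -R[i - 1][j - 1] / gamma]
            R[i][j] = D[i - 1][j - 1] - gamma * logsumexp(prev)
    return R

def soft_dtw_grad_logspace(D, gamma):
    """Gradient of the soft-DTW value with respect to each entry of D."""
    n, m = len(D), len(D[0])
    R = soft_dtw_forward(D, gamma)
    Dp = [[0.0] * (m + 2) for _ in range(n + 2)]  # zero-padded copy of D
    for i in range(n):
        for j in range(m):
            Dp[i + 1][j + 1] = D[i][j]
    # Boundary values for the backward sweep.
    for i in range(1, n + 1):
        R[i][m + 1] = NEG_INF
    for j in range(1, m + 1):
        R[n + 1][j] = NEG_INF
    R[n + 1][m + 1] = R[n][m]
    logE = [[NEG_INF] * (m + 2) for _ in range(n + 2)]
    logE[n + 1][m + 1] = 0.0  # log(1)
    for j in range(m, 0, -1):
        for i in range(n, 0, -1):
            # Log transition weights towards the three successor cells.
            la = (R[i + 1][j] - R[i][j] - Dp[i + 1][j]) / gamma
            lb = (R[i][j + 1] - R[i][j] - Dp[i][j + 1]) / gamma
            lc = (R[i + 1][j + 1] - R[i][j] - Dp[i + 1][j + 1]) / gamma
            logE[i][j] = logsumexp([logE[i + 1][j] + la,
                                    logE[i][j + 1] + lb,
                                    logE[i + 1][j + 1] + lc])
    # The only exponential of a DP quantity is taken here, after the sweep.
    return [[math.exp(logE[i][j]) for j in range(1, m + 1)] for i in range(1, n + 1)]
```

A quick sanity check is to compare the returned matrix against central finite differences of `soft_dtw_forward(D, gamma)[n][m]` on a small `D`, and to confirm that the entries stay finite when `gamma` is made very small.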

Figure 2 : SoftDTW Barycenter on synthetic block-wave data.


This review was generated by AI and reviewed by a human editor.