QUICK REVIEW

[Paper Review] SoftDTW-CUDA-Torch: Memory-Efficient GPU-Accelerated Soft Dynamic Time Warping for PyTorch

Ron Shapira Weber, Oren Freifeld|arXiv (Cornell University)|Feb 19, 2026

Time Series Analysis and Forecasting | Cited by 0

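One-Sentence Summary

A memory-efficient CUDA implementation of SoftDTW for PyTorch that removes the 1024 sequence-length limit, runs the backward pass in log space, and offers fused and unfused modes to trade memory against speed, with support for PyTorch autograd and Soft-DTW barycenters.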

ABSTRACT

We present softdtw-cuda-torch, an open-source PyTorch library for computing Soft Dynamic Time Warping (SoftDTW) on GPUs. Our implementation addresses three key limitations of existing GPU implementations of SoftDTW: a hard sequence-length cap of 1024, numerical instability in the backward pass for small smoothing parameters, and excessive GPU memory consumption from materializing pairwise distance tensors. We introduce (1) tiled anti-diagonal kernel execution that removes the sequence-length constraint, (2) a log-space backward pass that prevents floating-point overflow, and (3) a fused distance-computation mode that eliminates the O(BNM) intermediate distance tensor, achieving up to 98% memory reduction compared to prior work. The library supports arbitrary sequence lengths, full PyTorch autograd integration, and Soft-DTW Barycenter computation. Code is available at https://github.com/BGU-CS-VIL/sdtw-cuda-torch.

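Motivation and Objectives

Provide GPU-accelerated SoftDTW without the 1024 sequence-length cap.
Improve the numerical stability of the backward pass when the smoothing parameter gamma is small.
Reduce GPU memory usage by avoiding explicit construction of the full pairwise distance tensor.
Remain fully compatible with PyTorch autograd and support Soft-DTW barycenter computation.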
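
For readers unfamiliar with where gamma enters, the standard SoftDTW recursion is restated below. This is the textbook formulation rather than anything taken from the library's source; the squared Euclidean cost is a common choice and is assumed here, and the symbols $x$, $y$, $d_{i,j}$, and $R_{i,j}$ are notation introduced for this note.

$$
\begin{aligned}
d_{i,j} &= \lVert x_i - y_j \rVert_2^2, \qquad 1 \le i \le N,\; 1 \le j \le M,\\
\operatorname{softmin}_\gamma(a_1,\dots,a_K) &= -\gamma \log \sum_{k=1}^{K} \exp\!\left(-\frac{a_k}{\gamma}\right),\\
R_{0,0} &= 0, \qquad R_{i,0} = R_{0,j} = +\infty \quad (i, j \ge 1),\\
R_{i,j} &= d_{i,j} + \operatorname{softmin}_\gamma\!\left(R_{i-1,j},\, R_{i,j-1},\, R_{i-1,j-1}\right),\\
\operatorname{SoftDTW}_\gamma(x, y) &= R_{N,M}.
\end{aligned}
$$

As $\gamma \to 0$ the soft minimum approaches the hard minimum and the recursion recovers classical DTW. The arguments of the exponentials scale as $1/\gamma$, which is why small values of gamma put pressure on floating-point range.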

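Proposed Method

Tiled anti-diagonal forward pass that removes the sequence-length constraint by launching a separate kernel for each anti-diagonal.
Log-space backward pass that uses logsumexp to prevent overflow and applies the final exp only after the backward DP completes.
Fused distance-computation mode that recomputes distances on the fly, reducing memory from O(BNM) to O(B(N+M)).
Unfused mode that precomputes and stores the full distance tensor for fast lookups during the DP.
SoftDTW barycenter computation via gradient-based optimization (Adam).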
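
The anti-diagonal ordering and the on-the-fly distance computation listed above can be illustrated with a small CPU sketch. This is not the library's CUDA kernel: it is a plain-Python illustration for a single pair of sequences, and the function names (`softmin`, `sq_dist`, `soft_dtw_wavefront`) as well as the squared Euclidean cost are assumptions made for this example. The point it shows is that all cells with the same `i + j` depend only on earlier anti-diagonals, so on a GPU each anti-diagonal could be processed in parallel.

```python
import math


def softmin(values, gamma):
    """Stable soft minimum: -gamma * log(sum(exp(-v / gamma))).

    Shifting by the smallest value keeps every exponent <= 0,
    so exp() cannot overflow even for very small gamma.
    """
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    lo = min(finite)
    total = sum(math.exp(-(v - lo) / gamma) for v in finite)
    return lo - gamma * math.log(total)


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed cost)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def soft_dtw_wavefront(x, y, gamma=1.0):
    """SoftDTW value for sequences x (N x D) and y (M x D), swept by anti-diagonals."""
    n, m = len(x), len(y)
    # DP table with an extra border row and column holding the boundary conditions.
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    # Cells with i + j == k form one anti-diagonal; they are mutually independent.
    for k in range(2, n + m + 1):
        for i in range(max(1, k - m), min(n, k - 1) + 1):
            j = k - i
            # "Fused" style: the distance is computed when needed instead of
            # being read from a precomputed N x M matrix.
            d = sq_dist(x[i - 1], y[j - 1])
            R[i][j] = d + softmin((R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]), gamma)
    return R[n][m]


if __name__ == "__main__":
    x = [[0.0], [1.0]]
    y = [[0.0], [3.0]]
    print(soft_dtw_wavefront(x, y, gamma=1e-3))
```

An unfused variant would build the full table of `sq_dist` values before the sweep and index into it, which trades memory for fewer repeated distance evaluations.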

Figure 1: Benchmark results for batch size $B=32$. Top row: Peak GPU memory (MB) as a function of sequence length $L$ (left, $D=128$) and feature dimension $D$ (right, $L=256$). Bottom row: Wall-clock runtime (ms) for the corresponding configurations. Maghoumi’s implementation is unavailable for

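Experimental Results

Research Questions

RQ1: Can SoftDTW be computed on the GPU for sequences of arbitrary length, without a hard length cap?
RQ2: Does rewriting the backward pass in log space improve numerical stability when gamma is small?
RQ3: Does fusing the distance computation into the DP to save memory introduce a noticeable runtime trade-off?
RQ4: Is autograd-based SoftDTW in PyTorch practical for barycenter computation on large datasets?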

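Key Findings

The proposed tiled anti-diagonal execution removes the 1024 length constraint, so N and M can exceed 1024 on the GPU.
The log-space backward pass avoids overflow and NaNs at small gamma values, improving numerical stability.
Compared with the unfused mode, the fused mode reduces memory by 40–98%, at the cost of a 10–15x longer runtime.
The unfused mode remains faster and is still memory-friendly; the fused mode is preferable when GPU memory becomes the bottleneck.
The implementation integrates fully with PyTorch autograd and supports Soft-DTW barycenter computation.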
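
To give a feel for why the pairwise distance tensor dominates memory at long sequence lengths, the following back-of-the-envelope sketch counts bytes for a B x N x M distance tensor versus the raw inputs. It is an illustration only, not a reproduction of the paper's measurements: it assumes float32 storage, ignores the DP table and any other buffers, and the helper names are hypothetical. The batch size 32 and feature dimension 128 mirror the benchmark configuration named in the Figure 1 caption.

```python
def distance_tensor_bytes(batch, n, m, bytes_per_elem=4):
    """Bytes needed to materialize a (batch, n, m) distance tensor (float32 assumed)."""
    return batch * n * m * bytes_per_elem


def input_bytes(batch, n, m, dim, bytes_per_elem=4):
    """Bytes for the two input batches of shape (batch, n, dim) and (batch, m, dim)."""
    return batch * (n + m) * dim * bytes_per_elem


MIB = 1024 ** 2

if __name__ == "__main__":
    for length in (256, 1024, 4096):
        dist = distance_tensor_bytes(32, length, length) / MIB
        inp = input_bytes(32, length, length, 128) / MIB
        print(f"L={length:5d}  distance tensor: {dist:8.1f} MiB  inputs: {inp:6.1f} MiB")
```

Under these assumptions the distance tensor grows quadratically with sequence length while the inputs grow linearly: at length 4096 the tensor alone takes 2048 MiB against 128 MiB for the inputs.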

Figure 2: SoftDTW Barycenter on synthetic block-wave data.


This review was generated by AI and reviewed by human editors.