QUICK REVIEW

[Paper Review] DiffTaichi: Differentiable Programming for Physical Simulation

Yuanming Hu, Luke Anderson|arXiv (Cornell University)|Oct 1, 2019

Model Reduction and Neural Networks | References: 35 | Cited by: 75

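One-Sentence Summary

DiffTaichi introduces a differentiable programming language for high-performance physical simulation, using a two-scale automatic differentiation system and megakernel fusion to compute end-to-end gradients.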

ABSTRACT

We present DiffTaichi, a new differentiable programming language tailored for building high-performance differentiable physical simulators. Based on an imperative programming language, DiffTaichi generates gradients of simulation steps using source code transformations that preserve arithmetic intensity and parallelism. A light-weight tape is used to record the whole simulation program structure and replay the gradient kernels in a reversed order, for end-to-end backpropagation. We demonstrate the performance and productivity of our language in gradient-based learning and optimization tasks on 10 different physical simulators. For example, a differentiable elastic object simulator written in our language is 4.2x shorter than the hand-engineered CUDA version yet runs as fast, and is 188x faster than the TensorFlow implementation. Using our differentiable programs, neural network controllers are typically optimized within only tens of iterations.

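Motivation and Goals

Establish the need for high-performance differentiable physical simulators in machine learning and robotics.
Provide a language design that achieves differentiability while preserving arithmetic intensity and parallelism.
Demonstrate automatic differentiation, built on a two-scale system and a lightweight tape, across 10 simulators.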

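Proposed Method

Extend Taichi with a Python front end that compiles to Taichi IR, and differentiate the forward kernels via source code transformation.
Use two-scale AD: local AD inside each kernel via source code transformation (SCT), and global AD across kernels via a lightweight kernel tape.
Define global data access rules so that gradients remain well-defined under imperative, in-place computation.
Use megakernel fusion to raise arithmetic intensity and to map parallel loops efficiently onto CPUs and GPUs.
Provide the ti.complex_kernel decorator for customizing the gradients of complex kernels.
Demonstrate differentiable simulators for continuum mechanics, fluids, rigid bodies, and other domains, with performance benchmarks.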
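
The two-scale scheme listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DiffTaichi code: the `Tape` class, the `simulate` function, and the toy 1D spring are hypothetical, and the gradient kernels are written by hand here, whereas the review states that DiffTaichi generates them by source code transformation. The local scale is each kernel paired with a gradient kernel that differentiates its statements in reverse order; the global scale is a tape that records kernel launches and replays the gradient kernels in reverse.

```python
# Illustrative sketch only: plain Python stand-ins, not the DiffTaichi API.
STEPS, DT, K, TARGET = 50, 0.01, 4.0, 0.3  # hypothetical toy constants


class Tape:
    """Records one (gradient kernel, arguments) pair per forward launch."""

    def __init__(self):
        self.launches = []

    def launch(self, forward, backward, *args):
        forward(*args)                          # run the forward kernel now
        self.launches.append((backward, args))  # remember its gradient kernel

    def replay_gradients(self):
        # Global AD: run the gradient kernels in reverse launch order.
        for backward, args in reversed(self.launches):
            backward(*args)


def simulate(x0, v0):
    """Return (loss, dloss/dx0, dloss/dv0) for a 1D spring pulled toward 0."""
    # Every time step has its own slot, so no value needed by the
    # backward pass is ever overwritten.
    x = [0.0] * (STEPS + 1)
    v = [0.0] * (STEPS + 1)
    x_grad = [0.0] * (STEPS + 1)
    v_grad = [0.0] * (STEPS + 1)
    loss = [0.0]
    x[0], v[0] = x0, v0

    def advance(t):
        v[t + 1] = v[t] - DT * K * x[t]
        x[t + 1] = x[t] + DT * v[t + 1]

    def advance_grad(t):
        # Local AD: the two statements of advance(), differentiated
        # in reverse order, accumulating into the gradient arrays.
        x_grad[t] += x_grad[t + 1]
        v_grad[t + 1] += DT * x_grad[t + 1]
        v_grad[t] += v_grad[t + 1]
        x_grad[t] -= DT * K * v_grad[t + 1]

    def compute_loss():
        loss[0] = (x[STEPS] - TARGET) ** 2

    def compute_loss_grad():
        x_grad[STEPS] += 2.0 * (x[STEPS] - TARGET)

    tape = Tape()
    for t in range(STEPS):
        tape.launch(advance, advance_grad, t)
    tape.launch(compute_loss, compute_loss_grad)
    tape.replay_gradients()
    return loss[0], x_grad[0], v_grad[0]
```

Calling `simulate(1.0, 0.0)` returns the loss together with its gradients with respect to the initial position and velocity; these can be checked against central finite differences. Note that the tape stores only which kernels ran and with which arguments, while the simulation state lives in the per-step arrays.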

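Experimental Results

Research Questions

RQ1: How can differentiable physical simulators be built on modern hardware while retaining performance and parallelism?
RQ2: Does the two-scale AD approach (local AD via SCT inside kernels plus an end-to-end tape) deliver both speed and flexibility for complex simulators?
RQ3: Which design patterns and rules (such as the global data access rules) are needed to keep gradient propagation correct in imperative, high-performance simulation?
RQ4: How does DiffTaichi compare with existing differentiable programming tools in code size, speed, and scalability across different simulators?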

Key Findings

Approach	Forward Time	Backward Time	Total Time	# Lines of Code
TensorFlow	13.20 ms	35.70 ms	48.90 ms (188×)	190
CUDA	0.10 ms	0.14 ms	0.24 ms (0.92×)	460
DiffTaichi	0.11 ms	0.15 ms	0.26 ms (1.00×)	110
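
The ratios in parentheses can be reproduced from the raw timings in the table. The short script below does so, normalizing total time and code length to the DiffTaichi row; the helper names are hypothetical, and the numbers are copied from the table rather than measured.

```python
# Figures copied from the table above: (forward ms, backward ms, lines of code).
results = {
    "TensorFlow": (13.20, 35.70, 190),
    "CUDA": (0.10, 0.14, 460),
    "DiffTaichi": (0.11, 0.15, 110),
}


def total_ms(name):
    """Forward plus backward time for one approach, in milliseconds."""
    forward, backward, _ = results[name]
    return forward + backward


def relative_time(name, baseline="DiffTaichi"):
    """Total time as a multiple of the baseline's total time."""
    return total_ms(name) / total_ms(baseline)


def relative_code_size(name, baseline="DiffTaichi"):
    """Lines of code as a multiple of the baseline's lines of code."""
    return results[name][2] / results[baseline][2]


for name in results:
    print(f"{name:<11} total={total_ms(name):6.2f} ms  "
          f"time={relative_time(name):7.2f}x  "
          f"code={relative_code_size(name):.2f}x")
```

The TensorFlow row works out to roughly 188 times the DiffTaichi total, the CUDA row to roughly 0.92 times, and the CUDA code is roughly 4.2 times longer, matching the figures quoted in the review.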

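DiffTaichi substantially improves productivity when writing differentiable simulators: for example, a differentiable elastic object simulator is 4.2× shorter than the hand-engineered CUDA version (110 vs. 460 lines of code).
In the same elastic object example, the simulator runs about as fast as the hand-engineered CUDA version (0.26 ms vs. 0.24 ms in total) and 188× faster than the TensorFlow implementation.
Across the 10 simulators, DiffTaichi achieves high performance, with gradients generated efficiently by the two-scale AD system.
The lightweight tape records kernel launches and drives end-to-end backpropagation, avoiding large amounts of intermediate buffering.
Continuous collision handling based on TOI (time of impact) significantly improves gradient quality in controller optimization.
In representative tests, both code size and performance improve relative to TensorFlow, Autograd, PyTorch, and JAX.
In typical scenarios, learning-based controllers converge within tens of iterations.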
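
Why time of impact can matter for gradients is easiest to see on a toy case. The sketch below is an assumption-laden illustration, not the paper's rigid-body simulator: a point moves toward the ground at constant speed, with no gravity and a perfectly elastic bounce, and the function names, step size, and initial values are all hypothetical. Without TOI, the velocity is flipped at the end of the step in which penetration is detected; with TOI, the step is split at the moment of contact.

```python
def simulate(x0, use_toi, v0=-1.0, dt=0.1, steps=20):
    """Final height of a point bouncing elastically off the ground at x = 0."""
    x, v = x0, v0
    for _ in range(steps):
        x_next = x + dt * v
        if x_next < 0.0 and v < 0.0:      # penetration detected in this step
            if use_toi:
                toi = -x / v              # time within the step until contact
                v = -v                    # elastic bounce
                x_next = (dt - toi) * v   # move upward for the rest of the step
            else:
                v = -v                    # flip velocity, keep the penetrating position
        x = x_next
    return x


def final_height_gradient(x0, use_toi, eps=1e-4):
    """Central finite-difference estimate of d(final height) / d(initial height)."""
    upper = simulate(x0 + eps, use_toi)
    lower = simulate(x0 - eps, use_toi)
    return (upper - lower) / (2.0 * eps)


print(final_height_gradient(1.05, use_toi=False))
print(final_height_gradient(1.05, use_toi=True))
```

In this toy setting, raising the starting point should lower the final height after the bounce, so the physically meaningful gradient is close to -1. The TOI variant reproduces that sign, while the naive variant yields a gradient close to +1, because the step in which the collision is detected does not depend on small changes to the initial height.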


This review was generated by AI and reviewed by a human editor.