QUICK REVIEW

[Paper Review] Towards Gold-Standard Depth Estimation for Tree Branches in UAV Forestry: Benchmarking Deep Stereo Matching Methods

Yida Lin, Bing Xue|arXiv (Cornell University)|Jan 27, 2026

Advanced Vision and Imaging | Cited by 0

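ONE-SENTENCE SUMMARY

The paper systematically evaluates eight deep stereo matching methods (zero-shot, pretrained on Scene Flow) on standard benchmarks and on a novel Tree Branches UAV forestry dataset, in order to identify robust methods and establish DEFOM as the gold-standard baseline for vegetation depth estimation.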

ABSTRACT

Autonomous UAV forestry operations require robust depth estimation with strong cross-domain generalization, yet existing evaluations focus on urban and indoor scenarios, leaving a critical gap for vegetation-dense environments. We present the first systematic zero-shot evaluation of eight stereo methods spanning iterative refinement, foundation model, diffusion-based, and 3D CNN paradigms. All methods use officially released pretrained weights (trained on Scene Flow) and are evaluated on four standard benchmarks (ETH3D, KITTI 2012/2015, Middlebury) plus a novel 5,313-pair Canterbury Tree Branches dataset ($1920 \times 1080$). Results reveal scene-dependent patterns: foundation models excel on structured scenes (BridgeDepth: 0.23 px on ETH3D; DEFOM: 4.65 px on Middlebury), while iterative methods show variable cross-benchmark performance (IGEV++: 0.36 px on ETH3D but 6.77 px on Middlebury; IGEV: 0.33 px on ETH3D but 4.99 px on Middlebury). Qualitative evaluation on the Tree Branches dataset establishes DEFOM as the gold-standard baseline for vegetation depth estimation, with superior cross-domain consistency (consistently ranking 1st-2nd across benchmarks, average rank 1.75). DEFOM predictions will serve as pseudo-ground-truth for future benchmarking.

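RESEARCH MOTIVATION AND GOALS

Autonomous UAV pruning in forestry requires centimeter-level depth accuracy, which motivates the study.
Evaluate the cross-domain generalization of eight deep stereo methods under zero-shot conditions.
Identify a robust method for generating pseudo-ground-truth for vegetation depth benchmarks.
Establish DEFOM as the gold-standard baseline for forestry depth estimation datasets.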
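
The link between the pixel-level disparity errors that stereo benchmarks report and the centimeter-level depth accuracy named in the motivation follows from standard rectified-stereo geometry. As a sketch, with focal length $f$ (in pixels), baseline $B$ (in meters), and disparity $d$ (in pixels), none of which are specified in this summary:

$$Z = \frac{fB}{d}, \qquad |\delta Z| \approx \frac{Z^{2}}{fB}\,|\delta d|$$

The second expression is the first-order sensitivity of depth $Z$ to a disparity error $\delta d$. Because it grows with $Z^{2}$, the same sub-pixel disparity error costs more depth accuracy on distant branches than on near ones.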

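PROPOSED METHOD

Evaluate eight deep stereo methods spanning iterative refinement, foundation model, diffusion-based, and 3D CNN architectures, all using weights pretrained on Scene Flow.
Run zero-shot inference on KITTI 2012/2015, ETH3D, Middlebury, and the new Tree Branches dataset (Canterbury, New Zealand).
Compare accuracy and failure rates using EPE (end-point error, the mean absolute disparity error in pixels) and D1 (the percentage of outlier pixels).
Select the most robust method, based on cross-domain consistency, to generate pseudo-ground-truth for vegetation scenes.
Qualitatively analyze depth maps of vegetation scenes to assess their suitability for UAV pruning tasks.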
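
The two metrics named in the list above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes the KITTI 2015 outlier convention for D1 (an error above 3 px and above 5% of the true disparity), and the function name `epe_and_d1`, its parameters, and the use of non-positive ground truth to mark invalid pixels are all assumptions made for the example.

```python
def epe_and_d1(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Return (EPE in pixels, D1 in percent) over valid ground-truth pixels.

    pred, gt: same-shaped 2D lists of disparities in pixels.
    Assumption: gt <= 0 marks a pixel with no ground truth (skipped).
    Assumption: D1 follows the KITTI 2015 convention, counting a pixel as an
    outlier when its error exceeds BOTH abs_thresh and rel_thresh * gt.
    """
    total_err = 0.0
    outliers = 0
    valid = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g <= 0:
                continue  # no ground truth at this pixel
            err = abs(p - g)
            total_err += err
            if err > abs_thresh and err > rel_thresh * g:
                outliers += 1
            valid += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return total_err / valid, 100.0 * outliers / valid


# Toy 2x2 disparity maps (made-up numbers, not from any benchmark).
gt = [[10.0, 20.0], [0.0, 100.0]]
pred = [[10.5, 24.0], [5.0, 104.0]]
epe, d1 = epe_and_d1(pred, gt)
```

In the toy input, the 4 px error at true disparity 100 is not an outlier because it stays under 5% of the true value, while the same 4 px error at true disparity 20 is. A method can therefore combine a moderate EPE with a low D1, or the reverse, which is why the two metrics are reported together.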

Figure 1 : Initial screening of 20 stereo matching methods using officially released pretrained weights on KITTI 2015 (D1-all %) and Middlebury (Average Absolute Error, pixels). Foundation models (DEFOM: 0.79% D1, BridgeDepth: 1.01% D1) dominate KITTI 2015, while iterative methods (IGEV++: 0.97 px A

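EXPERIMENTAL RESULTS

Research Questions

RQ1: How well do different deep stereo paradigms (iterative, foundation model, diffusion-based, 3D CNN) generalize zero-shot to vegetation-dense environments?
RQ2: Which method has the most robust cross-domain performance on forestry-like scenes without fine-tuning?
RQ3: Can a foundation-model-based method serve as the gold standard for pseudo-ground-truth in tree-branch depth estimation?
RQ4: What trade-offs between accuracy (EPE) and failure rate (D1) appear across the benchmarks for forestry applications?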

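Key Findings

Foundation-model methods (DEFOM, BridgeDepth) stand out for cross-domain consistency and rank among the top methods across benchmarks.
DEFOM delivers balanced performance on KITTI 2015 and Middlebury, generally ranking 1st or 2nd, and is highly consistent across all benchmarks (average rank 1.75).
BridgeDepth excels on ETH3D and KITTI but breaks down on Middlebury because of its large disparities, indicating limited generalization to extreme disparity ranges.
Iterative methods (RAFT-Stereo, IGEV, IGEV++) give solid but uneven cross-domain results; IGEV++ achieves the best D1 on Middlebury but is not superior across the board.
Classic 3D CNNs (ACVNet, PSMNet) generalize poorly across domains and fail catastrophically, underscoring the need for modern architectures in cross-domain forestry tasks.
DEFOM is selected as the gold-standard baseline for the Tree Branches dataset, providing pseudo-ground-truth for benchmarking where no LiDAR is available.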
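
The average-rank figure quoted for DEFOM can be read as a mean of per-benchmark ranks. The sketch below shows that computation under stated assumptions: the function `average_ranks`, the benchmark and method names, and every error value are hypothetical placeholders, not results from the paper, and the summary does not say how the paper handles ties.

```python
def average_ranks(scores):
    """Mean rank of each method across benchmarks (rank 1 = lowest error).

    scores: {benchmark: {method: error}}, where lower error is better.
    Assumption: tied methods share the better rank.
    """
    ranks = {}
    for per_method in scores.values():
        for method, err in per_method.items():
            # Rank = 1 + number of methods with a strictly lower error.
            rank = 1 + sum(1 for other in per_method.values() if other < err)
            ranks.setdefault(method, []).append(rank)
    return {m: sum(r) / len(r) for m, r in ranks.items()}


# Placeholder errors for four hypothetical methods, NOT values from the paper.
example = {
    "bench_1": {"method_a": 0.30, "method_b": 0.50, "method_c": 0.40, "method_d": 0.60},
    "bench_2": {"method_a": 1.10, "method_b": 1.00, "method_c": 1.50, "method_d": 1.30},
    "bench_3": {"method_a": 1.00, "method_b": 1.40, "method_c": 0.90, "method_d": 1.20},
    "bench_4": {"method_a": 4.70, "method_b": 5.50, "method_c": 6.00, "method_d": 4.60},
}
ranks = average_ranks(example)
best = min(ranks, key=ranks.get)
```

Here `method_a` is first on one benchmark and second on the other three, giving (1 + 2 + 2 + 2) / 4 = 1.75. It never wins more than once, yet it has the best average because every competitor drops to 3rd or 4th somewhere. This is the kind of cross-domain consistency the review credits to DEFOM.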


This review was generated by AI and checked by a human editor.