QUICK REVIEW

[Paper Review] TeHOR: Text-Guided 3D Human and Object Reconstruction with Textures

Hyeongjin Nam, Daniel Sungho Jung|arXiv (Cornell University)|Feb 23, 2026

Multimodal Machine Learning Applications | Cited by 0

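ONE-SENTENCE SUMMARY

TeHOR jointly reconstructs a textured 3D human and object from a single image, using text descriptions of the interaction to guide semantic alignment and supply appearance priors; it supports reasoning about non-contact interactions and reports state-of-the-art results.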

ABSTRACT

Joint reconstruction of 3D human and object from a single image is an active research area, with pivotal applications in robotics and digital content creation. Despite recent advances, existing approaches suffer from two fundamental limitations. First, their reconstructions rely heavily on physical contact information, which inherently cannot capture non-contact human-object interactions, such as gazing at or pointing toward an object. Second, the reconstruction process is primarily driven by local geometric proximity, neglecting the human and object appearances that provide global context crucial for understanding holistic interactions. To address these issues, we introduce TeHOR, a framework built upon two core designs. First, beyond contact information, our framework leverages text descriptions of human-object interactions to enforce semantic alignment between the 3D reconstruction and its textual cues, enabling reasoning over a wider spectrum of interactions, including non-contact cases. Second, we incorporate appearance cues of the 3D human and object into the alignment process to capture holistic contextual information, thereby ensuring visually plausible reconstructions. As a result, our framework produces accurate and semantically coherent reconstructions, achieving state-of-the-art performance.

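RESEARCH MOTIVATION AND GOALS

Move robust 3D human-object reconstruction beyond physical contact cues so that it handles non-contact interactions such as gazing or pointing.
Incorporate holistic appearance cues to capture global context and visual plausibility in human-object interaction (HOI) reconstruction.
Use text descriptions to semantically guide and constrain the joint 3D reconstruction and its text-conditioned appearance.
Produce textured 3D reconstructions of the human and object (represented as 3D Gaussians) that can serve as realistic assets for AR/VR and robotics applications.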

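PROPOSED METHOD

Represent the 3D human and object as a set of 3D Gaussians carrying geometric and appearance attributes.
Project the Gaussians to 2D with a differentiable renderer (Mip-Splatting) for optimization.
Extract holistic and contact-focused text prompts from the image with a vision-language model (e.g., GPT-4).
Jointly optimize geometry and texture by minimizing a reconstruction loss, an appearance loss obtained through score distillation from a text-conditioned diffusion prior, a contact loss on designated contact regions, and a collision penalty.
Use guidance from the text-conditioned diffusion prior (e.g., Stable Diffusion) to align the rendered appearance with the text description across multiple views.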
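
The joint objective described above can be sketched as a weighted sum of four terms. The following is a minimal illustration under stated assumptions, not the authors' implementation: the field names, the `LossWeights` defaults, the point-to-nearest-point contact term, and the sphere used as a stand-in for the object in the collision term are all hypothetical. The reconstruction and appearance (score distillation) terms are passed in as precomputed scalars because they depend on a renderer and a diffusion model that are out of scope here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class Gaussian3D:
    """One 3D Gaussian with geometric and appearance attributes (names are illustrative)."""
    mean: Point                                  # center position
    scale: Point                                 # per-axis extent
    rotation: Tuple[float, float, float, float]  # unit quaternion (w, x, y, z)
    color: Tuple[float, float, float]            # RGB appearance
    opacity: float


@dataclass
class LossWeights:
    """Hypothetical weights; the source does not state the values used."""
    reconstruction: float = 1.0
    appearance: float = 1.0
    contact: float = 1.0
    collision: float = 1.0


def contact_loss(human_pts: Sequence[Point], object_pts: Sequence[Point]) -> float:
    """Mean distance from each designated human contact point to its nearest object point."""
    if not human_pts or not object_pts:
        return 0.0
    nearest = (min(math.dist(h, o) for o in object_pts) for h in human_pts)
    return sum(nearest) / len(human_pts)


def collision_penalty(human_pts: Sequence[Point], center: Point, radius: float) -> float:
    """Mean penetration depth of human points into a sphere standing in for the object."""
    if not human_pts:
        return 0.0
    depths = (max(0.0, radius - math.dist(p, center)) for p in human_pts)
    return sum(depths) / len(human_pts)


def total_loss(reconstruction: float, appearance: float, contact: float,
               collision: float, weights: Optional[LossWeights] = None) -> float:
    """Weighted sum of the four terms named in the method description."""
    w = weights or LossWeights()
    return (w.reconstruction * reconstruction
            + w.appearance * appearance
            + w.contact * contact
            + w.collision * collision)
```

In this sketch the contact term pulls designated regions together while the collision term only activates on penetration, so the two can be balanced through the weights. A real pipeline would compute these terms on differentiable tensors so that gradients reach the Gaussian parameters.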

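EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can text descriptions of human-object interactions guide accurate 3D reconstruction beyond contact cues?
RQ2: Does integrating holistic appearance cues through a diffusion prior improve texture quality and spatial alignment in HOI scenes?
RQ3: Does joint optimization yield reconstructions that are semantically consistent across views and robust in non-contact cases?
RQ4: Do textured Gaussian representations offer advantages over mesh representations for open-vocabulary HOI reconstruction?
RQ5: How does text-guided optimization compare with contact-based methods in non-contact scenarios?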

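KEY FINDINGS

Achieves state-of-the-art performance on the Open3DHOI and BEHAVE datasets in both contact and non-contact scenarios.
Text-guided appearance supervision through a diffusion prior improves object Chamfer distance and contact F1 score over baselines that rely only on CLIP or purely geometric cues.
A 3D Gaussian representation with appearance priors outperforms mesh-based representations in reconstruction accuracy and optimization behavior.
Incorporating the 2D background and text conditioning makes HOI reconstruction more accurate and globally consistent.
Joint optimization with text guidance achieves better semantic alignment between the 3D reconstruction and the description than contact-only methods.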
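
For readers unfamiliar with the two metrics named in the findings, here is a minimal sketch of how Chamfer distance and a contact F1 score are commonly computed. It is an assumption-laden illustration: the source does not specify the exact variants used (for example, squared versus unsquared distances, or how contact labels are defined), and the function names are hypothetical.

```python
import math
from typing import Sequence, Set, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor distance a->b plus b->a.

    Uses unsquared Euclidean distances; other conventions exist.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def contact_f1(predicted: Set[int], ground_truth: Set[int]) -> float:
    """F1 score over sets of vertex (or point) indices labeled as in contact."""
    true_positives = len(predicted & ground_truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
```

Lower Chamfer distance means the reconstructed surface lies closer to the ground truth, while higher contact F1 means the predicted contact regions overlap better with the annotated ones.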


This review was generated by AI and reviewed by a human editor.