QUICK REVIEW

[Paper Review] When Text-as-Vision Meets Semantic IDs in Generative Recommendation: An Empirical Study

Shutong Qiao, Wei Yuan|arXiv (Cornell University)|Jan 21, 2026

Recommender Systems and Techniques | Cited by 0

One-Sentence Summary

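This study replaces standard text embeddings with OCR-based visual text representations when learning Semantic IDs for generative recommendation. In both unimodal and multimodal settings the OCR-based representations match or improve on the standard ones, with the largest gains on attribute-rich descriptions.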

ABSTRACT

Semantic ID learning is a key interface in Generative Recommendation (GR) models, mapping items to discrete identifiers grounded in side information, most commonly via a pretrained text encoder. However, these text encoders are primarily optimized for well-formed natural language. In real-world recommendation data, item descriptions are often symbolic and attribute-centric, containing numerals, units, and abbreviations. These text encoders can break these signals into fragmented tokens, weakening semantic coherence and distorting relationships among attributes. Worse still, when moving to multimodal GR, relying on standard text encoders introduces an additional obstacle: text and image embeddings often exhibit mismatched geometric structures, making cross-modal fusion less effective and less stable. In this paper, we revisit representation design for Semantic ID learning by treating text as a visual signal. We conduct a systematic empirical study of OCR-based text representations, obtained by rendering item descriptions into images and encoding them with vision-based OCR models. Experiments across four datasets and two generative backbones show that OCR-text consistently matches or surpasses standard text embeddings for Semantic ID learning in both unimodal and multimodal settings. Furthermore, we find that OCR-based Semantic IDs remain robust under extreme spatial-resolution compression, indicating strong robustness and efficiency in practical deployments.

Research Motivation and Objectives

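Examine the motivation for treating text as a visual representation in Semantic ID learning for GR.
Quantitatively compare OCR-based text representations with standard text embeddings in unimodal and multimodal settings.
Assess how robust OCR-based Semantic IDs are to the choice of OCR encoder and to rendering quality.
Analyze fusion strategies for constructing multimodal Semantic IDs from OCR-based representations.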

Proposed Method

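Render each item's text description as an image and encode it with an OCR model to obtain an OCR-text embedding.
Integrate the OCR-text embeddings into Semantic ID learning in both unimodal and multimodal GR pipelines.
Compare OCR-text against standard text embeddings with the TIGER and LETTER backbones under early-fusion and late-fusion schemes.
Evaluate Recall@K and NDCG@K on four datasets using leave-one-out sequential recommendation.
Evaluate robustness by varying the OCR encoder and the resolution of the rendered images.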
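The evaluation protocol above can be made concrete with a small sketch. It assumes the usual leave-one-out convention of one held-out item per user, in which case Recall@K reduces to a hit indicator and NDCG@K to 1 / log2(rank + 2). The function names and the toy score dictionaries are illustrative assumptions, not taken from the paper. A generative recommender would typically produce the ranking by beam search over Semantic IDs rather than by scoring every item.

```python
import math

def rank_of_target(scores, target):
    """Return the 0-based rank of `target` when items are sorted by descending score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(target)

def recall_at_k(rank, k):
    # With a single held-out item per user, Recall@K is 1 on a hit and 0 otherwise.
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k):
    # With a single relevant item the ideal DCG is 1, so NDCG is just the discounted gain.
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

def evaluate(users, k):
    """Average Recall@K and NDCG@K over (scores, held_out_item) pairs."""
    ranks = [rank_of_target(scores, target) for scores, target in users]
    n = len(ranks)
    recall = sum(recall_at_k(r, k) for r in ranks) / n
    ndcg = sum(ndcg_at_k(r, k) for r in ranks) / n
    return recall, ndcg

# Hypothetical toy data: per-user item scores and the held-out last interaction.
users = [
    ({"a": 0.9, "b": 0.5, "c": 0.1}, "a"),  # target ranked first
    ({"a": 0.2, "b": 0.7, "c": 0.6}, "c"),  # target ranked second
    ({"a": 0.8, "b": 0.3, "c": 0.1}, "c"),  # target ranked third, a miss at K=2
]
recall, ndcg = evaluate(users, k=2)
print(f"Recall@2={recall:.4f}  NDCG@2={ndcg:.4f}")
```

Both metrics are averaged over users, so a dataset-level number hides how the hits are distributed. NDCG@K additionally rewards placing the held-out item near the top of the list.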

Figure 1. Embedding geometry across modalities. We project three item representations into a shared 2D space: Item image emb, extracted from each item’s photos; OCR-based text emb, extracted by rendering the item’s textual description into an image and encoding it with an OCR model; and Standard t

Experimental Results

Research Questions

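RQ1: Can OCR-based text representations replace standard text representations for unimodal Semantic ID learning?
RQ2: Can OCR-based text representations replace standard text representations for multimodal Semantic ID learning?
RQ3: How robust are OCR-based Semantic IDs to changes in the OCR encoder and in rendering quality?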

Key Findings

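In unimodal Semantic ID learning, OCR-text often matches or exceeds standard text embeddings, with larger gains on attribute-dense datasets.
Under multimodal early fusion, OCR-text consistently improves performance on Scientific and Instruments, with moderate gains on Luxury and smaller gains on Arts.
Under late fusion, OCR-text remains a viable drop-in replacement and yields stable gains on most datasets and metrics.
OCR-text is robust to reduced rendered-image resolution and to the choice of OCR encoder (DeepSeek-OCR, Donut-base, TrOCR-base).
Per-dataset analysis shows larger gains on datasets with dense, attribute-style descriptions and smaller gains on datasets with narrative-style descriptions.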
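This summary does not spell out how early and late fusion are implemented, so the following is only a sketch under stated assumptions. It reads early fusion as concatenating the text and image embeddings before a single quantization, and late fusion as quantizing each modality separately and concatenating the resulting codes. Greedy residual quantization over fixed, hand-picked codebooks stands in for the learned quantizer; all function names, embeddings, and codebooks are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale v to unit length so neither modality dominates after concatenation."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(code, v)) for code in codebook]
    return dists.index(min(dists))

def residual_quantize(v, codebooks):
    """Greedy residual quantization: each level encodes what the previous level left over."""
    residual, codes = list(v), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)

def early_fusion_id(text_emb, image_emb, fused_codebooks):
    # Assumed early fusion: concatenate normalized embeddings, then quantize once.
    fused = l2_normalize(text_emb) + l2_normalize(image_emb)
    return residual_quantize(fused, fused_codebooks)

def late_fusion_id(text_emb, image_emb, text_codebooks, image_codebooks):
    # Assumed late fusion: quantize each modality on its own, then join the codes.
    return (residual_quantize(text_emb, text_codebooks)
            + residual_quantize(image_emb, image_codebooks))

# Hypothetical 2-D embeddings for one item (OCR-text and item image).
text_emb, image_emb = [3.0, 4.0], [0.0, 2.0]

# Hypothetical two-level codebooks; real ones would be learned from data.
fused_codebooks = [
    [[1.0, 0.0, 1.0, 0.0], [0.5, 1.0, 0.0, 1.0]],
    [[0.0, 0.0, 0.0, 0.0], [0.1, -0.2, 0.0, 0.0]],
]
text_codebooks = [[[0.0, 0.0], [3.0, 3.0]], [[0.0, 1.0], [1.0, 0.0]]]
image_codebooks = [[[0.0, 2.0], [2.0, 0.0]], [[1.0, 1.0], [0.0, 0.0]]]

early_id = early_fusion_id(text_emb, image_emb, fused_codebooks)
late_id = late_fusion_id(text_emb, image_emb, text_codebooks, image_codebooks)
print("early:", early_id, "late:", late_id)
```

Under this reading, early fusion depends on the two embedding spaces being compatible, because a single codebook must cover the concatenated vector. Late fusion sidesteps that at the cost of a longer identifier.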

Figure 2. Conceptual illustration of representation spaces induced by different encoders.


This review was generated by AI and reviewed by human editors.