QUICK REVIEW

[Paper Review] When Text-as-Vision Meets Semantic IDs in Generative Recommendation: An Empirical Study

Shutong Qiao, Wei Yuan | arXiv (Cornell University) | 2026-01-21

Recommender Systems and Techniques | Citations: 0

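One-Line Summary

This study replaces standard text embeddings with OCR-based visual text representations when learning Semantic IDs for Generative Recommendation, and finds that they match or surpass standard embeddings in both unimodal and multimodal settings, with the strongest gains on attribute-centric descriptions.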

ABSTRACT

Semantic ID learning is a key interface in Generative Recommendation (GR) models, mapping items to discrete identifiers grounded in side information, most commonly via a pretrained text encoder. However, these text encoders are primarily optimized for well-formed natural language. In real-world recommendation data, item descriptions are often symbolic and attribute-centric, containing numerals, units, and abbreviations. These text encoders can break these signals into fragmented tokens, weakening semantic coherence and distorting relationships among attributes. Worse still, when moving to multimodal GR, relying on standard text encoders introduces an additional obstacle: text and image embeddings often exhibit mismatched geometric structures, making cross-modal fusion less effective and less stable. In this paper, we revisit representation design for Semantic ID learning by treating text as a visual signal. We conduct a systematic empirical study of OCR-based text representations, obtained by rendering item descriptions into images and encoding them with vision-based OCR models. Experiments across four datasets and two generative backbones show that OCR-text consistently matches or surpasses standard text embeddings for Semantic ID learning in both unimodal and multimodal settings. Furthermore, we find that OCR-based Semantic IDs remain robust under extreme spatial-resolution compression, indicating strong robustness and efficiency in practical deployments.

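Motivation and Objectives

Motivate and evaluate text-as-vision representations for Semantic ID learning in GR.
Quantitatively compare OCR-based text representations with standard text embeddings in unimodal and multimodal settings.
Assess the robustness of OCR-based Semantic IDs to the choice of OCR encoder and to rendering quality.
Analyze fusion strategies for constructing multimodal Semantic IDs under OCR-based representations.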

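Proposed Method

Render each item's textual description into an image and encode it with an OCR model to obtain an OCR-text embedding.
Integrate OCR-text embeddings into Semantic ID learning across both unimodal and multimodal GR pipelines.
Compare OCR-text with standard text embeddings using the TIGER and LETTER backbones and early/late fusion schemes.
Evaluate Recall@K and NDCG@K on four datasets under leave-one-out sequential recommendation.
Assess robustness by varying the OCR encoder and the resolution of the rendered images.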
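The leave-one-out evaluation mentioned above can be illustrated with a minimal sketch. It assumes the usual convention that each user has exactly one held-out target item, in which case the ideal DCG is 1 and NDCG@K reduces to 1 / log2(rank + 1). The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math

def recall_at_k(ranked_items, target, k):
    """Return 1.0 if the held-out item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """NDCG@K with a single relevant item: 1 / log2(rank + 1), or 0 on a miss."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position of the held-out item
    return 1.0 / math.log2(rank + 1)

def evaluate(predictions, held_out, k):
    """Average Recall@K and NDCG@K over all users with a held-out item."""
    users = list(held_out)
    recall = sum(recall_at_k(predictions[u], held_out[u], k) for u in users) / len(users)
    ndcg = sum(ndcg_at_k(predictions[u], held_out[u], k) for u in users) / len(users)
    return recall, ndcg

# Hypothetical toy data: ranked predictions per user and the single held-out item.
predictions = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
held_out = {"u1": "b", "u2": "z"}

recall, ndcg = evaluate(predictions, held_out, k=3)
print(f"Recall@3={recall:.4f}  NDCG@3={ndcg:.4f}")
```

Here user u1 is a hit at rank 2 and user u2 is a miss, so Recall@3 is 0.5 and NDCG@3 is half of 1 / log2(3).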

Figure 1. Embedding geometry across modalities. We project three item representations into a shared 2D space: Item image emb, extracted from each item’s photos; OCR-based text emb, extracted by rendering the item’s textual description into an image and encoding it with an OCR model; and Standard t

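Experimental Results

Research Questions

RQ1: Can OCR-based text representations replace standard text representations in unimodal Semantic ID learning?
RQ2: Can OCR-based text representations replace standard text representations in multimodal Semantic ID learning?
RQ3: How robust are OCR-based Semantic IDs to changes in the OCR encoder and in rendering quality?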

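Key Results

In unimodal Semantic ID learning, OCR-text often matches or exceeds standard text embeddings, with larger gains on attribute-dense datasets.
Under multimodal early fusion, OCR-text consistently improves performance on Scientific and Instruments, while the gain is smaller on Arts and moderate on Luxury.
Under late fusion, OCR-text remains a viable replacement and often provides consistent gains, depending on the dataset and metric.
OCR-text remains robust as the rendered image resolution decreases, and across different OCR encoders such as DeepSeek-OCR, Donut-base, and TrOCR-base.
Per-dataset analysis shows larger gains on datasets dense with attribute-style descriptions and smaller gains on datasets with narrative descriptions.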
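The difference between early and late fusion in the results above can be made concrete with a small sketch. This review does not describe how the paper implements either scheme, so the code below is only one common reading, stated as an assumption: early fusion concatenates the text and image embeddings and quantizes them once, while late fusion quantizes each modality separately and concatenates the resulting code sequences. The residual quantizer, the hand-written codebooks, and all function names are hypothetical; in practice codebooks would be learned rather than fixed.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec under squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def residual_quantize(vec, codebooks):
    """Pick one code per level, subtracting the chosen codeword before the next level."""
    codes, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def early_fusion_id(text_emb, image_emb, joint_codebooks):
    """Assumed early fusion: concatenate the embeddings, then quantize once."""
    return residual_quantize(text_emb + image_emb, joint_codebooks)

def late_fusion_id(text_emb, image_emb, text_codebooks, image_codebooks):
    """Assumed late fusion: quantize per modality, then concatenate the codes."""
    return (residual_quantize(text_emb, text_codebooks)
            + residual_quantize(image_emb, image_codebooks))

# Hypothetical 2-D embeddings and two-level codebooks, chosen by hand for illustration.
text_emb = [0.9, 0.1]    # stands in for an OCR-text (or standard text) embedding
image_emb = [0.2, 0.8]   # stands in for an item image embedding

text_codebooks = [[[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.1]]]
image_codebooks = [[[0.0, 0.0], [0.0, 1.0]], [[0.2, -0.2], [0.0, 0.0]]]
joint_codebooks = [
    [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
    [[0.0, 0.0, 0.0, 0.0], [-0.1, 0.1, 0.2, -0.2]],
]

early_id = early_fusion_id(text_emb, image_emb, joint_codebooks)
late_id = late_fusion_id(text_emb, image_emb, text_codebooks, image_codebooks)
print("early fusion ID:", early_id)  # [0, 1]
print("late fusion ID:", late_id)    # [1, 1, 1, 0]
```

Under this reading, early fusion is sensitive to how well the two embedding spaces align before quantization, which is the kind of geometric mismatch the abstract raises, whereas late fusion keeps the modalities apart until the code level.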

Figure 2. Conceptual illustration of representation spaces induced by different encoders.

This review was generated by AI and checked by a human editor.