QUICK REVIEW

[Paper Review] GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Zexuan Yan, Jiarui Jin | arXiv (Cornell University) | 2026-03-12

Generative Adversarial Networks and Image Synthesis | Citations: 0

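One-Line Summary

GlyphBanana introduces a training-free agentic workflow that integrates glyph templates into both the latent space and the attention space, enabling accurate rendering of complex text and formulas in diffusion-based image generation. It also presents GlyphBanana-Bench, a benchmark that evaluates rendering across multiple languages, from simple words to multi-line formulas.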

ABSTRACT

Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.

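Motivation and Objectives

Motivates the need for reliable rendering of rare characters and complex formulas in text-to-image generation.
Proposes a training-free agentic pipeline that fuses system-font glyphs with diffusion models for accurate rendering.
Enables autonomous adaptation to arbitrary styles without manual design intervention.
Introduces GlyphBanana-Bench to evaluate text rendering across a spectrum from simple words to complex multi-line formulas.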

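Proposed Method

A four-stage agentic workflow: Extraction, Draft Preview, Glyph Injection, and Style Refinement.
Glyph Injection combines frequency decomposition in the latent space with attention re-weighting to inject glyph templates into the DiT blocks.
Frequency Decomposition blends low- and high-frequency components through a mask to inject high-frequency glyph details.
Attention Re-weighting introduces a bias matrix into DiT self-attention to bias specific tokens toward the glyph template.
Iterative Refinement uses a Style Refiner and a Score Judger with a diffusion-based image-to-image model to improve quality and harmony.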
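
The Frequency Decomposition and Attention Re-weighting steps above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the FFT-based low-pass filter, the `cutoff` value, the blending formula, and the additive form of the attention bias are all assumptions made for illustration. The review only states that low- and high-frequency components are mixed through a mask and that a bias matrix is introduced into self-attention.

```python
import numpy as np

def low_pass(x, cutoff):
    """Keep only spatial frequencies within `cutoff` (cycles/sample) of DC.

    Assumed filter: an ideal radial low-pass applied per channel via FFT.
    """
    h, w = x.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    keep = (np.sqrt(fy**2 + fx**2) <= cutoff).astype(x.dtype)
    spectrum = np.fft.fft2(x, axes=(-2, -1)) * keep
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real

def inject_glyph_frequencies(latent, glyph_latent, mask, cutoff=0.1):
    """Hypothetical blend: inside the mask, keep the latent's low frequencies
    (layout, colour) and take high frequencies (stroke detail) from the glyph.
    Outside the mask, the latent is left unchanged.
    """
    low = low_pass(latent, cutoff)
    high = glyph_latent - low_pass(glyph_latent, cutoff)
    return (1.0 - mask) * latent + mask * (low + high)

def biased_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    Assumption: positive entries in `bias` pull a query token toward the
    corresponding key token (e.g. a glyph-template token).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

In this sketch `latent` and `glyph_latent` have shape (C, H, W), `mask` has shape (H, W) with values in [0, 1], and `bias` has shape (num_queries, num_keys). An additive logit bias is only one way to re-weight attention; the paper may use a different formulation.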

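Experimental Results

Research Questions

RQ1: Can a training-free agentic workflow improve OCR accuracy and visual fidelity when rendering complex text and formulas with diffusion models?
RQ2: How do latent-space frequency decomposition and attention re-weighting contribute to accurate glyph rendering across languages and styles?
RQ3: Does the GlyphBanana pipeline generalize to different DiT backbones without fine-tuning?
RQ4: What impact does the new GlyphBanana-Bench have on evaluating OOV text and complex formulas in T2I systems?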

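Key Results

GlyphBanana achieves higher OCR and VLM-based metrics for rendered text on GlyphBanana-Bench than the baselines.
Ablation studies show that Frequency Decomposition, Injection, and Iterative Refinement each contribute to improved text precision and visual harmony.
The method integrates with multiple diffusion backbones without training and outperforms baselines in rendering precision and style fidelity.
GlyphBanana improves OCR scores by a substantial margin on the Z-Image and Qwen-Image backbones.
Iterative refinement consistently improves visual quality across experiments while preserving text accuracy.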

This review was generated by AI and reviewed by a human editor.