QUICK REVIEW

[Paper Review] AnyText: Multilingual Visual Text Generation And Editing

Yuxiang Tuo, Wangmeng Xiang | arXiv (Cornell University) | 2023-11-06

Computer Graphics and Visualization Techniques | Citations: 8

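One-Line Summary

AnyText is a diffusion-based framework for multilingual visual text generation and editing that renders legible text in images using an auxiliary latent module and an OCR-informed text embedding module, and it introduces the AnyWord-3M dataset and AnyText-benchmark.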

ABSTRACT

Diffusion model based Text-to-Image has achieved impressive achievements recently. Although current technology for synthesizing images is highly advanced and capable of generating images with high fidelity, it is still possible to give the show away when focusing on the text area in the generated image. To address this issue, we introduce AnyText, a diffusion-based multilingual visual text generation and editing model, that focuses on rendering accurate and coherent text in the image. AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. We employed text-control diffusion loss and text perceptual loss for training to further enhance writing accuracy. AnyText can write characters in multiple languages, to the best of our knowledge, this is the first work to address multilingual visual text generation. It is worth mentioning that AnyText can be plugged into existing diffusion models from the community for rendering or editing text accurately. After conducting extensive evaluation experiments, our method has outperformed all other approaches by a significant margin. Additionally, we contribute the first large-scale multilingual text images dataset, AnyWord-3M, containing 3 million image-text pairs with OCR annotations in multiple languages. Based on AnyWord-3M dataset, we propose AnyText-benchmark for the evaluation of visual text generation accuracy and quality. Our project will be open-sourced on https://github.com/tyxsspa/AnyText to improve and promote the development of text generation technology.

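Motivation and Objectives

Address the challenge of rendering legible, accurate text in diffusion-based image generation across multiple languages.
Propose a diffusion-based pipeline, with an auxiliary latent module and an OCR-informed text embedding module, that renders and edits text in images while matching the background style.
Introduce AnyWord-3M, a large-scale multilingual text-image dataset, and AnyText-benchmark, an evaluation benchmark.
Demonstrate better text accuracy and image realism than existing methods in multilingual text generation.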

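Proposed Method

Proposes a text-control diffusion pipeline with two conditioning components: an auxiliary latent module and a text embedding module.
The auxiliary latent module encodes the text glyph, position, and masked image inputs into a latent feature map.
The text embedding module renders glyphs and encodes their stroke information with an OCR-based encoder (PP-OCRv3), then fuses the result with the caption embeddings through cross-attention and a transformer.
Training uses a text-control diffusion loss and a text perceptual loss to improve writing accuracy in the target text regions.
Binding TextControlNet, which focuses on text generation while preserving the base model's capabilities, enables plug-and-play compatibility with existing diffusion models.
Presents AnyWord-3M, a multilingual dataset of 3 million image-text pairs with OCR annotations, and AnyText-benchmark for evaluating visual text generation.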
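
The training objective described above combines two terms. As a rough sketch, and assuming a simple weighted sum, it could be written as follows. The weight $\lambda$ and all symbols here are illustrative notation introduced for this note, not values or definitions stated in this review.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{td}} + \lambda \, \mathcal{L}_{\mathrm{tp}},
\qquad
\mathcal{L}_{\mathrm{td}} =
\mathbb{E}_{z_0,\, z_a,\, c_{te},\, t,\, \epsilon \sim \mathcal{N}(0, 1)}
\left[ \left\| \epsilon - \epsilon_\theta(z_t, z_a, c_{te}, t) \right\|_2^2 \right]
```

In this notation, $z_t$ is the noisy latent at timestep $t$, $z_a$ stands for the output of the auxiliary latent module, $c_{te}$ for the output of the text embedding module, and $\epsilon_\theta$ for the denoising network. The term $\mathcal{L}_{\mathrm{tp}}$ is assumed to penalize differences between generated and reference text regions in some feature space; its exact form is left open here.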

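Experimental Results

Research Questions

RQ1: Can a diffusion-based model render legible multilingual text in images at specified positions and regions, including curved or irregular regions?
RQ2: Can text editing within images keep font style and alignment consistent across languages?
RQ3: Does introducing OCR-based glyph embeddings and the auxiliary latent module improve multilingual text accuracy and visual realism?
RQ4: How do the text-control diffusion loss and the text perceptual loss affect writing accuracy and overall image quality?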

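Key Results

On AnyText-benchmark, AnyText outperforms competing methods in both English and Chinese text generation in terms of OCR accuracy (Sen. ACC, NED) and realism (FID).
The v1.1 model achieves a Sen. ACC of 0.7239 for English and 0.6923 for Chinese, with better NED and FID than prior methods.
The v1.0 model already surpasses several baselines and produces text that integrates strongly with the background (e.g., stone carvings, signboards).
The proposed OCR-based text embedding and auxiliary latent modules enable multi-line, deformed-region, and multilingual text generation and editing, including non-Latin scripts.
The large-scale AnyWord-3M dataset (3 million image-text pairs with OCR annotations) supports training, and AnyText-benchmark provides standardized evaluation for multilingual visual text generation.
Ablation studies show that the OCR-based text embedding, explicit position conditioning, and the text perceptual loss each contribute to higher Chinese and English text generation accuracy.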
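
To make the two OCR accuracy metrics concrete, here is a minimal sketch of how sentence accuracy and a normalized edit distance score might be computed from OCR readings of generated text lines. It assumes that Sen. ACC counts exact line matches and that NED is reported as a similarity (1 minus edit distance over the longer string's length, higher is better). The function names and sample strings are hypothetical, and this is not the benchmark's actual evaluation code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def sentence_accuracy(preds, targets):
    """Fraction of text lines whose OCR reading equals the target exactly."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equal length")
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def mean_ned_similarity(preds, targets):
    """Mean of 1 - edit_distance / max_length over all lines (assumed form)."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and equal length")
    scores = []
    for p, t in zip(preds, targets):
        longest = max(len(p), len(t))
        scores.append(1.0 if longest == 0 else 1.0 - edit_distance(p, t) / longest)
    return sum(scores) / len(scores)


# Hypothetical OCR readings of generated text versus the requested text.
preds = ["OPEN", "C0FFEE", "你好"]
targets = ["OPEN", "COFFEE", "你们好"]
print(sentence_accuracy(preds, targets))
print(mean_ned_similarity(preds, targets))
```

Sentence accuracy is strict: one wrong character makes the whole line count as a miss. The edit-distance score gives partial credit, which is why the two metrics are usually read together.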

This review was generated by AI and checked by a human editor.