QUICK REVIEW

[Paper Review] Think Bright, Diffuse Nice: Enhancing T2I-ICL via Inductive-Bias Hint Instruction and Query Contrastive Decoding

Zhiyong Ma, Zhenpeng Li|arXiv (Cornell University)|Jan 7, 2026
Generative Adversarial Networks and Image Synthesis | Cited by 0
One-Sentence Summary

TBDN is a training-free framework for Text-to-Image In-Context Learning (T2I-ICL) that combines Hint Instruction (HI) and Query Contrastive Decoding (QCD) to reduce compliance failures and prior-dominated hallucinations, achieving state-of-the-art results on CoBSAT and Text-to-Image Fast Mini-ImageNet.

ABSTRACT

Text-to-Image In-Context Learning (T2I-ICL) enables customized image synthesis via interleaved text-image examples but faces two mutually reinforcing bottlenecks, compliance failure and prior-dominated hallucination, that form a vicious cycle degrading generation quality. Existing methods rely on tailored training, which limits flexibility and raises deployment costs. To address these challenges effectively, we propose TBDN, a training-free framework integrating two complementary closed-loop mechanisms: Hint Instruction (HI) and Query Contrastive Decoding (QCD). HI injects task-aware inductive bias via lightweight prompt engineering to anchor models on contextual mapping rules, thereby mitigating compliance failure. QCD adjusts the decoding distributions of language models by contrasting full-input and query-omitted distributions, suppressing prior-dominated hallucination. TBDN achieves State-of-the-Art performance on CoBSAT and Text-to-Image Fast Mini-ImageNet, with robust generalization across model backbones, prompt designs, and hyperparameters. It also maintains promising performance in concept preservation and prompt following on Dreambench++. By breaking the two bottlenecks, TBDN establishes a simple yet effective framework for efficient and reliable T2I-ICL.

Research Motivation and Objectives

  • Identify two core bottlenecks in T2I-ICL: compliance failure and prior-dominated hallucination.
  • Propose a training-free framework TBDN that combines HI and QCD to address these bottlenecks.
  • Demonstrate robustness and generalization of TBDN across LVLM backbones, prompts, and hyperparameters.
  • Show state-of-the-art performance on CoBSAT and Text-to-Image Fast Mini-ImageNet with training-free deployment.

Proposed Method

  • Introduce Hint Instruction (HI): a prompt-based inductive-bias mechanism that prioritizes the final query to anchor mapping-rule reasoning.
  • Introduce Query Contrastive Decoding (QCD): a decoding strategy that contrasts full-input vs query-omitted distributions to suppress priors and align with the input context.
  • Describe a five-stage TBDN workflow: Pre-processing, Injection of HI, Reasoning by LVLM, Decoding with P_sub and P_full via QCD, and Diffusion-based image synthesis.
  • Formulate QCD distributions as P_full = ∏ pθ(y_t | X_ins, X_con, X_que, y_<t) and P_sub = ∏ pθ(y_t | X_ins, X_con, y_<t); Y is drawn from P_qcd ∝ softmax((1+α)·P_full − α·P_sub).
  • Show integration with diffusion models to translate LVLM outputs into high-fidelity images.
  • Compare against baselines (e.g., SEED-LLaMA, SEED-X, Emu, GILL, ThinkDiff) and ablate HI and QCD across multiple LVLM backbones and prompts.
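
The QCD formula above can be illustrated for a single decoding step. The following is a minimal sketch, not the authors' implementation: the function names `softmax` and `qcd_step`, the toy three-token vocabulary, and the choice α = 1 are all hypothetical. It also assumes the contrast is applied to logits (equivalently log-probabilities, since the two differ only by a constant that softmax ignores); the summary writes the terms as P_full and P_sub, so the paper may define this differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qcd_step(full_logits, sub_logits, alpha=1.0):
    """One contrastive decoding step (illustrative assumption, see lead-in).

    full_logits: next-token scores given instruction, context, and query.
    sub_logits:  next-token scores with the query omitted.
    Returns softmax((1 + alpha) * full - alpha * sub).
    """
    if len(full_logits) != len(sub_logits):
        raise ValueError("both score vectors must cover the same vocabulary")
    contrasted = [
        (1.0 + alpha) * f - alpha * s
        for f, s in zip(full_logits, sub_logits)
    ]
    return softmax(contrasted)


# Toy vocabulary of three tokens. Token 0 scores highly even without the
# query (a prior-driven guess); token 1 only gains support once the query
# is present.
full_logits = [2.0, 1.0, 0.0]
sub_logits = [2.0, 0.0, 0.0]

p_full = softmax(full_logits)
p_qcd = qcd_step(full_logits, sub_logits, alpha=1.0)

print("full-input distribution:", [round(p, 3) for p in p_full])
print("QCD distribution:       ", [round(p, 3) for p in p_qcd])
```

In this toy case the contrast shifts probability mass away from the token that is likely regardless of the query and toward the token the query supports. Setting `alpha=0.0` recovers the ordinary full-input distribution.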

Experimental Results

Research Questions

  • RQ1: Can HI mitigate compliance failure by injecting task-aware inductive bias toward the final query?
  • RQ2: Can QCD suppress prior-dominated hallucination by contrasting full-input and query-omitted decoding distributions?
  • RQ3: Do HI and QCD provide complementary gains, and does TBDN remain effective without training across LVLM backbones and prompts?
  • RQ4: How does TBDN perform relative to the state of the art on CoBSAT, Text-to-Image Fast Mini-ImageNet, and Dreambench++?
  • RQ5: How does HI compare with other instruction templates in terms of efficiency and token overhead?

Key Findings

  • TBDN achieves state-of-the-art results on CoBSAT and Text-to-Image Fast Mini-ImageNet across 2-shot and 4-shot settings.
  • Base (Q2) and Base (I3) pipelines outperform ThinkDiff without additional modality alignment.
  • Ablations show HI and QCD provide consistent improvements; combining both yields the strongest results.
  • TBDN is training-free and demonstrates robust generalization across LVLM backbones, prompts, and hyperparameters.
  • On Dreambench++, TBDN shows promising prompt following, though concept preservation is somewhat limited by the fixed visual generator.
  • HI generally improves background and action-related tasks, while QCD strengthens object/attribute inferences; together they form a complementary loop.
  • Compared with instruction variants (CB-Ins, CoT-Ins, TD-Ins, TD-Ins++), HI provides the best balance of effectiveness and efficiency with moderate token cost.

This review was generated by AI and reviewed by a human editor.