QUICK REVIEW

[Paper Review] EmoKGEdit: Training-free Affective Injection via Visual Cue Transformation

Jing Zhang, Bingjie Fan|arXiv (Cornell University)|Jan 18, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

One-sentence summary

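EmoKGEdit performs precise, training-free image emotion editing by using a Multimodal Sentiment Association Knowledge Graph (MSA-KG) to guide region-focused emotion injection while preserving structure. It combines an emotion localization module, knowledge-graph-grounded cue transfer, and a disentangled structure-emotion editing pipeline to balance emotion fidelity against content preservation.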

ABSTRACT

Existing image emotion editing methods struggle to disentangle emotional cues from latent content representations, often yielding weak emotional expression and distorted visual structures. To bridge this gap, we propose EmoKGEdit, a novel training-free framework for precise and structure-preserving image emotion editing. Specifically, we construct a Multimodal Sentiment Association Knowledge Graph (MSA-KG) to disentangle the intricate relationships among objects, scenes, attributes, visual cues and emotion. MSA-KG explicitly encodes the causal chain among object-attribute-emotion and serves as external knowledge to support chain-of-thought reasoning, guiding the multimodal large model to infer plausible emotion-related visual cues and generate coherent instructions. In addition, based on MSA-KG, we design a disentangled structure-emotion editing module that explicitly separates emotional attributes from layout features within the latent space, which ensures that the target emotion is effectively injected while strictly maintaining visual spatial coherence. Extensive experiments demonstrate that EmoKGEdit achieves excellent performance in both emotion fidelity and content preservation, and outperforms the state-of-the-art methods.

Research motivation and objectives

Motivate precise, region-aware image emotion editing that preserves semantic layout.
Disentangle emotional modulation from structural content during editing.
Leverage external knowledge to constrain and guide emotion-related visual cue injection.
Support interpretable, CoT-guided editing via a knowledge graph.
Demonstrate superior emotion fidelity and content preservation over state-of-the-art methods.

Proposed method

Localize emotion-causing regions via an Emotion Region-aware (ERA) module to constrain edits to foreground areas.
Construct a Multimodal Sentiment Association Knowledge Graph (MSA-KG) linking scenes, objects, attributes, and emotions.
Use the Emotion Cue Transfer (ECT) Module with Chain-of-Thought reasoning to convert cues into executable editing instructions grounded in MSA-KG.
Apply Disentangled Structure–Emotion Editing with dual diffusion paths, each with dedicated constraints, to inject emotion while preserving geometry.
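
The knowledge-graph-grounded cue lookup in the steps above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the triples, the `(object, attribute, emotion)` schema, and the function names `build_index`, `cues_for`, and `to_instruction` are all hypothetical, since this review does not describe the actual MSA-KG contents or the multimodal model's prompting. The sketch only shows the general idea of restricting candidate visual cues to those the graph associates with both a detected object and the target emotion.

```python
from collections import defaultdict

# Hypothetical (object, attribute, emotion) triples. The real MSA-KG schema
# and contents are not given in this review; these are illustrative only.
TRIPLES = [
    ("sky", "overcast", "sadness"),
    ("sky", "golden sunset", "contentment"),
    ("tree", "bare branches", "sadness"),
    ("tree", "blossoming", "contentment"),
    ("street", "rain-soaked", "sadness"),
]


def build_index(triples):
    """Index attributes by (object, emotion) so cues can be looked up directly."""
    index = defaultdict(list)
    for obj, attribute, emotion in triples:
        index[(obj, emotion)].append(attribute)
    return index


def cues_for(index, detected_objects, target_emotion):
    """Return {object: [attributes]} for detected objects that have at least
    one attribute linked to the target emotion in the graph."""
    cues = {}
    for obj in detected_objects:
        attributes = index.get((obj, target_emotion))
        if attributes:
            cues[obj] = list(attributes)
    return cues


def to_instruction(cues):
    """Serialize the selected cues into a plain-text editing instruction."""
    return "; ".join(f"{obj}: {', '.join(attrs)}" for obj, attrs in sorted(cues.items()))


index = build_index(TRIPLES)
cues = cues_for(index, ["sky", "tree", "car"], "sadness")
print(to_instruction(cues))
```

Objects with no matching entry in the graph (here `car`) are simply skipped, which is one simple way to keep an instruction limited to cues the graph supports.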

Figure 1: Inspired by cognitive psychology research on visual emotion processing, the method extracts emotion-inducing regions as scene cores and couples them with their corresponding objects. Through this region–object coupling, the model focuses on the most emotionally salient content.
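
The idea of confining an edit to an emotion-inducing region can be sketched as a generic masked blend. This is a simplification under stated assumptions: the paper's DSEE module works in the latent space of a diffusion model with dual paths, whereas the hypothetical `masked_blend` below operates on plain nested lists of pixel values and assumes a region mask is already available.

```python
def masked_blend(original, edited, mask):
    """Per-pixel blend of two equally sized 2D grids.

    A mask value of 1.0 takes the edited value, 0.0 keeps the original,
    and values in between interpolate linearly.
    """
    return [
        [m * e + (1.0 - m) * o for o, e, m in zip(orow, erow, mrow)]
        for orow, erow, mrow in zip(original, edited, mask)
    ]


# Toy 2x2 example: only the top-left pixel lies inside the region mask.
original = [[0.2, 0.2], [0.2, 0.2]]
edited = [[0.9, 0.9], [0.9, 0.9]]
mask = [[1.0, 0.0], [0.0, 0.0]]

result = masked_blend(original, edited, mask)
print(result)
```

Pixels outside the mask are returned unchanged, which is the property that keeps a region-constrained edit from altering the rest of the image.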

Experimental results

Research questions

RQ1: Can region-focused editing guided by a multimodal knowledge graph achieve precise emotion injection while maintaining content fidelity?
RQ2: Does the combination of CoT-driven cue transfer and disentangled editing improve both emotional accuracy and structural preservation compared to baselines?
RQ3: How does MSA-KG grounding affect the plausibility and coherence of edited images?
RQ4: What is the contribution of each module (ERA, ECT, DSEE) to overall performance?
RQ5: Is training-free editing feasible for multi-emotion scenarios with maintained semantics?

Main findings

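Metrics are grouped as Structure, Semantic, and Emotion; the Emotion columns are Emo_Acc8, Emo_Acc2, and TEA.
Method	SSIM ↑	AesScore ↑	CLIP-I Prox ↑	Semantic-C ↑	Emo_Acc8 ↑	Emo_Acc2 ↑	TEA ↑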
Instruct-pix2pix	0.3987	5.1174	0.3182	0.5950	0.1727	0.5893	0.1653
Qwen-Image-Edit	0.3594	5.3333	0.3740	0.6210	0.2426	0.6477	0.2159
AIF	0.3591	4.4810	0.5555	0.4630	0.1230	0.5044	0.1259
EmoEditor	0.3757	4.8638	0.5440	0.5280	0.2324	0.6328	0.2035
EmoEdit	0.3455	5.1380	0.3701	0.6330	0.3211	0.6657	0.2779
Ours	0.4204	5.6440	0.5774	0.6470	0.4452	0.8819	0.3179

EmoKGEdit scores highest among the compared methods on all seven reported metrics, covering structure, semantics, and emotion.
It substantially improves Emo_Acc8 (0.4452), Emo_Acc2 (0.8819), and TEA (0.3179) compared with EmoEdit (Emo_Acc8 0.3211, Emo_Acc2 0.6657, TEA 0.2779).
The ablation shows that ERA boosts TEA and SSIM, while adding ECT and DSEE yields the best balance, with higher Semantic-C and TEA despite a minor SSIM trade-off.
User studies rank EmoKGEdit highest across structural similarity, semantic plausibility, emotion activation, and aesthetics.
Qualitative results demonstrate content-faithful, localized emotion injection that avoids global over-editing.
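
To put the emotion-metric gap in perspective, the short Python sketch below computes absolute and relative gains of EmoKGEdit over EmoEdit. The numbers are copied from the last three columns of the EmoEdit and Ours rows of the results table; the dictionary keys and the helper `relative_gain` are illustrative names, not part of the paper.

```python
# Last three columns of the EmoEdit and Ours rows in the results table.
emoedit = {"Emo_Acc8": 0.3211, "Emo_Acc2": 0.6657, "TEA": 0.2779}
ours = {"Emo_Acc8": 0.4452, "Emo_Acc2": 0.8819, "TEA": 0.3179}


def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old


gains = {metric: relative_gain(ours[metric], emoedit[metric]) for metric in emoedit}

for metric, gain in gains.items():
    absolute = ours[metric] - emoedit[metric]
    print(f"{metric}: +{absolute:.4f} absolute, +{gain:.1f}% relative")
```

Reporting both figures is useful here because the three metrics start from very different baselines, so the same absolute gain can mean quite different relative improvements.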

Figure 2: Overview of EmoKGEdit. The proposed framework comprises four components: the Multimodal Sentiment Association Knowledge Graph (MSA-KG), the Emotion Region-aware Module, the Emotion Cue Transfer Module, and the Disentangled Structure–Emotion Editing (DSEE) Module.

This review was generated by AI and reviewed by human editors.