QUICK REVIEW

[Paper Review] Sketch-pix2seq: a Model to Generate Sketches of Multiple Categories

Yajing Chen, Shikui Tu | arXiv (Cornell University) | Sep 13, 2017

Advanced Image and Video Retrieval Techniques | References: 9 | Cited by: 45

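One-Sentence Summary

This paper proposes Sketch-pix2seq, a VAE-based model that improves multi-category sketch generation by replacing the RNN encoder of sketch-rnn with a CNN and removing the KL-divergence penalty. The model produces higher-quality sketches with better category accuracy, supports creative interpolation across categories, and outperforms prior methods in human likeness and multi-category structural consistency.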

ABSTRACT

Sketch is an important media for human to communicate ideas, which reflects the superiority of human intelligence. Studies on sketch can be roughly summarized into recognition and generation. Existing models on image recognition failed to obtain satisfying performance on sketch classification. But for sketch generation, a recent study proposed a sequence-to-sequence variational-auto-encoder (VAE) model called sketch-rnn which was able to generate sketches based on human inputs. The model achieved amazing results when asked to learn one category of object, such as an animal or a vehicle. However, the performance dropped when multiple categories were fed into the model. Here, we proposed a model called sketch-pix2seq which could learn and draw multiple categories of sketches. Two modifications were made to improve the sketch-rnn model: one is to replace the bidirectional recurrent neural network (BRNN) encoder with a convolutional neural network (CNN); the other is to remove the Kullback-Leibler divergence from the objective function of VAE. Experimental results showed that models with CNN encoders outperformed those with RNN encoders in generating human-style sketches. Visualization of the latent space illustrated that the removal of KL-divergence made the encoder learn a posterior of latent space that reflected the features of different categories. Moreover, the combination of CNN encoder and removal of KL-divergence, i.e., the sketch-pix2seq model, had better performance in learning and generating sketches of multiple categories and showed promising results in creativity tasks.

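Research Motivation and Objectives

Address the drop in sketch generation quality that models such as sketch-rnn show when several categories are learned at once.
Improve sketch generation quality by replacing the RNN encoder with a CNN encoder that better captures structural features.
Investigate whether removing the KL-divergence penalty from the VAE objective improves the disentanglement of category-specific representations in the latent space.
Evaluate the model's ability to generate creative sketches across categories through latent-space interpolation.
Test the model's generalization by generating sketches from cartoon-style inputs that differ in style but share the same semantic features.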

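Proposed Method

Replace the bidirectional RNN encoder in sketch-rnn with a convolutional neural network (CNN) to better capture the local structural features of sketches.
Remove the Kullback-Leibler (KL) divergence term from the VAE objective so that the latent space is not forced to follow a shared Gaussian prior.
Train the model within the variational autoencoder (VAE) framework on stroke-sequence data from the QuickDraw dataset.
Use latent-space interpolation, linearly combining the latent codes of different categories, to generate novel sketches.
Evaluate model performance with a human Turing test and a qualitative analysis of the generated sketches.
Test generalization by feeding in cartoon sketches and assessing the style and semantic consistency of the outputs.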
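
The summary does not spell out what the CNN encoder takes as input. Assuming, as the name pix2seq suggests, that each stroke sequence is first rendered to a bitmap, the conversion could look roughly like the sketch below. The function name `rasterize`, the 48-pixel canvas, and the stroke-3 row layout `[dx, dy, pen_lifted]` are illustrative assumptions rather than details confirmed by this review, and dense point sampling stands in for a proper line-drawing routine.

```python
import numpy as np

def rasterize(strokes, size=48, samples_per_segment=96):
    """Render rows of [dx, dy, pen_lifted] to a size x size bitmap (illustrative only)."""
    strokes = np.asarray(strokes, dtype=float)
    # Offsets are relative, so absolute pen positions are cumulative sums from the origin.
    points = np.vstack([[0.0, 0.0], np.cumsum(strokes[:, :2], axis=0)])
    # Scale the drawing uniformly into the canvas, leaving a one-pixel margin.
    lo = points.min(axis=0)
    span = max((points.max(axis=0) - lo).max(), 1e-8)
    points = (points - lo) / span * (size - 3) + 1.0
    image = np.zeros((size, size), dtype=np.float32)
    pen_down = True  # assumption: the pen touches the paper at the first point
    for i in range(len(strokes)):
        if pen_down:
            # Sample densely along the segment and mark the nearest pixels.
            t = np.linspace(0.0, 1.0, samples_per_segment)[:, None]
            segment = points[i] * (1.0 - t) + points[i + 1] * t
            cols = np.rint(segment[:, 0]).astype(int)
            rows = np.rint(segment[:, 1]).astype(int)
            image[rows, cols] = 1.0
        # A pen_lifted flag of 1 means the next segment is a move, not a drawn line.
        pen_down = strokes[i, 2] == 0
    return image

# A closed square drawn in one stroke (hypothetical data).
square = [[10, 0, 0], [0, 10, 0], [-10, 0, 0], [0, -10, 1]]
bitmap = rasterize(square)
```

The resulting array would be what a CNN encoder consumes, while the decoder still emits a stroke sequence.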

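Experimental Results

Research Questions

RQ1: Does replacing the RNN encoder with a CNN encoder improve the quality and category accuracy of generated sketches in the multi-category setting?
RQ2: Does removing the KL-divergence penalty from the VAE objective improve the disentanglement of category-specific features in the latent space?
RQ3: Can the model generate plausible and creative sketches through latent-space interpolation between categories?
RQ4: Does the model generalize well to inputs that are not real sketches, such as cartoon characters, while keeping style and semantic features consistent?
RQ5: How do models with and without the KL-divergence term differ in latent-space structure, for example in clustering and category separation?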
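
RQ2 and RQ5 both hinge on the KL term, so it may help to see what removing it means numerically. The sketch below uses the standard closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior. The function names, the `kl_weight` switch, and the scalar stand-in for the reconstruction loss are assumptions made for illustration; the actual reconstruction loss of sketch-rnn is not reproduced here.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, I)), averaged over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return -0.5 * np.mean(1.0 + log_var - mu ** 2 - np.exp(log_var))

def vae_objective(reconstruction_loss, mu, log_var, kl_weight=1.0):
    """Reconstruction loss plus a weighted KL penalty; kl_weight=0 drops the penalty."""
    return reconstruction_loss + kl_weight * kl_to_standard_normal(mu, log_var)

# Hypothetical encoder outputs for one sketch.
mu = np.array([2.0, -1.0])
log_var = np.zeros(2)
with_kl = vae_objective(0.5, mu, log_var, kl_weight=1.0)     # posterior pulled toward N(0, I)
without_kl = vae_objective(0.5, mu, log_var, kl_weight=0.0)  # only reconstruction matters
```

With the penalty switched off, nothing pulls the encoder outputs of every category toward the same prior, which is consistent with the category-separated clusters the review reports.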

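Key Findings

Models with a CNN encoder outperform RNN-based models at generating human-style sketches, showing higher quality and better category accuracy in the Turing test.
Removing the KL-divergence term yields a more clearly structured latent space with more distinct category clusters, which reduces the generation of wrong-category or mixed-category sketches.
In the models without KL divergence, latent-space interpolation produces stable and interpretable results, such as a truck with cat features or a rabbit with a bus body.
The CNN-KL model successfully generates novel sketches that do not appear in the training data, such as a cat with wheels on its face or a vehicle with a rabbit's head, showing strong creative potential.
The model generalizes well to cartoon inputs: the generated sketches retain key style features such as ear shape and facial expression, even when the input is a stylized, non-photorealistic image.
Visualizations show that models with KL divergence produce a scattered, mixed latent space, whereas models without it form clear, category-separated clusters, which explains the performance gain.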
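
The interpolation results above rest on linearly combining the latent codes of two sketches. A minimal sketch of that step follows, assuming the codes are plain vectors; the function name, the four-dimensional codes, and the number of steps are hypothetical, and the decoder that would turn each code into a stroke sequence is omitted.

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=5):
    """Return `steps` latent codes moving linearly from z_a to z_b, endpoints included."""
    z_a = np.asarray(z_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    # One mixing coefficient per row, broadcast across the latent dimensions.
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a + t * z_b

# Hypothetical latent codes for a cat sketch and a truck sketch.
z_cat = np.array([1.0, 0.0, -2.0, 0.5])
z_truck = np.array([-1.0, 2.0, 0.0, 0.5])
path = interpolate_latents(z_cat, z_truck, steps=5)
```

Each row of `path` would be passed to the decoder in turn; the middle rows are where mixed results such as a truck with cat features would be expected to appear.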

This review was generated by AI and reviewed by human editors.