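[Paper Digest] SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
SeqGAN treats sequence generation as a reinforcement learning problem: a GAN discriminator supplies the end-of-sequence reward, and policy gradient with Monte Carlo roll-outs trains the discrete-token generator. It outperforms the baselines on both synthetic and real-world sequence tasks.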
As a new way of training generative models, Generative Adversarial Nets (GAN) that uses a discriminative model to guide the training of the generative model has enjoyed considerable success in generating real-valued data. However, it has limitations when the goal is for generating sequences of discrete tokens. A major reason lies in that the discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model. Also, the discriminative model can only assess a complete sequence, while for a partially generated sequence, it is non-trivial to balance its current score and the future one once the entire sequence has been generated. In this paper, we propose a sequence generation framework, called SeqGAN, to solve the problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing gradient policy update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines.
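The two ideas in the abstract (a policy-gradient update for the generator, and a sequence-level reward propagated to intermediate steps by Monte Carlo search) can be written compactly. The block below is a sketch in the paper's notation as best reconstructed here: $G_\theta$ is the generator, $D_\phi$ the discriminator, $G_\beta$ the roll-out policy, $T$ the sequence length, and $N$ the number of roll-outs. No equation numbering is implied.

```latex
% Generator objective: expected end-of-sequence reward from the start state s_0
J(\theta) = \mathbb{E}[R_T \mid s_0, \theta]
          = \sum_{y_1 \in \mathcal{Y}} G_\theta(y_1 \mid s_0)\,
            Q_{D_\phi}^{G_\theta}(s_0, y_1)

% Action value: average discriminator score over N roll-outs for t < T,
% and the discriminator score itself once the sequence is complete
Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t) =
\begin{cases}
  \dfrac{1}{N} \sum_{n=1}^{N} D_\phi\!\left(Y_{1:T}^{n}\right),
  \quad Y_{1:T}^{n} \in \mathrm{MC}^{G_\beta}(Y_{1:t}; N) & \text{for } t < T \\[6pt]
  D_\phi(Y_{1:t}) & \text{for } t = T
\end{cases}

% Policy gradient: no derivative is taken through the sampled tokens
\nabla_\theta J(\theta) \simeq \sum_{t=1}^{T}
  \mathbb{E}_{y_t \sim G_\theta(\cdot \mid Y_{1:t-1})}
  \left[ \nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}) \,
         Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t) \right]
```

The gradient only needs $\log G_\theta$ evaluated at sampled tokens, which is why the discreteness of the outputs stops being an obstacle.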
Motivation and Objectives
- Address exposure bias and training/inference mismatch in sequence generation.
- Enable GAN-based training for discrete token sequences by avoiding differentiating through discrete outputs.
- Leverage a discriminator-derived reward with Monte Carlo rollouts to optimize a stochastic policy (generator).
- Demonstrate effectiveness on synthetic data and on real tasks such as poem, political-speech text, and music generation.
Proposed Method
- Model the sequence generator as a stochastic policy in reinforcement learning.
- Use a CNN-based discriminator to judge entire sequences and provide a reward signal.
- Apply Monte Carlo search to estimate the action-value Q for intermediate states.
- Optimize the generator with policy gradient (REINFORCE) using the discriminator reward (Eq. 9–11).
- Pre-train G with maximum likelihood, then alternate training between G and D (Algorithm 1).
- Use a roll-out policy G_β to draw N samples for estimating intermediate rewards (Eq. 4–7).
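The roll-out and policy-gradient steps listed above can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the paper's implementation: the generator is a position-wise logit table rather than an LSTM, `toy_discriminator` is a fixed hypothetical scoring function rather than a trained CNN, the roll-out policy is taken to be the generator itself, and all names and hyperparameters are invented for the example.

```python
import math
import random

VOCAB, SEQ_LEN = 2, 4  # toy sizes, chosen only for illustration


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(theta, t, rng):
    """Sample y_t from the toy generator (position-wise logits, no context)."""
    return rng.choices(range(VOCAB), weights=softmax(theta[t]))[0]


def toy_discriminator(seq):
    """Hypothetical stand-in for D: 'realness' is the fraction of 1-tokens."""
    return sum(seq) / len(seq)


def rollout_q(theta, prefix, n_rollouts, rng):
    """Estimate Q for the last token of `prefix` with Monte Carlo roll-outs."""
    if len(prefix) == SEQ_LEN:  # complete sequence: score it directly
        return toy_discriminator(prefix)
    total = 0.0
    for _ in range(n_rollouts):  # complete the prefix N times and average
        seq = list(prefix)
        for t in range(len(prefix), SEQ_LEN):
            seq.append(sample_token(theta, t, rng))
        total += toy_discriminator(seq)
    return total / n_rollouts


def generator_step(theta, n_rollouts, lr, rng):
    """One REINFORCE update from a single sampled sequence."""
    seq = [sample_token(theta, t, rng) for t in range(SEQ_LEN)]
    for t in range(SEQ_LEN):
        q = rollout_q(theta, seq[: t + 1], n_rollouts, rng)
        probs = softmax(theta[t])
        for v in range(VOCAB):
            # d log softmax / d logit_v = 1[v == y_t] - p_v
            grad_log_prob = (1.0 if v == seq[t] else 0.0) - probs[v]
            theta[t][v] += lr * q * grad_log_prob
    return seq


def train(steps=2000, n_rollouts=8, lr=0.1, seed=0):
    """Run generator updates only; the toy discriminator stays fixed."""
    rng = random.Random(seed)
    theta = [[0.0] * VOCAB for _ in range(SEQ_LEN)]
    for _ in range(steps):
        generator_step(theta, n_rollouts, lr, rng)
    return theta
```

Calling `train()` should push every position toward emitting token 1, since that is what the stand-in discriminator rewards. The full method additionally retrains the discriminator on fresh generator samples between generator updates, which this sketch leaves out.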
Experimental Results
Research Questions
- RQ1: Can GANs be effectively applied to discrete sequence generation through reinforcement learning without passing gradients through discrete outputs?
- RQ2: Does discriminator-guided policy optimization improve the quality of generated sequences compared to MLE, scheduled sampling, and BLEU-guided PG baselines?
- RQ3: How does SeqGAN perform on synthetic distributions and on real-world sequence tasks such as poem, political-speech text, and music generation?
Key Findings
| Algorithm | Oracle NLL (lower is better) | p-value |
|---|---|---|
| Random | 10.310 | <10^{-6} |
| MLE | 9.038 | <10^{-6} |
| SS | 8.985 | <10^{-6} |
| PG-BLEU | 8.946 | <10^{-6} |
| SeqGAN | 8.736 | <10^{-6} |
- On synthetic data, SeqGAN significantly outperforms the baselines (MLE, scheduled sampling, PG-BLEU), reaching the lowest oracle NLL (8.736 versus 8.946 for the best baseline, PG-BLEU).
- SeqGAN yields substantial improvements over baselines on real-world tasks (Chinese poem generation, Obama speech generation, and music generation), as measured by BLEU and human judgments.
- The training strategy (g-steps, d-steps, and k, the number of discriminator training epochs per d-step) affects stability and convergence; only some configurations yield stable, superior performance.
- Discriminator-based rewards provide a more general training signal for sequence generation than task-specific metrics such as BLEU.
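The oracle NLL reported in the table is the average negative log-likelihood that a known "oracle" model assigns to sequences sampled from the generator. In the paper the oracle is a randomly initialized LSTM; the sketch below swaps in a hypothetical two-position, three-token table (`ORACLE`) so that the metric can be computed by hand. All names and numbers here are assumptions for illustration and are unrelated to the values in the table.

```python
import math
import random

# Hypothetical toy oracle: an independent distribution over 3 tokens at each
# of 2 positions (a stand-in for a real sequence model).
ORACLE = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]


def oracle_log_prob(seq):
    """Log-probability the oracle assigns to a full sequence."""
    return sum(math.log(ORACLE[t][tok]) for t, tok in enumerate(seq))


def sample(dist, n, rng):
    """Draw n sequences from a position-wise distribution table."""
    return [
        [rng.choices(range(len(row)), weights=row)[0] for row in dist]
        for _ in range(n)
    ]


def oracle_nll(samples):
    """Average negative log-likelihood of generated samples under the oracle."""
    return -sum(oracle_log_prob(s) for s in samples) / len(samples)


rng = random.Random(0)
uniform_gen = [[1 / 3] * 3, [1 / 3] * 3]  # an untrained, uniform generator
matched_gen = ORACLE                      # a generator that matches the oracle
nll_uniform = oracle_nll(sample(uniform_gen, 5000, rng))
nll_matched = oracle_nll(sample(matched_gen, 5000, rng))
```

A generator whose samples look more like the oracle's gets a lower score, which is the sense in which the NLL column ranks the algorithms.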
This digest was generated by AI and reviewed by human editors.