QUICK REVIEW

[Paper Review] GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer

Junmo Cho, Suhan Kim|arXiv (Cornell University)|Feb 3, 2026
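Topic Modeling | Cited by 0
One-Sentence Summary

GFlowPO casts prompt optimization as posterior inference and uses off-policy GFlowNet training plus Dynamic Memory Update to efficiently discover high-reward prompts across a range of language models and tasks.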

ABSTRACT

Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, and rewards are sparse due to expensive target-LM evaluation. Yet, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.

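Research Motivation and Goals

  • Motivate automatic prompt optimization by the combinatorial prompt space and sparse rewards.
  • Formulate prompt search as posterior inference regularized by a meta-prompt prior.
  • Develop an off-policy GFlowNet training scheme that is sample-efficient in prompt evaluations.
  • Introduce Dynamic Memory Update (DMU) to adaptively concentrate the search on high-reward regions.
  • Demonstrate robustness across diverse tasks and LM pairs.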

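Proposed Method

  • Define the posterior over prompts p(z|D,M) as proportional to p(D|z) p_ref(z|M).
  • Fine-tune a lightweight prompt-LM with an off-policy GFlowNet objective, using a replay-based training policy.
  • Train the GFlowNet with a VarGrad-based global path consistency loss, sampling from a replay buffer.
  • Replace the likelihood with the training accuracy A_D(z), which correlates better with test performance.
  • Update the meta-prompt M via Dynamic Memory Update, mixing prompts from the replay buffer with prompts from a small high-reward buffer.
  • Evaluate on text classification, instruction induction, and question answering tasks across multiple prompt-LM/target-LM pairs.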
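
The training objective in the list above can be illustrated with a small numeric sketch. It assumes one common reading of a VarGrad-style path consistency loss: for a sampler p_theta(z|M) and an unnormalized target R(z) = A_D(z) p_ref(z|M), the quantity log R(z) - log p_theta(z|M) should be the same constant (log Z) for every prompt, so the loss is its variance over a batch. The function names, the `eps` smoothing, and the exact reward form are assumptions for illustration; the paper's own formulation may differ, for example by tempering the reward.

```python
import math


def log_reward(train_accuracy, log_p_ref, eps=1e-8):
    """Unnormalized log-posterior: log A_D(z) + log p_ref(z | M).

    Assumption: training accuracy stands in for the likelihood p(D | z);
    eps (hypothetical) avoids log(0) for prompts with zero accuracy.
    """
    return math.log(train_accuracy + eps) + log_p_ref


def vargrad_loss(log_p_theta, log_rewards):
    """Batch variance of the gap log R(z) - log p_theta(z | M).

    The batch mean of the gap acts as the estimate of log Z, so no
    separate partition-function parameter is learned. The prompts may
    come from any source (fresh samples or a replay buffer), which is
    what makes the objective usable off-policy.
    """
    if not log_p_theta or len(log_p_theta) != len(log_rewards):
        raise ValueError("need two non-empty lists of equal length")
    gaps = [r - p for p, r in zip(log_p_theta, log_rewards)]
    log_z_hat = sum(gaps) / len(gaps)
    return sum((g - log_z_hat) ** 2 for g in gaps) / len(gaps)


# Three prompts with made-up accuracies and reference-LM log-probabilities.
log_rewards = [log_reward(0.9, -12.0), log_reward(0.5, -10.0), log_reward(0.7, -11.0)]

# A sampler already proportional to the reward gives a loss close to zero.
matched = [r - 3.0 for r in log_rewards]
print(vargrad_loss(matched, log_rewards))

# A uniform sampler over the same prompts gives a strictly positive loss.
uniform = [math.log(1.0 / 3.0)] * 3
print(vargrad_loss(uniform, log_rewards))
```

In actual fine-tuning the same expression would be computed on tensors in an autodiff framework so that gradients flow into the prompt-LM; plain floats are used here only to keep the arithmetic visible.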
Figure 1: Concepts. Blue contour indicates high-performing prompt regions. (a) Existing on-policy RL frameworks fail to explore the huge combinatorial search space with poor sample efficiency. (b) GFlowPO explores the search space sample-efficiently by gradually annealing the posterior

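Experimental Results

Research Questions

  • RQ1: Can prompt search be effectively formulated as posterior inference over prompts with a meta-prompt prior?
  • RQ2: Does off-policy GFlowNet training improve the sample efficiency of discovering high-reward prompts compared with on-policy reinforcement learning methods?
  • RQ3: Does the training-free Dynamic Memory Update (DMU) effectively focus the search on high-reward regions over iterations?
  • RQ4: How does GFlowPO perform across different tasks and LM pairs in few-shot and instruction induction settings?

Main Findings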

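| Category | Method | SST-2 | MRPC | RTE | QNLI | MNLI | SNLI | Average |
|---|---|---|---|---|---|---|---|---|
| | Fine-Tuning | 71.9 | 59.6 | 55.7 | 63.1 | 41.1 | 64.8 | 59.3 |
| | Soft prompt tuning | 78.3 | 57.1 | 51.6 | 89.0 | 34.9 | 55.8 | 61.1 |
| Fixed prompt | Manual Prompt | 89.1 | 51.0 | 64.0 | 73.0 | 67.0 | 47.0 | 65.2 |
| | Zero-shot CoT | 57.9 | 38.4 | 81.6 | 75.2 | 71.1 | 66.3 | 65.1 |
| | Few-shot prompt | 55.0 | 49.0 | 76.0 | 82.0 | 58.0 | 52.2 | 62.0 |
| Discrete Prompt Tuning | GrIPS | 84.7 ± 4.6 | 55.6 ± 2.6 | 60.9 ± 3.5 | 28.9 ± 1.2 | 44.4 ± 1.1 | 63.5 ± 2.3 | 59.4 |
| | PromptBoosting | 65.4 ± 1.0 | 52.7 ± 1.1 | 71.6 ± 0.9 | 71.6 ± 1.1 | 35.5 ± 1.4 | 52.6 ± 1.8 | 58.2 |
| | APE | 83.2 ± 7.7 | 55.3 ± 4.9 | 78.6 ± 1.3 | 75.0 ± 2.2 | 54.6 ± 7.9 | 72.3 ± 4.8 | 70.1 |
| | ProTeGi | 69.2 ± 8.4 | 48.8 ± 1.3 | 73.2 ± 6.3 | 74.2 ± 7.7 | 56.6 ± 10.9 | 61.3 ± 12.3 | 64.0 |
| | RLprompt | 70.8 ± 6.5 | 56.0 ± 1.5 | 67.3 ± 2.5 | 62.6 ± 1.3 | 54.6 ± 1.9 | 56.6 ± 1.3 | 61.3 |
| | StablePrompt | 92.5 ± 1.3 | 71.3 ± 3.4 | 81.5 ± 2.8 | 75.9 ± 1.4 | 63.3 ± 1.2 | 74.1 ± 1.4 | 76.4 |
| | GFlowPO | 93.0 ± 0.6 | 69.6 ± 4.2 | 82.0 ± 2.5 | 80.2 ± 3.4 | 68.7 ± 3.2 | 78.6 ± 2.7 | 78.7 |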
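  • In the reported Table 1 comparison, GFlowPO achieves the highest average accuracy (78.7) across the six few-shot text classification datasets.
  • GFlowPO outperforms the baselines on SST-2, RTE, and SNLI, and is competitive on the remaining datasets and task/LM pairs.
  • On instruction induction and BBII tasks, GFlowPO consistently outperforms the baselines in average accuracy, including on text generation tasks that require exact token matching.
  • On question answering (MMLU and OpenBookQA), GFlowPO achieves the best OpenBookQA score and competitive MMLU results.
  • Ablations show that the contributions of off-policy training and DMU are additive, with DMU yielding significant gains when combined with off-policy learning.
  • Training accuracy curves show that GFlowPO discovers high-reward prompts more efficiently than StablePrompt.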
Figure 2 : GFlowPO pipeline. The optimizer prompt-LM samples prompts conditioned on meta-prompt $M$ , the target LLM provides rewards, and off-policy GFlowNet training plus Dynamic Memory Update (DMU) iteratively improves exploration and prompt quality.
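
The memory side of this pipeline can be sketched in a few lines. The example below is a minimal illustration of the DMU idea as the abstract describes it: every evaluated prompt goes into a replay buffer, a small size-capped priority queue keeps the top-performing prompts, and the meta-prompt is rebuilt from a mix of both without any weight update. The class name `PromptMemory`, its parameters, and the meta-prompt text layout are all hypothetical; the paper's actual template and sampling rule are not given in this summary.

```python
import heapq
import random


class PromptMemory:
    """Hypothetical container for a replay buffer plus a small top-k queue."""

    def __init__(self, queue_size=3, seed=0):
        self.replay = []              # every evaluated (reward, prompt) pair
        self.top = []                 # min-heap holding the best queue_size pairs
        self.queue_size = queue_size
        self.rng = random.Random(seed)

    def add(self, prompt, reward):
        """Record one target-LM evaluation of a prompt."""
        self.replay.append((reward, prompt))
        heapq.heappush(self.top, (reward, prompt))
        if len(self.top) > self.queue_size:
            heapq.heappop(self.top)   # evict the lowest-reward entry

    def build_meta_prompt(self, base_instruction, n_diverse=2):
        """Rebuild the meta-prompt M from diverse and top-performing prompts."""
        top = [p for _, p in sorted(self.top, reverse=True)]
        pool = [p for _, p in self.replay if p not in top]
        diverse = self.rng.sample(pool, min(n_diverse, len(pool)))
        lines = [base_instruction, "Reference prompts:"]
        lines += [f"- {p}" for p in diverse + top]
        return "\n".join(lines)


# Made-up prompts and rewards (for example, training accuracy on the task).
memory = PromptMemory(queue_size=2)
for prompt, reward in [
    ("Label the sentiment.", 0.61),
    ("Is this review positive or negative?", 0.88),
    ("Classify the text.", 0.55),
    ("Decide whether the reviewer liked the movie.", 0.91),
]:
    memory.add(prompt, reward)

print(memory.build_meta_prompt("Write an instruction for this task.", n_diverse=1))
```

Uniform sampling from the replay buffer is used for the diverse prompts only as a placeholder; any diversity-aware selection rule could be substituted at that line.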


This review was generated by AI and reviewed by a human editor.