QUICK REVIEW

[Paper Review] FlexRec: Adapting LLM-based Recommenders for Flexible Needs via Reinforcement Learning

Yijun Pan, Weikang Qiu|arXiv (Cornell University)|Mar 12, 2026
Recommender Systems and Techniques|Citations: 0
ONE-LINE SUMMARY

FlexRec introduces an uncertainty-aware, swap-based item-level reward for LLM-based recommender post-training, enabling dynamic, need-specific ranking with improved stability and performance.

ABSTRACT

Modern recommender systems must adapt to dynamic, need-specific objectives for diverse recommendation scenarios, yet most traditional recommenders are optimized for a single static target and struggle to reconfigure behavior on demand. Recent advances in reinforcement-learning-based post-training have unlocked strong instruction-following and reasoning capabilities in LLMs, suggesting a principled route for aligning them to complex recommendation goals. Motivated by this, we study closed-set autoregressive ranking, where an LLM generates a permutation over a fixed candidate set conditioned on user context and an explicit need instruction. However, applying RL to this setting faces two key obstacles: (i) sequence-level rewards yield coarse credit assignment that fails to provide fine-grained training signals, and (ii) interaction feedback is sparse and noisy, which together lead to inefficient and unstable updates. We propose FlexRec, a post-training RL framework that addresses both issues with (1) a causally grounded item-level reward based on counterfactual swaps within the remaining candidate pool, and (2) critic-guided, uncertainty-aware scaling that explicitly models reward uncertainty and down-weights low-confidence rewards to stabilize learning under sparse supervision. Across diverse recommendation scenarios and objectives, FlexRec achieves substantial gains: it improves NDCG@5 by up to 59% and Recall@5 by up to 109.4% in need-specific ranking, and further achieves up to 24.1% Recall@5 improvement under generalization settings, outperforming strong traditional recommenders and LLM-based baselines.

MOTIVATION AND OBJECTIVES

  • Motivate the need for dynamic, need-specific recommendations beyond static objectives.
  • Introduce a principled RL post-training framework for LLM-based recommenders.
  • Develop fine-grained, item-level credit assignment suitable for autoregressive ranking.
  • Incorporate uncertainty modeling to stabilize learning under sparse feedback.
  • Show that a universal, multi-need LLM ranker can generalize across needs.

PROPOSED METHOD

  • Define need-conditioned autoregressive ranking where the model outputs a permutation of a fixed candidate set conditioned on user context and a specified need.
  • Propose swap-based item-level rewards via counterfactual swaps within the remaining candidate pool to provide dense, position-aware supervision.
  • Train a critic to predict both reward values and their uncertainty, and integrate this uncertainty into GRPO updates to down-weight unreliable rewards.
  • Apply uncertainty-aware GRPO to stabilize learning when interactions are sparse and noisy.
  • Normalize and combine item-level advantages with sequence-level advantages to guide learning for both item tokens and non-item tokens.
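The swap-based item-level reward described above can be illustrated with a small sketch. This is not the paper's exact formulation: the DCG utility, the averaging over swaps, and the names `dcg` and `swap_item_rewards` are all assumptions made for illustration. The idea shown is that the item placed at position t is credited by how much the ranking quality would change if it were swapped with each item still in the remaining pool (the items ranked after it).

```python
import math

def dcg(ranking, relevance, k=5):
    """DCG of the top-k items; `relevance` maps item -> gain (missing = 0)."""
    return sum(relevance.get(item, 0.0) / math.log2(pos + 2)
               for pos, item in enumerate(ranking[:k]))

def swap_item_rewards(ranking, relevance, k=5):
    """Hypothetical item-level reward (an assumption, not the paper's formula).

    For each position t, swap the chosen item with every item ranked after it
    and average the resulting loss in DCG. Positive means the chosen item was
    a better pick for that slot than the alternatives, on average.
    """
    base = dcg(ranking, relevance, k)
    rewards = []
    for t in range(len(ranking)):
        remaining = range(t + 1, len(ranking))  # remaining candidate pool
        if len(remaining) == 0:
            rewards.append(0.0)  # last slot has no counterfactual
            continue
        deltas = []
        for j in remaining:
            swapped = list(ranking)
            swapped[t], swapped[j] = swapped[j], swapped[t]
            deltas.append(base - dcg(swapped, relevance, k))
        rewards.append(sum(deltas) / len(deltas))
    return rewards

good = swap_item_rewards(["a", "b", "c"], {"a": 1.0})
bad = swap_item_rewards(["b", "a", "c"], {"a": 1.0})
```

In this toy case, placing the only relevant item first earns a positive reward at position 0, while placing an irrelevant item first earns a negative one, so each slot receives its own signal instead of one score for the whole sequence.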
Figure 1: Overall framework of FlexRec. Given a candidate set and an explicit user need, an LLM recommender generates ranked rollouts. An item-level reward is computed by evaluating the marginal contribution of individual item placements via counterfactual swaps (top right). A critic predicts both
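The uncertainty-aware GRPO step from the method list can be sketched in a similarly hedged way. The review states only that the critic's uncertainty is used to down-weight unreliable rewards; the specific weight `1 / (1 + variance)` and the function names below are assumptions chosen for illustration, not the authors' implementation. The group normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Standard group-relative advantages: z-score rewards within a rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def uncertainty_scaled_advantages(rewards, variances, eps=1e-8):
    """Hypothetical down-weighting (an assumption, not the paper's formula).

    `variances` holds the critic's predicted uncertainty for each reward.
    Higher variance gives a smaller weight, so low-confidence rewards
    contribute less to the policy update.
    """
    weights = [1.0 / (1.0 + v) for v in variances]
    return [w * a for w, a in zip(weights, grpo_advantages(rewards, eps))]

# Two rollouts: the second reward is uncertain, so its advantage is shrunk.
scaled = uncertainty_scaled_advantages([1.0, 0.0], [0.0, 3.0])
```

With these inputs the first advantage stays near 1 while the second shrinks from about -1 to about -0.25, which is the qualitative effect the method describes.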

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: How can we design fine-grained, item-level credit signals for autoregressive ranking in LLM-based recommenders?
  • RQ2: Can uncertainty-aware updates stabilize RL from verifiable rewards under sparse feedback?
  • RQ3: Do need-specific, post-trained LLM rankers generalize across different recommendation needs?
  • RQ4: Is a single jointly trained model capable of serving multiple needs with inference-time need conditioning?
  • RQ5: What performance gains do swap-based item-level rewards and uncertainty modeling provide over sequence-level rewards?

KEY FINDINGS

  • FlexRec achieves substantial improvements in need-specific ranking, with NDCG@5 gains up to 59% and Recall@5 gains up to 109.4% in Maximizing Interest scenarios.
  • Uncertainty-aware updates stabilize training and improve performance under sparse feedback, outperforming CF-based imputation and plain RLVR.
  • Swap-based item-level rewards provide dense, causal credit signals and outperform non-causal or purely sequence-level rewards.
  • FlexRec generalizes across needs, with zero-shot transfer showing improved performance on Explore New Topics and Trend Promotion when trained on Max-Interest.
  • A single model trained jointly on all needs can act as a universal recommender, adapting behavior via need instructions at inference.
  • Across datasets (KuaiRec, MovieLens-1M, ESCI), FlexRec consistently outperforms traditional rerankers and other post-trained LLM baselines.
Figure 2: Performance across all three needs on KuaiRec. FlexRec is trained jointly on all needs. Joint training yields consistently stronger performance across needs, supporting FlexRec as an all-purpose recommender conditioned by need instructions.
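For readers less familiar with the metrics behind the reported gains, the following is a minimal sketch of Recall@5 and NDCG@5 under binary relevance. The paper's exact evaluation protocol is not given in this review, so the binary-relevance assumption, the choice to divide recall by the total number of relevant items, and the function names are illustrative only.

```python
import math

def recall_at_k(ranking, relevant, k=5):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranking[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranking, relevant, k=5):
    """Binary-relevance NDCG: DCG of the top-k divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranking[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

ranking = ["a", "b", "c", "d", "e", "f"]
relevant = {"a", "f"}  # "f" falls outside the top 5
r5 = recall_at_k(ranking, relevant)
n5 = ndcg_at_k(ranking, relevant)
```

Here one of the two relevant items is in the top 5, so Recall@5 is 0.5, and NDCG@5 is below 1 because the ideal ranking would place both relevant items in the first two positions.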

This review was generated by AI and checked by a human editor.