[Paper Review] FlexRec: Adapting LLM-based Recommenders for Flexible Needs via Reinforcement Learning
FlexRec introduces an uncertainty-aware, swap-based item-level reward for LLM-based recommender post-training, enabling dynamic, need-specific ranking with improved stability and performance.
Modern recommender systems must adapt to dynamic, need-specific objectives for diverse recommendation scenarios, yet most traditional recommenders are optimized for a single static target and struggle to reconfigure behavior on demand. Recent advances in reinforcement-learning-based post-training have unlocked strong instruction-following and reasoning capabilities in LLMs, suggesting a principled route for aligning them to complex recommendation goals. Motivated by this, we study closed-set autoregressive ranking, where an LLM generates a permutation over a fixed candidate set conditioned on user context and an explicit need instruction. However, applying RL to this setting faces two key obstacles: (i) sequence-level rewards yield coarse credit assignment that fails to provide fine-grained training signals, and (ii) interaction feedback is sparse and noisy, which together lead to inefficient and unstable updates. We propose FlexRec, a post-training RL framework that addresses both issues with (1) a causally grounded item-level reward based on counterfactual swaps within the remaining candidate pool, and (2) critic-guided, uncertainty-aware scaling that explicitly models reward uncertainty and down-weights low-confidence rewards to stabilize learning under sparse supervision. Across diverse recommendation scenarios and objectives, FlexRec achieves substantial gains: it improves NDCG@5 by up to 59% and Recall@5 by up to 109.4% in need-specific ranking, and further achieves up to 24.1% Recall@5 improvement under generalization settings, outperforming strong traditional recommenders and LLM-based baselines.
Research Motivation and Objectives
- Motivate the need for dynamic, need-specific recommendations beyond static objectives.
- Introduce a principled RL post-training framework for LLM-based recommenders.
- Develop fine-grained, item-level credit assignment suitable for autoregressive ranking.
- Incorporate uncertainty modeling to stabilize learning under sparse feedback.
- Show that a universal, multi-need LLM ranker can generalize across needs.
Proposed Method
- Define need-conditioned autoregressive ranking where the model outputs a permutation of a fixed candidate set conditioned on user context and a specified need.
- Propose swap-based item-level rewards via counterfactual swaps within the remaining candidate pool to provide dense, position-aware supervision.
- Train a critic to predict both reward values and their uncertainty, and integrate this uncertainty into GRPO updates to down-weight unreliable rewards.
- Apply uncertainty-aware GRPO to stabilize learning when interactions are sparse and noisy.
- Normalize and combine item-level advantages with sequence-level advantages to guide learning for both item tokens and non-item tokens.
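
The swap-based reward and uncertainty-aware scaling listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`swap_rewards`, `uncertainty_weighted`, `group_normalize`), the use of DCG@k as the ranking utility, averaging over swaps, and the `1 / (1 + variance)` weighting. The per-item variances stand in for what a trained critic would predict. Only the group normalization (subtract the group mean, divide by the group standard deviation) follows the standard GRPO recipe.

```python
import math
import statistics
from typing import Dict, List, Sequence


def dcg(ranking: Sequence[str], relevance: Dict[str, float], k: int = 5) -> float:
    """Discounted cumulative gain of the top-k items in `ranking`."""
    return sum(
        relevance.get(item, 0.0) / math.log2(pos + 2)
        for pos, item in enumerate(ranking[:k])
    )


def swap_rewards(ranking: Sequence[str], relevance: Dict[str, float], k: int = 5) -> List[float]:
    """Hypothetical item-level reward.

    For each position t, compare the actual ranking with counterfactual
    rankings in which the item at t is swapped with each item still in the
    remaining pool (positions after t). The reward is the average utility
    lost by making the swap, so a good choice at t gets a positive reward.
    """
    base = dcg(ranking, relevance, k)
    rewards = []
    for t in range(len(ranking)):
        deltas = []
        for j in range(t + 1, len(ranking)):
            swapped = list(ranking)
            swapped[t], swapped[j] = swapped[j], swapped[t]
            deltas.append(base - dcg(swapped, relevance, k))
        # The last position has no remaining candidates to swap with.
        rewards.append(sum(deltas) / len(deltas) if deltas else 0.0)
    return rewards


def uncertainty_weighted(rewards: Sequence[float], variances: Sequence[float]) -> List[float]:
    """Down-weight rewards whose predicted variance is high.

    The 1 / (1 + variance) form is an illustrative choice, not the paper's.
    """
    return [r / (1.0 + v) for r, v in zip(rewards, variances)]


def group_normalize(values: Sequence[float]) -> List[float]:
    """GRPO-style normalization of rewards within one sampled group."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    relevance = {"a": 1.0}  # only item "a" has observed positive feedback
    good = swap_rewards(["a", "b", "c"], relevance)
    bad = swap_rewards(["c", "b", "a"], relevance)
    print(good, bad)
    print(uncertainty_weighted(good, [0.0, 3.0, 3.0]))
```

In this toy case, placing the relevant item first yields a positive reward at position 0, while placing an irrelevant item first yields a negative one. Positions whose swaps do not change the utility receive zero, which is what makes the signal position-aware rather than a single score for the whole sequence.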

Experimental Results
Research Questions
- RQ1: How can we design fine-grained, item-level credit signals for autoregressive ranking in LLM-based recommenders?
- RQ2: Can uncertainty-aware updates stabilize RL from verifiable rewards under sparse feedback?
- RQ3: Do need-specific, post-trained LLM rankers generalize across different recommendation needs?
- RQ4: Is a single jointly trained model capable of serving multiple needs with inference-time need conditioning?
- RQ5: What performance gains do swap-based item-level rewards and uncertainty modeling provide over sequence-level rewards?
Key Findings
- FlexRec achieves substantial improvements in need-specific ranking, with NDCG@5 gains up to 59% and Recall@5 gains up to 109.4% in Maximizing Interest scenarios.
- Uncertainty-aware updates stabilize training and improve performance under sparse feedback, outperforming CF-based imputation and plain RLVR.
- Swap-based item-level rewards provide dense, causal credit signals and outperform non-causal or purely sequence-level rewards.
- FlexRec generalizes across needs, with zero-shot transfer showing improved performance on Explore New Topics and Trend Promotion when trained on Max-Interest.
- A single model trained jointly on all needs can act as a universal recommender, adapting behavior via need instructions at inference.
- Across datasets (KuaiRec, MovieLens-1M, ESCI), FlexRec consistently outperforms traditional rerankers and other post-trained LLM baselines.
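
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of NDCG@k and Recall@k with binary relevance. The function names are hypothetical, and the exact variants used in the paper (for example graded relevance, or a recall denominator capped at k) are not stated in this review, so treat the definitions as common defaults rather than the authors' evaluation code.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranking: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k of `ranking`."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranking[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranking: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranking[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranking = ["a", "b", "c", "d", "e", "f"]
    relevant = {"a", "f"}  # "f" falls outside the top 5
    print(recall_at_k(ranking, relevant), ndcg_at_k(ranking, relevant))
```

Because both metrics look only at the top k positions, a relative gain such as "up to 109.4% in Recall@5" reflects how many more relevant items the ranker moves into the first five slots compared with a baseline.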

This review was generated by AI and checked by a human editor.