QUICK REVIEW

[Paper Review] Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management

Owen Shen, Patrick Jaillet|arXiv (Cornell University)|Feb 2, 2026

Supply Chain and Inventory Management | Citations: 0

One-line summary

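The paper proposes choice-model-assisted RL (CA-DQN), which uses a fixed discrete choice model to impute delayed rewards at decision time so that Q-learning updates can be made immediately. It proves a convergence bound and examines the method's robustness and limits in a simulator calibrated from hotel booking data.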

ABSTRACT

We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose choice-model-assisted RL: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61,619 hotel bookings (1,088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm-Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4-2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.

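Motivation and objectives

Addresses revenue management with delayed feedback, where part of the reward is realized only days after booking through cancellations and modifications.
Embeds a fixed discrete choice model as a partial world model to impute the delayed reward at decision time.
Establishes a theoretical convergence guarantee for tabular Q-learning with model-imputed targets.
Empirically evaluates robustness to distribution shift and structural misspecification in a simulator calibrated from real hotel booking data.
Characterizes when partial behavioral models improve robustness and when they introduce bias.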

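Proposed method

Defines a delayed-feedback MDP with shocks, separating immediate rewards from delayed rewards.
Embeds a pre-trained, fixed discrete choice model (DCM) as a partial world model to impute the delayed reward at decision time.
Introduces model-imputed sampling, which generates synthetic (r', s') samples from the DCM for the Q-learning (DQN) updates.
Proves a finite-time convergence bound: $\|Q_t - Q^*\|_\infty = O(\varepsilon/(1-\gamma) + t^{-1/2}\sqrt{\log(...)})$, where $\varepsilon$ reflects the DCM error.
Presents an adaptive two-timescale framework in which the DCM guides learning while interpretability and tractability are preserved.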
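The idea of imputing the delayed reward at decision time can be illustrated with a toy example. The following is a minimal sketch, not the paper's implementation: it assumes a single-state pricing problem, a binary logit standing in for the fixed discrete choice model, and a 1/n step size. The price grid, the coefficients, and all function names are hypothetical.

```python
import math
import random

# Hypothetical price grid and fixed (pre-calibrated) logit coefficients.
PRICES = [80.0, 100.0, 120.0]
BOOK_COEF = (4.0, -0.04)     # booking utility   = 4.0 - 0.04 * price
CANCEL_COEF = (-3.0, 0.015)  # cancel utility    = -3.0 + 0.015 * price


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def book_prob(price):
    """Simulated environment: probability that a customer books."""
    return logistic(BOOK_COEF[0] + BOOK_COEF[1] * price)


def cancel_prob(price):
    """Fixed choice model: probability that a booking is later cancelled."""
    return logistic(CANCEL_COEF[0] + CANCEL_COEF[1] * price)


def imputed_delayed_reward(price):
    """Expected revenue lost to a future cancellation, per the fixed model."""
    return -cancel_prob(price) * price


def train(steps=20000, gamma=0.5, seed=0):
    """Tabular Q-learning on one state, updating immediately with imputed targets."""
    rng = random.Random(seed)
    q = [0.0] * len(PRICES)
    visits = [0] * len(PRICES)
    for _ in range(steps):
        a = rng.randrange(len(PRICES))  # uniform exploration
        price = PRICES[a]
        booked = rng.random() < book_prob(price)
        r_now = price if booked else 0.0
        # The delayed component is imputed now instead of observed days later.
        r_delayed_hat = imputed_delayed_reward(price) if booked else 0.0
        target = r_now + r_delayed_hat + gamma * max(q)
        visits[a] += 1
        q[a] += (target - q[a]) / visits[a]  # 1/n step size
    return q


if __name__ == "__main__":
    q_values = train()
    best = max(range(len(PRICES)), key=lambda i: q_values[i])
    print("Q-values:", q_values, "greedy price:", PRICES[best])
```

In this sketch, any error in `cancel_prob` shifts every target by a fixed amount, which is the kind of model-induced bias that the $\varepsilon$ term in the bound is meant to capture.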

Figure 3 : Learning curves in stationary settings. Both MB-DQN (orange) and Choice-Assisted DQN (blue) converge to similar performance levels across all training durations (n=20 seeds per method, shaded regions show 95% confidence intervals). No significant differences are detected ( $p>0.05$ at all

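Experimental results

Research questions

RQ1: Can CA-DQN match MB-DQN in stationary settings when the DCM is correctly specified?
RQ2: Does it improve robustness to in-family shifts (demand/competition) without sacrificing performance?
RQ3: How does CA-DQN perform under structural misspecification of the DCM (IIA violations, heterogeneity, temporal dynamics)?
RQ4: What are the theoretical convergence properties of Q-learning with fixed model-imputed targets?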

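Key findings

CA-DQN converges to a near-optimal Q-function, with an irreducible bias due to DCM approximation error and a decaying sampling term.
In stationary settings with a correctly specified DCM, CA-DQN shows no statistically detectable difference from MB-DQN.
CA-DQN improves robustness under several in-family shifts, with gains of up to 12.4% in some scenarios after multiple-comparison correction.
Under structural misspecification, CA-DQN degrades consistently (1.4% to 2.6% lower revenue in the misspecification tests), showing a clear trade-off between bias and robustness.
Experiments in a simulator calibrated from 61,619 hotel bookings make the trade-off between robustness and bias explicit.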
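The abstract names the multiple-comparison correction used for the shift scenarios as Holm-Bonferroni. A minimal sketch of that step-down procedure follows. The function name and the p-values are made up for illustration and are not taken from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m, alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for four scenarios.
    print(holm_bonferroni([0.001, 0.04, 0.03, 0.005]))
```

With these hypothetical inputs, the two smallest p-values pass their thresholds (0.0125 and about 0.0167), while 0.03 fails against 0.025, so the procedure stops there.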

Figure 4 : Robustness under parameter shifts across 10 scenarios. Choice-Assisted DQN (blue bars) shows mixed results compared to MB-DQN (orange bars): significant improvements in 4 scenarios (up to +12.4% under low demand), significant underperformance in 2 scenarios (up to -9.6% under high competi


This review was generated by AI and checked by a human editor.