QUICK REVIEW

[Paper Review] Dynamic Assortment Optimization with Changing Contextual Information

Xi Chen, Yining Wang | arXiv (Cornell University) | Oct 31, 2018

Advanced Bandit Algorithms Research | References: 22 | Cited by: 27

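ONE-SENTENCE SUMMARY

This paper proposes a UCB-based policy for dynamic assortment optimization under a non-stationary contextual MNL model in which product utilities depend linearly on time-varying features. The policy achieves a regret bound of $\widetilde{O}(d\sqrt{T})$, which is optimal up to logarithmic factors when the assortment size $K$ is a constant. The paper also gives an efficient approximation algorithm for the combinatorial assortment optimization step in high-dimensional feature spaces.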
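The optimality claim can be read off by comparing this upper bound with the lower bound $\Omega(d\sqrt{T}/K)$ that the paper establishes. As a rough worked comparison, ignoring constants and writing $\mathrm{polylog}$ for whatever logarithmic factors $\widetilde{O}$ hides (a placeholder used here for illustration, not the paper's notation):

$$
\frac{\text{upper bound}}{\text{lower bound}}
= \frac{d\sqrt{T}\cdot\mathrm{polylog}}{d\sqrt{T}/K}
= K\cdot\mathrm{polylog}
$$

The two bounds therefore differ by a factor of $K$ and logarithmic terms, and when $K = O(1)$ only the logarithmic terms remain.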

ABSTRACT

In this paper, we study the dynamic assortment optimization problem under a finite selling season of length $T$. At each time period, the seller offers an arriving customer an assortment of substitutable products under a cardinality constraint, and the customer makes the purchase among offered products according to a discrete choice model. Most existing work associates each product with a real-valued fixed mean utility and assumes a multinomial logit choice (MNL) model. In many practical applications, feature/contextual information of products is readily available. In this paper, we incorporate the feature information by assuming a linear relationship between the mean utility and the feature. In addition, we allow the feature information of products to change over time so that the underlying choice model can also be non-stationary. To solve the dynamic assortment optimization under this changing contextual MNL model, we need to simultaneously learn the underlying unknown coefficient and make the decision on the assortment. To this end, we develop an upper confidence bound (UCB) based policy and establish the regret bound on the order of $\widetilde O(d\sqrt{T})$, where $d$ is the dimension of the feature and $\widetilde O$ suppresses logarithmic dependence. We further establish the lower bound $\Omega(d\sqrt{T}/K)$, where $K$ is the cardinality constraint of an offered assortment, which is usually small. When $K$ is a constant, our policy is optimal up to logarithmic factors. In the exploitation phase of the UCB algorithm, we need to solve a combinatorial optimization for assortment optimization based on the learned information. We further develop an approximation algorithm and an efficient greedy heuristic. The effectiveness of the proposed policy is further demonstrated by our numerical studies.

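MOTIVATION AND OBJECTIVES

Address dynamic assortment optimization when product utilities evolve over time because the contextual features change.
Design a contextual bandit policy that learns the unknown utility coefficients while selecting the best assortment under a cardinality constraint.
Handle non-stationary choice behavior by modeling the mean utility as a linear function of time-varying product features.
Design an efficient approximation algorithm for the computationally hard combinatorial optimization step inside the UCB policy.
Establish tight regret bounds, optimal up to logarithmic factors, under reasonable assumptions.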

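PROPOSED METHOD

Model the problem with a linear contextual MNL model in which $u_{tj} = v_{tj}^\top \theta_0$, where $v_{tj}$ is the time-varying product feature vector and $\theta_0$ is the unknown coefficient vector.
Propose a UCB-based policy that balances exploration and exploitation by maintaining a confidence region for the unknown coefficient $\theta_0$.
Introduce a multivariate approximation algorithm (Algorithm 5) that uses random projections to reduce the high-dimensional combinatorial optimization problem to several univariate problems.
Project the feature vectors onto random vectors $y^{(\ell)}$ sampled from the unit sphere, then solve each reduced problem efficiently.
Use a greedy heuristic to select the best subset across the projections, maximizing the expected revenue plus the confidence-bound term.
Establish theoretical guarantees with concentration inequalities and spectral analysis, which bound both the approximation error and the regret.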
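To make the model concrete, here is a minimal Python sketch of the linear contextual MNL model for a single time step, together with an exhaustive search over assortments of size at most $K$. It is an illustration under stated assumptions, not the paper's policy: the per-product revenues `revenues`, the normalization of the no-purchase utility to zero, and all function names are assumptions made here, and the search uses a fixed coefficient vector with no confidence-bound term.

```python
from itertools import combinations

import numpy as np


def choice_probs(features, theta, assortment):
    """MNL purchase probabilities for the products in `assortment`.

    Assumes the no-purchase option has utility 0, which gives the
    constant 1 in the denominator.
    """
    utilities = features[list(assortment)] @ theta  # u_j = v_j^T theta
    weights = np.exp(utilities)
    return weights / (1.0 + weights.sum())


def expected_revenue(features, theta, revenues, assortment):
    """Expected revenue of offering `assortment` under the MNL model."""
    probs = choice_probs(features, theta, assortment)
    return float(probs @ revenues[list(assortment)])


def best_assortment_bruteforce(features, theta, revenues, K):
    """Enumerate every non-empty assortment of size <= K (illustration only)."""
    n = features.shape[0]
    best_set, best_val = (), 0.0
    for size in range(1, K + 1):
        for subset in combinations(range(n), size):
            val = expected_revenue(features, theta, revenues, subset)
            if val > best_val:
                best_set, best_val = subset, val
    return best_set, best_val


# Hypothetical toy instance: 6 products, 3 features, at most 2 offered.
rng = np.random.default_rng(0)
N, d, K = 6, 3, 2
features = rng.normal(size=(N, d))        # rows play the role of v_{tj}
theta = rng.normal(size=d)                # stands in for theta_0
revenues = rng.uniform(0.5, 1.0, size=N)  # assumed per-product revenues
best_set, best_val = best_assortment_bruteforce(features, theta, revenues, K)
print(best_set, round(best_val, 4))
```

The enumeration visits every subset of size at most $K$, so its cost grows combinatorially with the number of products. That is only workable for toy instances, which is one way to see why the paper turns to an approximation algorithm and a greedy heuristic for this step.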

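EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a UCB-based policy achieve sublinear regret in dynamic assortment optimization when product utilities depend on time-varying contextual features?
RQ2: How does the performance of the proposed policy scale with the feature dimension $d$ and the time horizon $T$?
RQ3: What is the fundamental limit on regret in this non-stationary contextual MNL setting, and how close does the policy come to it?
RQ4: Can an efficient approximation algorithm be designed for the NP-hard combinatorial optimization step of the UCB framework with high-dimensional features?
RQ5: How does the choice of the number of random projections $L$ affect the trade-off between regret and computational cost?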

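Key Findings

The proposed UCB policy achieves a regret bound of $\widetilde{O}(d\sqrt{T})$, which is optimal up to logarithmic factors when the assortment size $K$ is a constant.
A lower bound of $\Omega(d\sqrt{T}/K)$ is established, which matches the upper bound up to logarithmic factors when $K$ is a constant.
An approximation algorithm is developed that achieves a $\sqrt{d}$-approximation when $L \asymp \log(1/\delta)$ and a $2$-approximation when $L \asymp e^{O(d)}\log(1/\delta)$.
With approximation error $\varepsilon = T^{-1/2}$ and failure probability $\delta = T^{-2}$, the computational cost per time step is $\widetilde{O}(K^9 N \nu^3 (1+\nu)^8 d^4 T^4)$.
Under the $\sqrt{d}$-approximation, the cumulative regret is bounded by $O(\sqrt{d}) \cdot \mathrm{Regret}^*$; under the $2$-approximation, it is bounded by $O(1) \cdot \mathrm{Regret}^*$.
Numerical experiments demonstrate the effectiveness of the proposed policy in practical settings with changing contextual information.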
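The random projection step behind these approximation guarantees can be sketched in Python as follows. This only covers drawing $L$ directions from the unit sphere and projecting each feature vector onto them. It does not reproduce how Algorithm 5 solves or combines the resulting univariate problems, and the constant `c` in the choice of $L$ is a hypothetical placeholder for the constant hidden in $L \asymp \log(1/\delta)$.

```python
import math

import numpy as np


def sample_unit_directions(num_directions, dim, rng):
    """Draw directions uniformly from the unit sphere by normalizing Gaussians."""
    g = rng.normal(size=(num_directions, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def project_features(features, directions):
    """Return an (L, N) array whose row l holds the scalars <v_j, y^(l)>."""
    return directions @ features.T


rng = np.random.default_rng(0)
d, N, T = 5, 8, 1000
delta = T ** -2.0  # failure probability delta = T^{-2}, as in the findings
c = 1.0            # hypothetical constant; the paper's constant is not given here
L = math.ceil(c * math.log(1.0 / delta))

features = rng.normal(size=(N, d))            # toy feature vectors
directions = sample_unit_directions(L, d, rng)
projections = project_features(features, directions)
print(L, projections.shape)
```

Each row of `projections` replaces the $d$-dimensional features with one scalar per product, which is the sense in which the high-dimensional problem becomes several univariate ones. A larger $L$ means more univariate problems to solve per time step, which is the regret versus computation trade-off raised in RQ5.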

This review was generated by AI and checked by a human editor.