QUICK REVIEW

[Paper Review] An Actor-Critic Contextual Bandit Algorithm for Personalized Mobile Health Interventions

Huitian Lei, Yangyi Lu | arXiv (Cornell University) | Jun 28, 2017

Advanced Bandit Algorithms Research | References: 20 | Cited by: 44

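One-Sentence Summary

This paper proposes an online actor-critic contextual bandit algorithm for learning personalized Just-In-Time Adaptive Interventions (JITAIs) in mobile health, decoupling policy learning (the actor) from reward modeling (the critic). Under a linear reward assumption, the method yields consistent and asymptotically normal estimates, and numerical experiments show robustness to model misspecification, which advances data-driven, adaptive health behavior interventions.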

ABSTRACT

Increasing technological sophistication and widespread use of smartphones and wearable devices provide opportunities for innovative and highly personalized health interventions. A Just-In-Time Adaptive Intervention (JITAI) uses real-time data collection and communication capabilities of modern mobile devices to deliver interventions in real-time that are adapted to the in-the-moment needs of the user. The lack of methodological guidance in constructing data-based JITAIs remains a hurdle in advancing JITAI research despite the increasing popularity of JITAIs among clinical scientists. In this article, we make a first attempt to bridge this methodological gap by formulating the task of tailoring interventions in real-time as a contextual bandit problem. Interpretability requirements in the domain of mobile health lead us to formulate the problem differently from existing formulations intended for web applications such as ad or news article placement. Under the assumption of linear reward function, we choose the reward function (the "critic") parameterization separately from a lower dimensional parameterization of stochastic policies (the "actor"). We provide an online actor-critic algorithm that guides the construction and refinement of a JITAI. Asymptotic properties of the actor-critic algorithm are developed and backed up by numerical experiments. Additional numerical experiments are conducted to test the robustness of the algorithm when idealized assumptions used in the analysis of contextual bandit algorithm are breached.
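
The separation between critic and actor described in the abstract can be written down compactly. The block below is one common way to express such a setup, assuming a binary action and a logistic policy class; the symbols $f$, $g$, $\mu^*$, and $\theta$ are illustrative notation, not taken from the abstract.

```latex
% Critic: linear reward model with feature map f(s, a) (assumed notation)
\mathbb{E}[R \mid S = s, A = a] = f(s, a)^\top \mu^*

% Actor: stochastic policy with lower-dimensional features g(s) (assumed logistic form)
\pi_\theta(A = 1 \mid S = s) = \frac{\exp\{g(s)^\top \theta\}}{1 + \exp\{g(s)^\top \theta\}},
\qquad \dim(\theta) \le \dim(\mu^*)
```

Under this reading, the critic can use a rich feature map to fit the reward well, while the actor keeps a short, interpretable parameter vector.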

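Research Motivation and Objectives

Address the methodological gap in constructing data-driven Just-In-Time Adaptive Interventions (JITAIs) for mobile health.
Formulate personalized JITAI design as a contextual bandit problem that prioritizes interpretability, in contrast to formulations intended for web applications.
Develop an online actor-critic algorithm that learns user-specific policies from sequential sensor and self-report data.
Establish asymptotic consistency and normality of the algorithm under idealized assumptions.
Evaluate robustness when key assumptions (such as a linear reward or a known burden) are violated.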

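Proposed Method

Formulate JITAI learning as a contextual bandit problem with context-dependent actions and rewards.
Parameterize the critic (reward model) and the actor (stochastic policy) separately, for interpretability and decoupled learning.
Use online updates to refine the policy and reward estimates as new data arrive, supporting real-time adaptation.
Use two-timescale stochastic approximation: the critic updates faster so that it can guide the actor updates.
Construct confidence intervals for the policy parameters with the percentile-t bootstrap.
Assume a linear reward function; estimate the critic by least squares and update the actor by policy gradient.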
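
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the feature maps `reward_features` and `policy_features`, the simulated reward model, the ridge term, the quadratic penalty weight `lam`, and all step sizes are hypothetical choices made for the example. The critic is refit by regularized least squares at every decision point, and the actor takes a few gradient-ascent steps on the estimated average reward minus a quadratic penalty, which is a simplification of a true two-timescale scheme.

```python
import numpy as np

def reward_features(s, a):
    # Hypothetical critic features f(s, a): intercept, context, action, interaction.
    return np.array([1.0, s, a, a * s])

def policy_features(s):
    # Hypothetical lower-dimensional actor features g(s).
    return np.array([1.0, s])

def prob_treat(theta, s):
    # Logistic stochastic policy: probability of choosing action 1 in context s.
    return 1.0 / (1.0 + np.exp(-(policy_features(s) @ theta)))

def critic_update(S, A, R, ridge=1.0):
    # Critic: ridge-regularized least squares for the linear reward parameter mu.
    F = np.array([reward_features(s, a) for s, a in zip(S, A)])
    return np.linalg.solve(F.T @ F + ridge * np.eye(F.shape[1]), F.T @ np.asarray(R))

def actor_update(theta, mu, S, lam=0.1, step=0.2, n_steps=20):
    # Actor: gradient ascent on the critic's estimated average reward
    # minus the penalty lam * theta^T M theta (assumed regularizer).
    G = np.array([policy_features(s) for s in S])
    # Estimated benefit of action 1 over action 0 in each observed context.
    adv = np.array([(reward_features(s, 1) - reward_features(s, 0)) @ mu for s in S])
    M = G.T @ G / len(S)
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(G @ theta)))
        grad = (G * (adv * p * (1.0 - p))[:, None]).mean(axis=0) - 2.0 * lam * (M @ theta)
        theta = theta + step * grad
    return theta

def run_online(T=300, seed=0):
    # Simulated user: the treatment effect is 0.2 + 1.0 * s (made up for this sketch).
    rng = np.random.default_rng(seed)
    theta, mu = np.zeros(2), np.zeros(4)
    S, A, R = [], [], []
    for _ in range(T):
        s = rng.normal()                                  # observe context
        a = int(rng.random() < prob_treat(theta, s))      # sample action from the policy
        r = 1.0 + 0.5 * s + a * (0.2 + 1.0 * s) + rng.normal(scale=0.5)
        S.append(s); A.append(a); R.append(r)
        mu = critic_update(S, A, R)                       # critic first
        theta = actor_update(theta, mu, S)                # then actor, warm-started
    return theta, mu

theta_hat, mu_hat = run_online()
```

In this toy setting the treatment effect grows with the context `s`, so the learned slope `theta_hat[1]` should come out positive, meaning the policy treats more often when `s` is large. The penalty keeps the policy stochastic, which preserves exploration for the critic.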

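Experimental Results

Research Questions

RQ1: Can the actor-critic framework be adapted to mobile health settings, where interpretability and real-time learning are essential?
RQ2: Under standard assumptions, does the proposed online algorithm achieve asymptotic consistency and normality for the optimal policy estimate?
RQ3: How robust is the algorithm when the linear reward assumption or the known burden parameter is violated?
RQ4: How well does the algorithm estimate the policy parameters across different sample sizes and burden effects?
RQ5: With finite samples, can the method construct reliable confidence intervals for the policy parameters?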

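Key Findings

Under idealized i.i.d. and linear reward assumptions, the algorithm achieves asymptotic consistency and normality of the policy parameter estimates.
Numerical experiments show that the algorithm remains robust when the linear reward assumption is violated, particularly in the presence of nonlinear or unobserved burden effects.
At a sample size of 500, the mean squared error (MSE) of the policy parameter estimates drops markedly, falling below 0.01 under the most favorable conditions.
In most scenarios, the coverage probability of the percentile-t bootstrap confidence intervals is close to the nominal level (0.95), although some under-coverage appears under high burden effects, marked with asterisks in Table 22.
The bias of the policy parameter estimates decreases as the sample size grows: for example, with τ=0.8 the bias is about 0.55 at n=200 and falls to about 0.38 at n=500, indicating that accuracy improves as more data accumulate.
Even with the burden parameter λ fixed at its optimal (oracle) value, the algorithm successfully learns the optimal policy; Tables 16–23 show very low bias and MSE in most cases.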
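
The percentile-t bootstrap intervals referred to in the findings can be illustrated on a toy estimator. The sketch below applies the technique to a plain sample mean, which is an assumption made for brevity; the paper applies it to policy parameter estimates, and the function name `percentile_t_ci` and its defaults are hypothetical.

```python
import numpy as np

def percentile_t_ci(x, alpha=0.05, n_boot=2000, seed=0):
    # Percentile-t (bootstrap-t) confidence interval for the mean of x.
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    est = x.mean()
    se = x.std(ddof=1) / np.sqrt(n)
    t_stats = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x, size=n, replace=True)      # resample with replacement
        se_b = xb.std(ddof=1) / np.sqrt(n)            # standard error within the resample
        t_stats[b] = (xb.mean() - est) / se_b         # studentized bootstrap statistic
    q_lo, q_hi = np.quantile(t_stats, [alpha / 2, 1 - alpha / 2])
    # The quantiles are reversed when inverting the studentized statistic.
    return est - q_hi * se, est - q_lo * se

rng = np.random.default_rng(1)
sample = rng.normal(loc=1.0, scale=2.0, size=200)
lower, upper = percentile_t_ci(sample)
```

Unlike the plain percentile bootstrap, this variant studentizes each resample by its own standard error, which is why it needs a standard error estimate for the quantity of interest.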

This review was generated by AI and reviewed by human editors.