QUICK REVIEW

[Paper Review] Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms

Wei Chen, Yajun Wang|arXiv (Cornell University)|2014. 07. 31.

Advanced Bandit Algorithms Research | References: 37 | Citations: 123

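One-Line Summary

This paper proposes a generalized combinatorial multi-armed bandit (CMAB) framework that includes probabilistically triggered arms and applies to nonlinear-reward settings such as social influence maximization and online advertising. It presents the CUCB algorithm, which guarantees O(log n) distribution-dependent regret, a tighter bound than prior work, assuming bounded smoothness of the reward and access to an (α,β)-approximation oracle.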

ABSTRACT

We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems, where subsets of base arms with unknown distributions form super arms. In each round, a super arm is played and the base arms contained in the super arm are played and their outcomes are observed. We further consider the extension in which more base arms could be probabilistically triggered based on the outcomes of already triggered arms. The reward of the super arm depends on the outcomes of all played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an offline (α,β)-approximation oracle that takes the means of the outcome distributions of arms and outputs a super arm that with probability β generates an α fraction of the optimal expected reward. The objective of an online learning algorithm for CMAB is to minimize (α,β)-approximation regret, which is the difference between the αβ fraction of the expected reward when always playing the optimal super arm, and the expected reward of playing super arms according to the algorithm. We provide the CUCB algorithm that achieves O(log n) distribution-dependent regret, where n is the number of rounds played, and we further provide distribution-independent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound of the UCB1 algorithm (up to a constant factor) for the classical MAB problem, and it significantly improves the regret bound in an earlier paper on combinatorial bandits with linear rewards. We apply our CMAB framework to two new applications, probabilistic maximum coverage and social influence maximization, both having nonlinear reward structures. In particular, application to social influence maximization requires our extension on probabilistically triggered arms.
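
The (α,β)-approximation regret described in the abstract can be written compactly. The notation below is an assumption introduced for illustration and does not appear in this review: opt_μ is the optimal expected reward under mean vector μ, r_μ(S) is the expected reward of super arm S, and S_t^A is the super arm that algorithm A plays in round t.

```latex
\[
\mathrm{Reg}^{A}_{\mu,\alpha,\beta}(n)
  \;=\; n \cdot \alpha \cdot \beta \cdot \mathrm{opt}_{\mu}
  \;-\; \mathbb{E}\!\left[ \sum_{t=1}^{n} r_{\mu}\!\left(S_t^{A}\right) \right]
\]
```

With an exact oracle (α = β = 1) this reduces to the usual regret against always playing the optimal super arm.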

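Research Motivation and Goals

Formalize a general CMAB framework for combinatorial arms with nonlinear reward functions.
Extend CMAB to handle cases where playing one arm probabilistically triggers other arms.
Design an online learning algorithm (CUCB) that minimizes (α,β)-approximation regret under limited feedback.
Provide both distribution-dependent and distribution-independent regret bounds for the extended framework.
Apply the framework to real-world problems: probabilistic maximum coverage in online advertising and influence maximization in social networks.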

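Proposed Method

Proposes a CMAB framework in which super arms are subsets of base arms; the reward may be nonlinear, depends on the outcomes of all played arms, and must satisfy a bounded smoothness condition.
Introduces probabilistically triggered arms, where some arms are activated probabilistically by the outcomes of other arms, much as in viral advertising.
Uses an (α,β)-approximation oracle that, given the means of the arms' outcome distributions, returns a super arm achieving at least an α fraction of the optimal expected reward with probability β.
Designs the CUCB (Combinatorial Upper Confidence Bound) algorithm, which balances exploration and exploitation using confidence bounds on the arm means.
Derives an O(log n) distribution-dependent regret bound from the confidence-bound analysis together with the smoothness of the reward function.
Establishes distribution-independent regret bounds in terms of the bounded smoothness function f(x), with explicit dependence on |V|, |E|, and p_min in the influence maximization application.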
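
The loop described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes Bernoulli base arms, a toy reward equal to the sum of the played arms' outcomes, and a hypothetical exact top-k oracle (α = β = 1). The confidence radius sqrt(3 ln t / (2 T_i)) is the UCB-style adjustment assumed here, and the names `cucb` and `oracle` are made up for this example.

```python
import math
import random


def cucb(true_means, k, n_rounds, seed=0):
    """CUCB-style sketch on a toy top-k problem with Bernoulli base arms."""
    rng = random.Random(seed)
    m = len(true_means)
    counts = [0] * m      # T_i: how many times arm i has been observed
    means = [0.0] * m     # empirical mean outcome of arm i

    def oracle(mu_bar):
        # Hypothetical exact oracle: the k arms with the largest adjusted means.
        return sorted(range(m), key=lambda i: mu_bar[i], reverse=True)[:k]

    for t in range(1, n_rounds + 1):
        mu_bar = []
        for i in range(m):
            if counts[i] == 0:
                mu_bar.append(1.0)  # unobserved arms are treated optimistically
            else:
                radius = math.sqrt(3 * math.log(t) / (2 * counts[i]))
                mu_bar.append(min(1.0, means[i] + radius))
        super_arm = oracle(mu_bar)
        for i in super_arm:
            # Semi-bandit feedback: the outcome of every played base arm is seen.
            outcome = 1.0 if rng.random() < true_means[i] else 0.0
            counts[i] += 1
            means[i] += (outcome - means[i]) / counts[i]
    return counts, means


counts, means = cucb([0.2, 0.8, 0.3, 0.7, 0.1], k=2, n_rounds=3000)
print(counts)
```

Swapping `oracle` for an approximation algorithm, and the sum reward for a nonlinear one, is where the (α,β) factors and the bounded smoothness condition would enter.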

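Experimental Results

Research Questions

RQ1: Can the general CMAB framework be extended to handle probabilistically triggered arms while still retaining tight regret bounds?
RQ2: How does the CUCB algorithm achieve O(log n) distribution-dependent regret when the reward function is nonlinear and bounded-smooth?
RQ3: In combinatorial bandit settings where the offline problem is computationally hard, how does the (α,β)-approximation oracle affect regret?
RQ4: Is the dependence of the regret bound on 1/p_i necessary for probabilistically triggered arms, particularly in influence maximization?
RQ5: Can the theoretical bounds be tightened or strengthened for reward functions whose bounded smoothness function has the form f(x) = γx^ω (ω < 1)?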
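
For reference, the two reward assumptions behind these questions can be stated roughly as follows. This is a sketch of the non-triggered form: μ and μ' are two mean vectors, S is a super arm, Λ is a perturbation size, and f is assumed to be strictly increasing. With probabilistically triggered arms, the maximum would range over every arm that S can trigger.

```latex
\begin{align*}
\text{Monotonicity:} \quad
  & \mu_i \le \mu'_i \ \text{ for all } i
    \;\Longrightarrow\; r_{\mu}(S) \le r_{\mu'}(S) \\
\text{Bounded smoothness:} \quad
  & \max_{i \in S} \lvert \mu_i - \mu'_i \rvert \le \Lambda
    \;\Longrightarrow\; \lvert r_{\mu}(S) - r_{\mu'}(S) \rvert \le f(\Lambda)
\end{align*}
```

RQ5 then concerns the special case f(x) = γx^ω.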

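Key Results

The CUCB algorithm achieves O(log n) distribution-dependent regret, matching the bound of the UCB1 algorithm for the classical MAB problem up to a constant factor.
For social influence maximization, the distribution-dependent regret bound for each arm is O(|V|²|E|² log n / Δ_min² p_i), plus an additional O(|E|Δ_max) term.
The distribution-independent regret bound is O(|V|√(48|E|³n log n / p*)) + O(|E|Δ_max), which depends polynomially on the problem size.
Nonlinearity of the reward is handled through the bounded smoothness property; for influence maximization the bounded smoothness function is f(x) = |E||V|x.
Earlier work contained an incorrect claim about bounded smoothness for influence maximization; the corrected analysis shows that the original function f(x) = |E||V|x remains valid.
The regret analysis is tight and significantly improves on the regret bound of earlier combinatorial bandit work with linear rewards.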
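
To see where a triggering probability such as p_i comes from, the following sketch estimates by Monte Carlo how often each edge (base arm) is triggered under an independent cascade from a fixed seed set. It is an illustration under stated assumptions, not code from the paper: an edge counts as triggered once its source node becomes active, and the graph, probabilities, and function names are hypothetical.

```python
import random


def cascade_triggered_edges(graph, probs, seeds, rng):
    """Run one independent cascade and return the set of triggered edges."""
    active = set(seeds)
    frontier = list(seeds)
    triggered = set()
    while frontier:
        u = frontier.pop()
        for v in graph.get(u, []):
            triggered.add((u, v))  # the edge is triggered because u is active
            success = rng.random() < probs[(u, v)]
            if success and v not in active:
                active.add(v)
                frontier.append(v)
    return triggered


def estimate_trigger_probs(graph, probs, seeds, runs=2000, seed=0):
    """Fraction of cascade runs in which each edge was triggered."""
    rng = random.Random(seed)
    hits = {edge: 0 for edge in probs}
    for _ in range(runs):
        for edge in cascade_triggered_edges(graph, probs, seeds, rng):
            hits[edge] += 1
    return {edge: hits[edge] / runs for edge in probs}


# Toy path a -> b -> c seeded at "a" (hypothetical example data).
graph = {"a": ["b"], "b": ["c"]}
probs = {("a", "b"): 0.5, ("b", "c"): 0.5}
print(estimate_trigger_probs(graph, probs, ["a"]))
```

In this toy graph the edge (a, b) is triggered in every run, while (b, c) is triggered only when b is activated. An arm that is rarely triggered is rarely observed, so its estimate converges slowly, which is the intuition behind a 1/p_i factor in the bound.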

This review was generated by AI and checked by a human editor.