QUICK REVIEW

[Paper Review] Thompson Sampling for Combinatorial Semi-Bandits

Siwei Wang, Wei Chen | arXiv (Cornell University) | 2018. 03. 13.

Advanced Bandit Algorithms Research | References: 27 | Citations: 28

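One-Line Summary

This paper applies Thompson sampling to the general combinatorial multi-armed bandit (CMAB) framework with independent base-arm distributions, in the form of Combinatorial Thompson Sampling (CTS), and obtains an improved regret bound through Bayesian sampling and a new analysis technique. It establishes a distribution-dependent regret bound of $O(m\log K_{\max}\log T/\Delta_{\min})$, which is tighter than the bounds known for existing UCB-based methods, and in the matroid setting its regret bound matches the theoretical lower bound.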

ABSTRACT

In this paper, we study the application of the Thompson sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We first analyze the standard TS algorithm for the general CMAB model when the outcome distributions of all the base arms are independent, and obtain a distribution-dependent regret bound of $O(m\log K_{\max}\log T / \Delta_{\min})$, where $m$ is the number of base arms, $K_{\max}$ is the size of the largest super arm, $T$ is the time horizon, and $\Delta_{\min}$ is the minimum gap between the expected reward of the optimal solution and any non-optimal solution. This regret upper bound is better than the $O(m(\log K_{\max})^2\log T / \Delta_{\min})$ bound in prior works. Moreover, our novel analysis techniques can help to tighten the regret bounds of other existing UCB-based policies (e.g., ESCB), as we improve the method of counting the cumulative regret. Then we consider the matroid bandit setting (a special class of CMAB model), where we could remove the independence assumption across arms and achieve a regret upper bound that matches the lower bound. Except for the regret upper bounds, we also point out that one cannot directly replace the exact offline oracle (which takes the parameters of an offline problem instance as input and outputs the exact best action under this instance) with an approximation oracle in TS algorithm for even the classical MAB problem. Finally, we use some experiments to show the comparison between regrets of TS and other existing algorithms, the experimental results show that TS outperforms existing baselines.
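
As a quick reading aid for the two bounds quoted in the abstract: dividing the prior bound by the new one leaves a single factor of $\log K_{\max}$. This treats the hidden Big-O constants as comparable, which is an assumption made only for illustration.

$$
\frac{m(\log K_{\max})^{2}\log T/\Delta_{\min}}{m\log K_{\max}\log T/\Delta_{\min}} = \log K_{\max}
$$

Under that assumption, the improvement grows with the size of the largest super arm and is independent of $m$, $T$, and $\Delta_{\min}$.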

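Research Motivation and Objectives

Develop and analyze Thompson sampling for the general combinatorial multi-armed bandit (CMAB) framework with independent base-arm distributions.
Establish a sharper regret bound for CTS than those known for existing UCB-based policies such as ESCB and CUCB.
Extend the analysis to the matroid bandit setting, where CTS achieves a regret bound that matches the information-theoretic lower bound.
Investigate the limitations that arise when the exact offline oracle in Thompson sampling is replaced with an approximation oracle.
Validate CTS experimentally against state-of-the-art algorithms on CMAB and matroid bandit problems.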

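Proposed Method

Apply Thompson sampling to CMAB by sampling parameters from the posterior distribution and selecting a super arm based on those samples.
Use a Bayesian update that refines the posterior distribution after each observation, following Bayes' theorem.
Introduce a new regret analysis technique that improves how the cumulative regret is counted, which yields a sharper bound.
Establish a regret upper bound of $O(m\log K_{\max}\log T/\Delta_{\min})$ for independent base-arm distributions.
Extend the analysis to matroid bandits, where the independence assumption can be removed, and achieve a regret bound that matches the lower bound.
Show that the exact offline oracle cannot be directly replaced with an approximation oracle, even in the classical MAB problem.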
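
The sample-then-optimize loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes Bernoulli base-arm outcomes, uniform Beta(1, 1) priors, and a toy problem in which the super arm is any set of `k` base arms, so that picking the `k` largest samples is an exact oracle. The names `cts` and `top_k_oracle` are hypothetical.

```python
import random


def top_k_oracle(theta, k):
    """Exact oracle for the assumed toy problem: indices of the k largest values."""
    return sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k]


def cts(true_means, k, horizon, seed=0):
    """Run a Thompson-sampling loop with semi-bandit feedback.

    Returns the cumulative pseudo-regret and the final Beta parameters.
    """
    rng = random.Random(seed)
    m = len(true_means)
    a = [1] * m  # Beta "success" counts, starting from the Beta(1, 1) prior
    b = [1] * m  # Beta "failure" counts
    best = sum(sorted(true_means, reverse=True)[:k])  # optimal expected reward
    regret = 0.0
    for _ in range(horizon):
        # 1. Sample one mean per base arm from its current posterior.
        theta = [rng.betavariate(a[i], b[i]) for i in range(m)]
        # 2. Ask the exact oracle for the best super arm under the samples.
        super_arm = top_k_oracle(theta, k)
        # 3. Observe each played base arm and update its posterior.
        for i in super_arm:
            outcome = 1 if rng.random() < true_means[i] else 0
            a[i] += outcome
            b[i] += 1 - outcome
        regret += best - sum(true_means[i] for i in super_arm)
    return regret, a, b


if __name__ == "__main__":
    total_regret, a, b = cts([0.9, 0.8, 0.2, 0.1], k=2, horizon=2000)
    print("cumulative pseudo-regret:", round(total_regret, 2))
```

Only the arms in the chosen super arm are updated, which is the semi-bandit feedback model. Swapping `top_k_oracle` for another exact solver changes the combinatorial problem without touching the sampling or update steps.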

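Experimental Results

Research Questions

RQ1: Can Thompson sampling achieve a sharper regret bound than existing UCB-based policies in the general CMAB model?
RQ2: How does the regret of CTS compare with that of CUCB, C-KL-UCB, and ESCB in the general CMAB and matroid bandit settings?
RQ3: What is the theoretical regret bound of CTS in the matroid bandit setting, and does it match the information-theoretic lower bound?
RQ4: Why does using an approximation oracle in Thompson sampling fail even for the classical MAB problem?
RQ5: Can the proposed analysis technique be generalized to tighten the regret bounds of other UCB-based policies?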

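Key Results

The CTS algorithm achieves a distribution-dependent regret bound of $O(m\log K_{\max}\log T/\Delta_{\min})$, which is sharper than the earlier $O(m(\log K_{\max})^2\log T/\Delta_{\min})$ bound.
The new regret analysis technique improves how the cumulative regret is counted, which enables the sharper bound, and it can also be applied to other UCB-based policies such as ESCB.
In the matroid bandit setting, CTS achieves a regret bound that matches the information-theoretic lower bound without assuming independence across arms.
In experiments on maximum spanning tree and shortest path problems, CTS consistently outperforms CUCB, C-KL-UCB, and ESCB in cumulative regret.
Even against baselines run with parameters that carry no theoretical guarantee (e.g., C-KL-UCB-m), CTS performs better, and the gap becomes more pronounced as $T$ grows.
The study confirms that the exact offline oracle in Thompson sampling cannot be replaced with an approximation oracle. According to this review, the failure arises from a basic constraint of Bayesian inference and occurs even in the classical MAB problem.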
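
For the maximum spanning tree experiments mentioned above, one natural choice of exact offline oracle is the greedy (Kruskal-style) algorithm, since greedy selection is optimal on a graphic matroid. The sketch below is an assumption about how such an oracle could look, not a description of the authors' implementation. The function name `max_spanning_tree` and its argument layout are hypothetical.

```python
def max_spanning_tree(num_nodes, edges, weights):
    """Greedy maximum-weight spanning tree (Kruskal with union-find).

    edges   : list of (u, v) node pairs; each edge is one base arm
    weights : weights[i] is the sampled mean of edge i
    Returns the indices of the chosen edges (the super arm).
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    # Consider edges from heaviest to lightest; keep those that join two components.
    for i in sorted(range(len(edges)), key=lambda j: weights[j], reverse=True):
        root_u, root_v = find(edges[i][0]), find(edges[i][1])
        if root_u != root_v:
            parent[root_u] = root_v
            chosen.append(i)
    return chosen


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
    sampled = [0.9, 0.8, 0.1, 0.5]
    print(max_spanning_tree(4, edges, sampled))
```

In a Thompson sampling loop, `weights` would be the posterior samples drawn in the current round, and the returned edge indices would be the base arms to play and update.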


This review was generated by AI and reviewed by a human editor.