QUICK REVIEW

[Paper Review] Efficient Algorithms for Adversarial Contextual Learning

Vasilis Syrgkanis, Akshay Krishnamurthy | arXiv (Cornell University) | 2016-02-08

Advanced Bandit Algorithms Research | References: 30 | Citations: 45

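One-Line Summary

Using a Follow-the-Perturbed-Leader framework built on an optimization oracle, this paper proposes the first oracle-efficient, sublinear-regret algorithms for adversarial contextual bandits and online combinatorial optimization. They achieve regret $O(T^{3/4}\sqrt{K\log N})$ in the transductive setting and $O(T^{2/3}d^{3/4}K\sqrt{\log N})$ in the small-separator setting, where $T$ is the number of rounds, $K$ the number of actions, $N$ the number of policies, and $d$ the size of the separator.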
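For readers who want the quantity behind these bounds spelled out, the following is the standard regret definition written in terms of losses. The symbols $x_t$ (context), $\ell_t$ (loss vector), $a_t$ (chosen action), $\Pi$ (policy class) and the constant $C$ are notation assumed here for illustration, and the adversary is taken to be oblivious for simplicity. The second expression shows why the transductive bound counts as sublinear: the per-round regret vanishes as $T$ grows.

```latex
\mathrm{Reg}_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t(a_t)\right]
\;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t\bigl(\pi(x_t)\bigr),
\qquad
\frac{\mathrm{Reg}_T}{T} \;\le\; C\,\frac{\sqrt{K \log N}}{T^{1/4}}
\;\xrightarrow[T \to \infty]{}\; 0 .
```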

ABSTRACT

We provide the first oracle efficient sublinear regret algorithms for adversarial versions of the contextual bandit problem. In this problem, the learner repeatedly makes an action on the basis of a context and receives reward for the chosen action, with the goal of achieving reward competitive with a large class of policies. We analyze two settings: i) in the transductive setting the learner knows the set of contexts a priori, ii) in the small separator setting, there exists a small set of contexts such that any two policies behave differently in one of the contexts in the set. Our algorithms fall into the follow the perturbed leader family \cite{Kalai2005} and achieve regret $O(T^{3/4}\sqrt{K\log(N)})$ in the transductive setting and $O(T^{2/3} d^{3/4} K\sqrt{\log(N)})$ in the separator setting, where $K$ is the number of actions, $N$ is the number of baseline policies, and $d$ is the size of the separator. We actually solve the more general adversarial contextual semi-bandit linear optimization problem, whilst in the full information setting we address the even more general contextual combinatorial optimization. We provide several extensions and implications of our algorithms, such as switching regret and efficient learning with predictable sequences.

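Motivation and Objectives

To close the gap between statistical performance and computational efficiency in adversarial contextual learning.
To develop computationally efficient algorithms even when the policy space is exponentially large.
To achieve sublinear regret in adversarial settings using only oracle access to the batch (offline) optimization problem.
To extend the Follow-the-Perturbed-Leader framework to adversarial contextual and semi-bandit settings.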

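Proposed Method

Proposes a new Follow-the-Perturbed-Leader (FTPL) algorithm that relies solely on an optimization oracle to select policies.
Applies the FTPL framework to the transductive setting, in which all contexts are known in advance.
Introduces the small-separator setting, in which there is a small set of contexts such that any two distinct policies differ on at least one context in the set.
Uses the Natarajan dimension, which generalizes the VC dimension, as the complexity measure of the policy class.
Extends the algorithms to the semi-bandit and bandit settings using the technique of Neu & Bartók (2013).
Maintains computational efficiency through randomized perturbations and oracle-based policy selection.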
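The steps above can be illustrated with a minimal full-information sketch of perturbed-leader selection in the transductive setting. This is not the authors' implementation: the function names, the brute-force oracle, the choice to redraw the perturbation every round, and the Laplace scale `1/epsilon` are all assumptions made for illustration. Each policy is a tuple mapping a context index to an action, and the perturbation is added as one fake sample per known context, so the oracle only ever sees a list of (context, loss vector) pairs.

```python
import itertools
import numpy as np


def brute_force_oracle(policies, samples):
    """Hypothetical stand-in for the optimization oracle.

    Returns the index of the policy with the smallest total loss over
    `samples`, a list of (context, loss_vector) pairs.
    """
    totals = [sum(loss[pi[x]] for x, loss in samples) for pi in policies]
    return int(np.argmin(totals))


def ftpl_transductive(policies, contexts, rounds, epsilon, rng):
    """Full-information perturbed-leader loop (illustrative sketch).

    `contexts` is the set of contexts known in advance, and `rounds` is a
    list of (context, loss_vector) pairs revealed one at a time.
    """
    n_actions = len(rounds[0][1])
    history = []
    total_loss = 0.0
    for x_t, loss_t in rounds:
        # One fake sample per known context with Laplace-distributed losses
        # (scale 1/epsilon is an assumed parametrization).
        fake = [(x, rng.laplace(scale=1.0 / epsilon, size=n_actions))
                for x in contexts]
        # The leader is computed on real history plus the fake samples.
        leader = policies[brute_force_oracle(policies, history + fake)]
        total_loss += loss_t[leader[x_t]]
        history.append((x_t, loss_t))
    return total_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # All 4 policies over 2 contexts and 2 actions.
    policies = list(itertools.product(range(2), repeat=2))
    rounds = [(t % 2, rng.uniform(size=2)) for t in range(100)]
    print(ftpl_transductive(policies, [0, 1], rounds, epsilon=0.1, rng=rng))
```

The brute-force oracle enumerates every policy, which is exactly what an oracle-efficient method avoids; in practice it would be replaced by a black-box solver for the offline problem.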

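Results

Research Questions

RQ1: Can sublinear regret be achieved in the adversarial contextual bandit problem while remaining computationally efficient when the policy space is large?
RQ2: Can the Follow-the-Perturbed-Leader framework be extended to adversarial contextual and semi-bandit settings using only oracle access?
RQ3: Which structural properties of the policy class enable efficient learning in adversarial contextual settings?
RQ4: How does the size of the smallest separator affect the regret bound in online learning?
RQ5: Can sublinear regret be achieved in the non-transductive setting when both the contexts and the loss sequence are adversarial?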

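Key Findings

The algorithm achieves regret $O(T^{3/4}\sqrt{K\log N})$ in the transductive setting, where $T$ is the number of rounds, $K$ the number of actions, and $N$ the number of policies.
In the small-separator setting the regret bound is $O(T^{2/3}d^{3/4}K\sqrt{\log N})$, where $d$ is the size of the smallest separator.
When the policy class has finite Natarajan dimension, the algorithm retains sublinear regret even against adversarial and adaptive sequences of contexts and losses.
For a policy class with Natarajan dimension $\nu$, optimizing $\epsilon$ gives regret $O((d\nu\log K\log(dK/\nu))^{1/4}\sqrt{T})$.
Even for a policy class of VC dimension 1, no algorithm can achieve sublinear regret against an adaptive adversary in the non-transductive setting.
This indicates that, for general policy classes, additional structure such as prior (transductive) knowledge of the contexts is needed to achieve sublinear regret in adversarial contextual learning.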
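To make the separator size $d$ and the two headline rates concrete, here is a small sketch, assuming policies are given explicitly as tuples that map a context index to an action. The helper names are hypothetical, the exhaustive search is only feasible for toy classes, and the bound functions drop the constants hidden by the $O(\cdot)$ notation.

```python
import itertools
import math


def is_separator(policies, subset):
    """True if every pair of policies disagrees on some context in `subset`."""
    return all(any(p[x] != q[x] for x in subset)
               for p, q in itertools.combinations(policies, 2))


def smallest_separator(policies, n_contexts):
    """Exhaustive search for a minimum-size separator (toy sizes only)."""
    for size in range(n_contexts + 1):
        for subset in itertools.combinations(range(n_contexts), size):
            if is_separator(policies, subset):
                return subset
    return None  # e.g. when two policies are identical everywhere


def transductive_bound(T, K, N):
    """T^(3/4) * sqrt(K log N), constants omitted."""
    return T ** 0.75 * math.sqrt(K * math.log(N))


def separator_bound(T, K, N, d):
    """T^(2/3) * d^(3/4) * K * sqrt(log N), constants omitted."""
    return T ** (2 / 3) * d ** 0.75 * K * math.sqrt(math.log(N))


if __name__ == "__main__":
    # Threshold policies over 4 contexts: action 1 iff context index >= k.
    thresholds = [tuple(1 if x >= k else 0 for x in range(4)) for k in range(5)]
    sep = smallest_separator(thresholds, 4)
    print(sep, separator_bound(10**6, 2, len(thresholds), len(sep)))
```

Neighbouring threshold policies differ on exactly one context, so every context has to be in the separator. This is the kind of class where $d$ grows with the number of contexts and the small-separator bound stops being informative.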


This review was generated by AI and checked by a human editor.