Regret Bounds for Thompson Sampling in Episodic Restless Bandit Problems

Young Hun Jung
Department of Statistics
University of Michigan

Ambuj Tewari
Department of Statistics
University of Michigan

Abstract

Restless bandit problems are instances of non-stationary multi-armed bandits. These problems have been studied well from the optimization perspective, where the goal is to efficiently find a near-optimal policy when system parameters are known. However, very few papers adopt a learning perspective, where the parameters are unknown. In this paper, we analyze the performance of Thompson sampling in episodic restless bandits with unknown parameters. We consider a general policy map to define our competitor and prove an $\tilde{O}(\sqrt{T})$ Bayesian regret bound. Our competitor is flexible enough to represent various benchmarks including the best fixed action policy, the optimal policy, the Whittle index policy, or the myopic policy. We also present empirical results that support our theoretical findings.

1 Introduction

Restless bandits [Whittle, 1988] are variants of multi-armed bandit (MAB) problems [Robbins, 1952]. Unlike the classical MABs, the arms have non-stationary reward distributions.
Specifically, we will focus on the class of restless bandits whose arms change their states based on Markov chains. Restless bandits are also distinguished from rested bandits, where only the active arms evolve and the passive arms remain frozen. We will assume that each arm changes according to two different Markov chains depending on whether it is played or not. Because of their extra flexibility in modeling non-stationarity, restless bandits have been applied to practical problems such as dynamic channel access problems [Liu et al., 2011, 2013] and online recommendation systems [Meshram et al., 2017].

Due to the arms' non-stationary nature, playing the same set of arms for every round usually does not produce the optimal performance. This makes the optimal policy highly non-trivial, and Papadimitriou and Tsitsiklis [1999] show that it is generally PSPACE hard to identify the optimal policy for restless bandits. As a consequence, many researchers have devoted themselves to finding an efficient way to approximate the optimal policy [Liu and Zhao, 2010, Meshram et al., 2018]. This line of work primarily focuses on the optimization perspective in that the system parameters are already known.

Since the true system parameters are unavailable in many cases, it becomes important to examine restless bandits from a learning perspective. Due to the learner's additional uncertainty, however, analyzing a learning algorithm in restless bandits is significantly challenging.
Liu et al. [2011, 2013] and Tekin and Liu [2012] prove $O(\log T)$ bounds for confidence bound based algorithms, but their competitor always selects a fixed set of actions, which is known to be weak (see Section 5 for an empirical example of the weakness of the best fixed action competitor). Dai et al. [2011, 2014] show $O(\log T)$ bounds against the optimal policy, but their assumptions on the underlying model are very limited. Ortner et al. [2012] prove an $\tilde{O}(\sqrt{T})$ bound in general restless bandits, but their algorithm is intractable in general.

In a different line of work, Osband et al. [2013] study Thompson sampling in the setting of a fully observable Markov decision process (MDP) and show a Bayesian regret bound of $\tilde{O}(\sqrt{T})$ (hiding dependence on system parameters like state and action space size).
Unfortunately, this result is not applicable in our setting as ours is partially observable due to bandit feedback. Following Ortner et al. [2012], it is possible to transform our setting to the fully observable case, but then we end up having exponentially many states, which restricts the practical utility of existing results.

In this work, we analyze Thompson sampling in restless bandits where the system resets at the end of every fixed-length episode and the rewards are binary. We emphasize that this episodic assumption simplifies our analysis as the problem boils down to a finite time horizon problem. This assumption can arguably be limiting, but there are applications such as dynamic channel access problems where the channel provider might reset their system every night for a maintenance-related reason, making the episodic assumption natural. We directly tackle the partial observability and achieve a meaningful regret bound, which, when restricted to the classical MABs, matches the Thompson sampling result in that setting. We are not the first to analyze Thompson sampling in restless bandits; Meshram et al. [2016] study this type of algorithm as well, but their regret analysis remains in the one-armed case with a fixed reward of not pulling the arm. They explicitly mention that a regret analysis of Thompson sampling in the multi-armed case is an interesting open question.
2 Problem setting

We begin by introducing our setting. There are $K$ arms indexed by $k = 1, \dots, K$, and the algorithm selects $N$ arms every round. We denote the learner's action at time $t$ by a binary vector $A_t \in \{0,1\}^K$ where $|A_t|_1 = N$. We call the selected arms active and the rest passive. We assume each arm $k$ has binary states, $\{0,1\}$, which evolve as a Markov chain with transition matrix either $P_k^{\text{active}}$ or $P_k^{\text{passive}}$, depending on whether the learner pulled the arm or not. At round $t$, pulling an arm $k$ incurs a binary reward $X_{t,k}$, which is the arm's current state. As we are in the bandit setting, the learner only observes the rewards of the active arms, which we denote by $X_{t,A_t}$, and observes neither the passive arms' rewards nor their states. This feature makes our setting a partially observable Markov decision process, or POMDP. We denote the history of the learner's actions and rewards up to time $t$ by $H_t = (A_1, X_{1,A_1}, \dots, A_t, X_{t,A_t})$. We assume the system resets every episode of length $L$, which is also known to the learner. This means that at the beginning of each episode, the states of the arms are drawn from an initial distribution. The entire time horizon is denoted by $T$, and for simplicity, we assume it is a multiple of $L$, or $T = mL$.
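To make the dynamics concrete, the following minimal Python sketch simulates this environment under the stated assumptions (binary states, per-arm active/passive transition matrices, episodic resets). The class and method names are illustrative, not part of the paper.

import numpy as np

class EpisodicRestlessBandit:
    """Minimal simulator sketch: K arms with binary states, each evolving by its
    active or passive transition matrix; states reset every L rounds."""

    def __init__(self, P_active, P_passive, init_dist, L, rng=None):
        # P_active, P_passive: arrays of shape (K, 2, 2); init_dist: shape (K,),
        # giving P(initial state = 1) for each arm.
        self.P_active = np.asarray(P_active)
        self.P_passive = np.asarray(P_passive)
        self.init_dist = np.asarray(init_dist)
        self.K = self.P_active.shape[0]
        self.L = L
        self.rng = rng or np.random.default_rng(0)
        self.reset_episode()

    def reset_episode(self):
        # Draw each arm's initial state from the initial distribution.
        self.states = (self.rng.random(self.K) < self.init_dist).astype(int)
        self.t = 0

    def step(self, action):
        # action: binary vector of length K with exactly N ones.
        action = np.asarray(action)
        rewards = self.states * action            # X_{t,k} is the state of active arm k
        observed = {k: int(self.states[k]) for k in np.flatnonzero(action)}
        # Each arm transitions by its active or passive Markov chain.
        for k in range(self.K):
            P = self.P_active[k] if action[k] == 1 else self.P_passive[k]
            self.states[k] = self.rng.choice(2, p=P[self.states[k]])
        self.t += 1
        if self.t == self.L:                      # system resets after L rounds
            self.reset_episode()
        return observed, int(rewards.sum())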
2.1 Bayesian regret and competitor policy

Let $\theta$ denote the entire parameters of the system. It includes the transition matrices $P^{\text{active}}$ and $P^{\text{passive}}$, and the initial distribution of each arm's state. The learner only knows the prior distribution of this parameter at the beginning and does not have access to the exact value. In order to define a regret, we need a competitor policy, or a benchmark. We first define a class of deterministic policies and policy mappings.

Definition 1. A deterministic policy $\pi$ takes a time index and history $(t, H_{t-1})$ as an input and outputs a fixed action $A_t = \pi(t, H_{t-1})$. A deterministic policy mapping $\mu$ takes a system parameter $\theta$ as an input and outputs a deterministic policy $\pi = \mu(\theta)$.
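Definition 1 can be mirrored in code: a policy is a function of $(t, H_{t-1})$, and a policy mapping turns a parameter $\theta$ into such a function. The sketch below is one hypothetical instance (a simple fixed-action benchmark that plays the $N$ arms with the highest initial success rate under $\theta$, not necessarily the best fixed action); the paper leaves $\mu$ abstract, and the dictionary layout of $\theta$ is our own assumption.

import numpy as np

def make_fixed_action_mapping(N):
    """Return a hypothetical deterministic policy mapping mu: theta -> policy."""
    def mu(theta):
        # Rank arms by their initial success rate under the given parameter.
        ranked = np.argsort(-np.asarray(theta["init_dist"]))[:N]
        def policy(t, history):
            # Definition 1: a deterministic policy maps (t, H_{t-1}) to an action A_t.
            action = np.zeros(len(theta["init_dist"]), dtype=int)
            action[ranked] = 1
            return action
        return policy
    return mu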
We fix a deterministic policy mapping $\mu$ and let our algorithm compete against a deterministic policy $\pi^\star = \mu(\theta^\star)$, where $\theta^\star$ represents the true system parameter, which is unknown to the learner. We keep our competitor policy abstract mainly because we are in the non-stationary setting. Unlike the classical (stationary) MABs, pulling the same set of arms with the largest expected rewards is not necessarily optimal. Moreover, it is in general PSPACE hard to compute the optimal policy when $\theta^\star$ is given. Regarding these statements, we refer the readers to the book by Gittins et al. [1989]. As a consequence, researchers have identified conditions under which the (efficient) myopic policy is optimal [Ahmad et al., 2009] or proven that a tractable index-based policy has a reasonable performance against the optimal policy [Liu and Zhao, 2010].

Algorithm 1 Thompson sampling in restless bandits
1: Input: prior $Q$, episode length $L$, policy mapping $\mu$
2: Initialize posterior $Q_1 = Q$, history $H = \emptyset$
3: for episodes $l = 1, \dots, m$ do
4:   Draw a parameter $\theta_l \sim Q_l$ and compute the policy $\pi_l = \mu(\theta_l)$
5:   Set $H_0 = \emptyset$
6:   for $t = 1, \dots, L$ do
7:     Select $N$ active arms $A_t = \pi_l(t, H_{t-1})$
8:     Observe rewards $X_{t,A_t}$ and update $H_t$
9:   end for
10:  Append $H_L$ to $H$ and update the posterior distribution $Q_{l+1}$ using $H$
11: end for
We observe that most of the proposed policies, including the optimal policy, the myopic policy, or the index-based policy, are deterministic. Therefore, researchers can plug in whatever competitor policy of their choice, and our regret bound will apply as long as the chosen policy mapping is deterministic.

Before defining the regret, we introduce a value function
\[
V^\pi_{\theta,i}(H) = \mathbb{E}_{\pi,\theta}\Big[\sum_{j=i}^{L} A_j \cdot X_j \,\Big|\, H\Big]. \tag{1}
\]
This is the expected reward of running a policy $\pi$ from round $i$ to $L$ where the system parameter is $\theta$ and the starting history is $H$. Note that the benchmark policy $\pi^\star$ obtains $V^{\pi^\star}_{\theta^\star,1}(\emptyset)$ rewards per episode in expectation. Thus, we can define the regret as
\[
R(T; \theta^\star) = m V^{\pi^\star}_{\theta^\star,1}(\emptyset) - \mathbb{E}_{\theta^\star}\Big[\sum_{t=1}^{T} A_t \cdot X_t\Big]. \tag{2}
\]
If an algorithm chooses to fix a policy $\pi_l$ for the entire episode $l$, which is the case of our algorithm, then the regret can be written as
\[
R(T; \theta^\star) = m V^{\pi^\star}_{\theta^\star,1}(\emptyset) - \mathbb{E}_{\theta^\star}\sum_{l=1}^{m} V^{\pi_l}_{\theta^\star,1}(\emptyset) = \mathbb{E}_{\theta^\star}\sum_{l=1}^{m}\Big( V^{\pi^\star}_{\theta^\star,1}(\emptyset) - V^{\pi_l}_{\theta^\star,1}(\emptyset)\Big).
\]
We particularly focus on the case where $\theta^\star$ is random and bound the following Bayesian regret,
\[
BR(T) = \mathbb{E}_{\theta^\star \sim Q}\, R(T; \theta^\star),
\]
where $Q$ is a prior distribution over the set of system parameters. We assume that the prior is known to the learner.
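As a sanity check on definition (1), the per-episode value $V^\pi_{\theta,1}(\emptyset)$ can be approximated by Monte Carlo rollouts whenever $\theta$ and $\pi$ are known. The sketch below reuses the EpisodicRestlessBandit interface from the earlier sketch and is illustrative only.

def estimate_value(env, policy, num_rollouts=10_000):
    """Monte Carlo estimate of V^pi_{theta,1}(empty history): the average
    episodic reward of running `policy` for one episode of length env.L."""
    total = 0.0
    for _ in range(num_rollouts):
        env.reset_episode()
        history = []                          # per-episode history H_{t-1}
        for t in range(1, env.L + 1):
            action = policy(t, history)
            observed, reward = env.step(action)
            history.append((action, observed))
            total += reward
    return total / num_rollouts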
We caution our readers that there is at least one other regret definition in the literature, which is called either frequentist regret or worst-case regret. For this type of regret, one views $\theta^\star$ as a fixed unknown object and directly bounds $R(T; \theta^\star)$. Even though our primary interest is to bound the Bayesian regret, we can establish a connection to the frequentist regret in the special case where the prior $Q$ has a finite support and the benchmark is the optimal policy (see Corollary 6).

3 Algorithm

Our algorithm is an instance of Thompson sampling or posterior sampling, first proposed by Thompson [1933]. At the beginning of episode $l$, the algorithm draws a system parameter $\theta_l$ from the posterior and plays $\pi_l = \mu(\theta_l)$ throughout the episode. Once an episode is over, it updates the posterior based on additional observations. Algorithm 1 describes the steps.
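A minimal Python rendering of Algorithm 1's loop is given below. The helpers sample_posterior, policy_mapping, and update_posterior are hypothetical placeholders for steps 4 and 10; how they are implemented depends on the prior and the chosen benchmark, and the environment is assumed to follow the EpisodicRestlessBandit interface sketched earlier.

def thompson_sampling(env, prior, policy_mapping, sample_posterior,
                      update_posterior, num_episodes):
    """Sketch of Algorithm 1: per-episode posterior sampling."""
    posterior = prior          # Q_1 = Q
    full_history = []          # H: per-episode histories across all episodes
    for l in range(1, num_episodes + 1):
        theta_l = sample_posterior(posterior)        # step 4: theta_l ~ Q_l
        policy_l = policy_mapping(theta_l)           # step 4: pi_l = mu(theta_l)
        episode_history = []                         # step 5: H_0 = empty
        env.reset_episode()
        for t in range(1, env.L + 1):
            action = policy_l(t, episode_history)    # step 7: A_t = pi_l(t, H_{t-1})
            observed, _ = env.step(action)           # step 8: observe X_{t, A_t}
            episode_history.append((action, observed))
        full_history.append(episode_history)         # step 10: append H_L to H
        posterior = update_posterior(posterior, full_history)
    return full_history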
We want to point out that the history $H$ fulfills two different purposes. One is to update the posterior $Q_l$, and the other is as an input to the policy $\pi_l$. For the latter, however, we do not need the entire history as the arms reset every episode. That is why we set $H_0 = \emptyset$ (step 5) and feed $H_{t-1}$ to $\pi_l$ (step 7). Furthermore, as we assume that the arms evolve based on Markov chains, the history $H_{t-1}$ can be summarized as
\[
(r_1, n_1, \dots, r_K, n_K), \tag{3}
\]
which means that an arm $k$ was played $n_k$ rounds ago and $r_k$ is the observed reward in that round. If an arm $k$ has never been played in the episode, then $n_k$ becomes $t$, and $r_k$ becomes the expected reward from the initial distribution based on $\theta_l$.
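The summary (3) can be maintained incrementally. The sketch below shows one possible bookkeeping (hypothetical helper names); the indexing convention, with an unplayed arm carrying $n_k = t$ at round $t$ and $r_k$ initialized to the expected initial reward under the sampled parameter, is our reading of the description above.

def init_summary(K, init_success_rate):
    # Summary of the empty history H_0, used at round t = 1: every arm is
    # "unplayed", so n_k = 1 and r_k is its expected initial reward.
    return [{"r": init_success_rate[k], "n": 1} for k in range(K)]

def update_summary(summary, action, observed):
    # Incorporate round t's action and observations to obtain the summary of
    # H_t, used at round t + 1.
    for k, s in enumerate(summary):
        if action[k] == 1:
            s["r"], s["n"] = observed[k], 1   # arm k was just played
        else:
            s["n"] += 1                        # arm k ages by one round
    return summary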
As we assume the episode length is fixed to be $L$, there are $L$ possible values for $n_k$. Due to the binary reward assumption, $r_k$ can take three values, including the case where the arm $k$ is never played. From these, we can infer that there are $(3L)^K$ possible tuples of $(r_1, n_1, \dots, r_K, n_K)$. By considering these tuples as states and following the reasoning of Ortner et al. [2012], one can view our POMDP as a fully observable MDP. Then one can use the existing algorithms for fully observable MDPs (e.g., Osband et al. [2013]), but the regret bounds easily become vacuous since the number of states depends exponentially on the number of arms $K$. Additionally, as we assumed a policy mapping, one might argue for using existing expert learning or classical MAB algorithms, treating potential policies as experts or arms. This is possible, but the number of potential policies corresponds to the size of the support of $Q$, which can be very large or even infinite. For this reason, existing algorithms are not efficient and/or their regret bounds become too loose.

Due to its generality, it is hard to analyze the time and space complexity of Algorithm 1. The two major steps are computing the policy (step 4) and updating the posterior (step 10). Computing the policy depends on our choice of the competitor mapping. If the competitor policy has better performance but is harder to compute, then our regret bound gets more meaningful as the benchmark is stronger, but the running time gets longer. Regarding the posterior update, the computational burden depends on the choice of the prior $Q$ and its support. If there is a closed-form update, then the step is computationally cheap, but otherwise the burden increases with respect to the size of the support.
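As one concrete illustration of a closed-form update (not prescribed by the paper), suppose the prior places independent Beta distributions on each arm's initial success rate and on each row of its active transition matrix. Then the directly observed pieces of data, namely a play at round 1 (a draw from the initial distribution) and two plays of the same arm in consecutive rounds (one observed active-chain transition), admit simple Beta-count updates. Observations separated by a gap require marginalizing over unobserved states, so the exact posterior is generally not conjugate; the sketch below folds in only the directly observed pieces, which is exactly where the growing burden mentioned above comes from.

def beta_update_consecutive(prior_counts, episode_history):
    """Hypothetical partial Beta update from one episode's history.
    prior_counts["init"][k] = [alpha, beta] pseudo-counts for arm k's initial state.
    prior_counts["active"][k][s] = [alpha, beta] pseudo-counts for
    P(next state = 1 | current state = s) under arm k's active chain.
    episode_history: list of (action, observed) pairs, observed = {arm: state}."""
    last = {}                                  # arm -> (round, state) of its last play
    for t, (action, observed) in enumerate(episode_history, start=1):
        for k, state in observed.items():
            if t == 1:
                # A play at round 1 is a direct draw from the initial distribution.
                a, b = prior_counts["init"][k]
                prior_counts["init"][k] = [a + state, b + (1 - state)]
            if k in last and last[k][0] == t - 1:
                # Consecutive plays reveal one active-chain transition.
                prev_state = last[k][1]
                a, b = prior_counts["active"][k][prev_state]
                prior_counts["active"][k][prev_state] = [a + state, b + (1 - state)]
            last[k] = (t, state)
    return prior_counts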
4 Regret bound

In this section, we bound the Bayesian regret of Algorithm 1 by $\tilde{O}(\sqrt{T})$. A key idea in our analysis of Thompson sampling is that the distributions of $\theta^\star$ and $\theta_l$ are identical given the history up to the end of episode $l-1$ (e.g., see Lattimore and Szepesvári, Chp. 36). To state it more formally, let $\sigma(H)$ be the $\sigma$-algebra generated by the history $H$. Then we call a random variable $X$ $\sigma(H)$-measurable, or simply $H$-measurable, if its value is deterministically known given the information $\sigma(H)$. Similarly, we call a random function $f$ $H$-measurable if its mapping is deterministically known given $\sigma(H)$. We record as a lemma an observation made by Russo and Van Roy [2014].

Lemma 2 (Expectation identity). Suppose $\theta^\star$ and $\theta_l$ have the same distribution given $H$. For any $H$-measurable function $f$, we have
\[
\mathbb{E}[f(\theta^\star) \mid H] = \mathbb{E}[f(\theta_l) \mid H].
\]

Recall that we assume the competitor mapping $\mu$ is deterministic. Furthermore, the value function $V^\pi_{\theta,i}(\emptyset)$ in (1) is deterministic given $\theta$ and $\pi$. This implies $\mathbb{E}[V^{\pi^\star}_{\theta^\star,i}(\emptyset) \mid H] = \mathbb{E}[V^{\pi_l}_{\theta_l,i}(\emptyset) \mid H]$, where $H$ is the history up to the end of episode $l-1$. This observation leads to the following regret decomposition.
Lemma 3 (Regret decomposition). The Bayesian regret of Algorithm 1 can be decomposed as
\[
BR(T) = \mathbb{E}_{\theta^\star \sim Q} \sum_{l=1}^{m} \mathbb{E}_{\theta_l \sim Q_l}\Big[ V^{\pi^\star}_{\theta^\star,1}(\emptyset) - V^{\pi_l}_{\theta^\star,1}(\emptyset)\Big] = \mathbb{E}_{\theta^\star \sim Q} \sum_{l=1}^{m} \mathbb{E}_{\theta_l \sim Q_l}\Big[ V^{\pi_l}_{\theta_l,1}(\emptyset) - V^{\pi_l}_{\theta^\star,1}(\emptyset)\Big].
\]
Proof. The first equality is a simple rewriting of (2) because Algorithm 1 fixes a policy $\pi_l$ for the entire episode $l$. Then we apply Lemma 2 along with the tower rule to get
\[
\mathbb{E}_{\theta^\star \sim Q} \sum_{l=1}^{m} \mathbb{E}_{\theta_l \sim Q_l} V^{\pi^\star}_{\theta^\star,1}(\emptyset) = \mathbb{E}_{\theta^\star \sim Q} \sum_{l=1}^{m} \mathbb{E}_{\theta_l \sim Q_l} V^{\pi_l}_{\theta_l,1}(\emptyset).
\]
Note that we can compute $V^{\pi_l}_{\theta_l,1}(\emptyset)$ as we know $\theta_l$ and $\pi_l$. We can also infer the value of $V^{\pi_l}_{\theta^\star,1}(\emptyset)$ from the algorithm's observations. The main point of Lemma 3 is to rewrite the Bayesian regret using terms that are relatively easy to analyze.

Next, we define the Bellman operator
\[
T^\pi_\theta V(H_{t-1}) = \mathbb{E}_{\theta,\pi}\big[ A_t \cdot X_t + V(H_t) \mid H_{t-1}\big].
\]
It is not hard to check that $V^\pi_{\theta,i} = T^\pi_\theta V^\pi_{\theta,i+1}$. The next lemma further decomposes the regret.

Lemma 4 (Per-episode regret decomposition). Fix $\theta^\star$ and $\theta_l$, and let $H_0 = \emptyset$. Then we have
\[
V^{\pi_l}_{\theta_l,1}(H_0) - V^{\pi_l}_{\theta^\star,1}(H_0) = \mathbb{E}_{\theta^\star,\pi_l} \sum_{t=1}^{L} (T^{\pi_l}_{\theta_l} - T^{\pi_l}_{\theta^\star})\, V^{\pi_l}_{\theta_l,t+1}(H_{t-1}).
\]
Proof. Using the relation $V^\pi_{\theta,i} = T^\pi_\theta V^\pi_{\theta,i+1}$, we may write
\[
V^{\pi_l}_{\theta_l,1}(H_0) - V^{\pi_l}_{\theta^\star,1}(H_0) = (T^{\pi_l}_{\theta_l} V^{\pi_l}_{\theta_l,2} - T^{\pi_l}_{\theta^\star} V^{\pi_l}_{\theta^\star,2})(H_0) = (T^{\pi_l}_{\theta_l} - T^{\pi_l}_{\theta^\star}) V^{\pi_l}_{\theta_l,2}(H_0) + T^{\pi_l}_{\theta^\star}(V^{\pi_l}_{\theta_l,2} - V^{\pi_l}_{\theta^\star,2})(H_0).
\]
The second term can be written as $\mathbb{E}_{\theta^\star,\pi_l}[(V^{\pi_l}_{\theta_l,2} - V^{\pi_l}_{\theta^\star,2})(H_1) \mid H_0]$, and we can repeat this $L$ times to obtain the desired equation.
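To spell out the "repeat this $L$ times" step (our own expansion, not an excerpt from the paper): iterating the same two-term decomposition on the remainder and applying the tower rule gives
\[
V^{\pi_l}_{\theta_l,1}(H_0) - V^{\pi_l}_{\theta^\star,1}(H_0)
= \sum_{t=1}^{L} \mathbb{E}_{\theta^\star,\pi_l}\Big[(T^{\pi_l}_{\theta_l} - T^{\pi_l}_{\theta^\star})\, V^{\pi_l}_{\theta_l,t+1}(H_{t-1}) \,\Big|\, H_0\Big]
+ \mathbb{E}_{\theta^\star,\pi_l}\Big[(V^{\pi_l}_{\theta_l,L+1} - V^{\pi_l}_{\theta^\star,L+1})(H_L) \,\Big|\, H_0\Big],
\]
where the last term vanishes because $V^{\pi}_{\theta,L+1} \equiv 0$ by definition (1) (an empty sum), recovering the statement of Lemma 4.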
Now we are ready to prove our main theorem. A complete proof can be found in Appendix A.

Theorem 5 (Bayesian regret bound of Thompson sampling). The Bayesian regret of Algorithm 1 satisfies the following bound:
\[
BR(T) = O\big(\sqrt{KL^3N^3T \log T}\big) = O\big(\sqrt{mKL^4N^3 \log(mL)}\big).
\]

Remark. If the system is the classical stationary MAB, then it corresponds to the case $L = 1$, $N = 1$, and our result reproduces the $O(\sqrt{KT \log T})$ result [Lattimore and Szepesvári, Chp. 36]. This suggests our bound is optimal in $K$ and $T$ up to a logarithmic factor. Further, when $N \ge \frac{K}{2}$, we can think of the problem as choosing the passive arms, and the smaller bound with $N$ replaced by $K - N$ would apply. When $L = 1$, the problem becomes a combinatorial bandit of choosing $N$ active arms out of $K$. Cesa-Bianchi and Lugosi [2012] propose an algorithm with a regret bound of $O(\sqrt{KNT \log K})$ under the assumption that the loss is always bounded by $1$. Since our reward can be as large as $N$, our bound has the same dependence on $N$ as theirs, suggesting a tight dependence of our bound on $N$.
Proof Sketch. We fix an episode $l$ and analyze the regret in this episode. Let $t_l = (l-1)L$ so that the episode starts at time $t_l + 1$. Define $N_l(k,r,n) = \sum_{t=1}^{t_l} \mathbb{1}\{A_{t,k} = 1, r_k = r, n_k = n\}$, which counts the number of rounds where the arm $k$ was chosen by the learner with history $r_k = r$ and $n_k = n$ (see (3) for the definition). Note that $k \in [K]$, $r \in \{0, 1, \omega(k)\}$, and $n \in [L]$, where $\omega(k)$ is the initial success rate of the arm $k$. This implies there are $3KL$ tuples of $(k,r,n)$. Let $\mu_\theta(k,r,n)$ denote the conditional probability of $X_k = 1$ given a history $(r,n)$ and a system parameter $\theta$. Also let $\hat{\mu}_l(k,r,n)$ denote the empirical mean of this quantity (using the $N_l(k,r,n)$ past observations, with the estimate set to $0$ if $N_l(k,r,n) = 0$). Then define
\[
\Theta_l = \bigg\{ \theta \;\bigg|\; \forall (k,r,n),\; |(\mu_\theta - \hat{\mu}_l)(k,r,n)| \le \sqrt{\frac{2\log(1/\delta)}{1 \vee N_l(k,r,n)}} \bigg\}.
\]
Since $\hat{\mu}_l(k,r,n)$ is $H_{t_l}$-measurable, so is the set $\Theta_l$. Using the Hoeffding inequality, one can show
\[
\mathbb{P}(\theta^\star \notin \Theta_l) = \mathbb{P}(\theta_l \notin \Theta_l) \le 3KL\delta.
\]
In other words, we can claim that with high probability, $|\mu_{\theta_l}(k,r,n) - \mu_{\theta^\star}(k,r,n)|$ is small for all $(k,r,n)$.
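The quantities in this step are simple empirical statistics. The following sketch (with a hypothetical data layout, and $\delta$ as the confidence parameter) shows how the counts $N_l(k,r,n)$, the empirical means $\hat{\mu}_l$, and the Hoeffding widths defining $\Theta_l$ could be tabulated from past observations.

import math
from collections import defaultdict

def confidence_table(observations, delta):
    """observations: iterable of tuples (k, r, n, x), meaning arm k was played
    with history summary (r_k, n_k) = (r, n) and produced the binary reward x.
    Returns, for each observed (k, r, n): the count N_l, the empirical mean
    mu_hat, and the width sqrt(2 log(1/delta) / max(1, N_l)) used in Theta_l.
    Unobserved tuples are simply absent (their estimate is taken to be 0)."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for k, r, n, x in observations:
        counts[(k, r, n)] += 1
        sums[(k, r, n)] += x
    table = {}
    for key, N in counts.items():
        mu_hat = sums[key] / N
        width = math.sqrt(2.0 * math.log(1.0 / delta) / max(1, N))
        table[key] = (N, mu_hat, width)
    return table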
We now turn our attention to the following Bellman operator:
\[
T^{\pi_l}_\theta V^{\pi_l}_{\theta_l,t}(H_{t-1}) = \mathbb{E}_{\theta,\pi_l}\big[ A_{t_l+t} \cdot X_{t_l+t} + V^{\pi_l}_{\theta_l,t}(H_t) \mid H_{t-1}\big].
\]
Since $\pi_l$ is a deterministic policy, $A_{t_l+t}$ is also deterministic given $H_{t-1}$ and $\pi_l$. Let $(k_1, \dots, k_N)$ be the active arms at time $t_l + t$ and write $\mu_\theta(k_i, r_{k_i}, n_{k_i}) = \mu_{\theta,i}$. Then we can rewrite
\[
T^{\pi_l}_\theta V^{\pi_l}_{\theta_l,t}(H_{t-1}) = \sum_{i=1}^{N} \mu_{\theta,i} + \sum_{x \in \{0,1\}^N} P_{\theta,x}\, V^{\pi_l}_{\theta_l,t}\big(H_{t-1} \cup (A_{t_l+t}, x)\big), \quad \text{where } P_{\theta,x} = \prod_{i=1}^{N} \mu_{\theta,i}^{x_i}(1 - \mu_{\theta,i})^{1-x_i}.
\]
Under the event that $\theta^\star, \theta_l \in \Theta_l$, we have
\[
|\mu_{\theta_l,i} - \mu_{\theta^\star,i}| \le 1 \wedge \sqrt{\frac{8\log(1/\delta)}{1 \vee N_l(k_i, r_{k_i}, n_{k_i})}} =: \Delta_i(t_l + t),
\]
where the dependence on $t_l + t$ comes from the mapping from $i$ to $k_i$. When $\mu_{\theta_l,i}$ and $\mu_{\theta^\star,i}$ are close for all $(k,r,n)$, we can actually bound the difference between the Bellman operators as
\[
\big|(T^{\pi_l}_{\theta^\star} - T^{\pi_l}_{\theta_l})\, V^{\pi_l}_{\theta_l,t}(H_{t-1})\big| \le 3LN \sum_{i=1}^{N} \Delta_i(t_l + t).
\]
Then by applying