Nonstochastic Multiarmed Bandits with Unrestricted Delays

Tobias Sommer Thune (University of Copenhagen, Copenhagen, Denmark; tobias.thune@di.ku.dk), Nicolò Cesa-Bianchi (DSRC, Università degli Studi di Milano), Yevgeny Seldin (University of Copenhagen)

Algorithm 1: DEW (Delayed Exponential Weights)
Input: learning rate $\eta$; upper bound on the delays $d_{\max}$.
Truncate the learning rate: $\eta' = \min\{\eta, (4ed_{\max})^{-1}\}$;
Initialize $w_0^a = 1$ for all $a \in [K]$;
for $t = 1, 2, \dots$ do
    Let $p_t^a = w_{t-1}^a / \sum_b w_{t-1}^b$ for $a \in [K]$;
    Draw an action $A_t \in [K]$ according to the distribution $p_t$ and play it;
    Observe feedback $(s, \ell_s^{A_s})$ for all $s : s + d_s = t$ and construct estimators $\hat\ell_s^a = \frac{\ell_s^a \mathbf{1}(a = A_s)}{p_s^a}$;
    Update $w_t^a = w_{t-1}^a \exp\big({-\eta'} \sum_{s : s + d_s = t} \hat\ell_s^a\big)$;
end

The following theorem provides a regret bound for Algorithm 1. The bound is a generalization of a similar bound in Cesa-Bianchi et al. [2016].

Theorem 1. Under the assumption that an upper bound on the delays $d_{\max}$ is known, the regret of Algorithm 1 with a learning rate $\eta$ against an oblivious adversary satisfies
$$R_T \le \max\left\{\frac{\ln K}{\eta},\ 4ed_{\max}\ln K\right\} + \eta\left(\frac{KTe}{2} + D\right),$$
where $D = \sum_{t=1}^T d_t$. In particular, if $T$ and $D$ are known and $\eta = \sqrt{\frac{\ln K}{KTe/2 + D}} \le \frac{1}{4ed_{\max}}$, we have
$$R_T \le 2\sqrt{\left(\frac{KTe}{2} + D\right)\ln K}. \quad (1)$$

The proof of Theorem 1 is based on proving the stability of the algorithm across rounds; it is sketched in Section 5. As Theorem 1 shows, Algorithm 1 performs well if $d_{\max}$ is small and we also have preliminary knowledge of $d_{\max}$, $T$, and $D$. However, a single delay of order $T$ increases $d_{\max}$ up to order $T$, which turns Theorem 1 into a linear regret bound. This undesired property is addressed by the skipping scheme presented next.

3.2 Skipping scheme

We introduce a wrapper for Algorithm 1, called Skipper, which disregards feedback from rounds with excessively large delays. The regret in the skipped rounds is trivially bounded by 1 (because the losses are assumed to be in $[0,1]$), and these rounds are taken out of the analysis of the regret of DEW. Skipper operates with an externally provided threshold $\tau$ and skips all rounds where $d_t \ge \tau$. The advantage of skipping is that it provides a natural upper bound on the delays for the subset of rounds processed by DEW, namely $d_{\max} = \tau$. Thus, we eliminate the need for knowledge of the maximal delay in the original problem. The cost of skipping is the number of skipped rounds, denoted by $|S|$, as captured in Lemma 2. Below we provide a regret bound for the combination of Skipper and DEW (a code sketch of the combined procedure follows Corollary 4).

Algorithm 2: Skipper
Input: Threshold $\tau$; Algorithm $\mathcal{A}$.
for $t = 1, 2, \dots$ do
    Get prediction $A_t$ from $\mathcal{A}$ and play it;
    Observe feedback $(s, \ell_s^{A_s})$ for all $s : s + d_s = t$, and feed it to $\mathcal{A}$ for each $s$ with $d_s < \tau$;
end

Lemma 2. The expected regret of Skipper with base algorithm $\mathcal{A}$ and threshold parameter $\tau$ satisfies
$$R_T \le |S| + R_{[T]\setminus S}, \quad (2)$$
where $|S|$ is the number of skipped rounds (those for which $d_t \ge \tau$) and $R_{[T]\setminus S}$ is a regret bound for running $\mathcal{A}$ on the subset of rounds $[T]\setminus S$ (those for which $d_t < \tau$).

A proof of the lemma is found in Appendix C. When combined with the previous analysis for DEW, Lemma 2 gives us the following regret bound.

Theorem 3. The expected regret of Skipper($\tau$, DEW($\eta$, $\tau$)) against an oblivious adversary satisfies
$$R_T \le |S| + \max\left\{\frac{\ln K}{\eta},\ 4e\tau\ln K\right\} + \eta\left(\frac{KTe}{2} + D^\tau\right), \quad (3)$$
where $D^\tau = \sum_{t \notin S} d_t$ is the cumulative delay experienced by DEW.

Proof. Theorem 1 holds with parameters $(\eta, \tau)$ for DEW run under Skipper. We then apply Lemma 2.

Corollary 4. Assume that $T$ and $D$ are known and take
$$\eta = \frac{1}{4e\tau}, \qquad \tau = \sqrt{\frac{\frac{eKT/2 + D}{4e} + D}{4e\ln K}}.$$
Then the expected regret of Skipper($\tau$, DEW($\eta$, $\tau$)) against an oblivious adversary satisfies
$$R_T \le 2\sqrt{\left(\frac{KTe}{2} + (1 + 4e)D\right)\ln K}.$$

Proof. Note that $D \ge |S|\tau$, so $|S| \le D/\tau$. By substituting this into (3), observing that $D^\tau \le D$, and substituting the values of $\eta$ and $\tau$ we obtain the result.
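To make the combined procedure concrete, here is a minimal Python sketch of Skipper wrapped around DEW. It is an illustration of the pseudocode above, not the authors' implementation; the environment callback `loss_and_delay` and all names are our own assumptions.

```python
import numpy as np

def skipper_dew(T, K, tau, eta, loss_and_delay, rng=None):
    """Sketch of Skipper(tau, DEW(eta, tau)): Algorithms 1 and 2 combined.

    `loss_and_delay(t, a) -> (loss in [0, 1], delay d_t)` is a hypothetical
    stand-in for the oblivious adversarial environment.
    """
    rng = np.random.default_rng() if rng is None else rng
    eta_p = min(eta, 1.0 / (4 * np.e * tau))  # truncated rate eta'; d_max = tau under Skipper
    log_w = np.zeros(K)                       # log-weights: log w_0^a = 0 for all a
    pending = {}                              # arrival round -> list of (A_s, loss_s, p_s^{A_s})
    total_loss = 0.0
    for t in range(1, T + 1):
        p = np.exp(log_w - log_w.max())       # p_t^a = w_{t-1}^a / sum_b w_{t-1}^b
        p /= p.sum()
        a_t = rng.choice(K, p=p)              # draw A_t from p_t and play it
        loss, d = loss_and_delay(t, a_t)
        total_loss += loss
        if d < tau:                           # Skipper: rounds with d_t >= tau never reach DEW
            pending.setdefault(t + d, []).append((a_t, loss, p[a_t]))
        for a_s, loss_s, p_s in pending.pop(t, []):  # feedback with s + d_s = t arrives now
            lhat = loss_s / p_s               # importance-weighted estimator, nonzero only at A_s
            log_w[a_s] -= eta_p * lhat        # delayed exponential-weights update
    return total_loss
```

The weights are maintained in log space for numerical stability; this does not change the distribution $p_t$ prescribed by Algorithm 1.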
Note that Corollary 4 recovers the regret scaling of Theorem 1, equation (1), within constant factors in front of $D$, without requiring knowledge of $d_{\max}$. Like Theorem 1, Corollary 4 is tight in the worst case. The tuning of $\tau$ still requires knowledge of $T$ and $D$; in the next section we get rid of this requirement.

4 Delay available at action time: Oracle tuning and results

This section deals with the second setting, where the delays are observed before taking an action. The combined algorithm introduced in the previous section relies on prior knowledge of $T$ and $D$ for tuning its parameters. In this section we eliminate this requirement by leveraging the added information about the delays at action time. The information is used in an implicit doubling scheme for tuning Skipper's threshold parameter. Additionally, the new bound scales with the experienced delay $D^\tau$ rather than the full delay $D$, and is significantly tighter when $D^\tau \ll D$. This is achieved through direct optimization of the regret bound in terms of $|S|$ and $D^\tau$, as opposed to Corollary 4, which tunes $\tau$ using the potentially loose inequality $|S| \le D/\tau$.

4.1 Setup

Let $m$ index the epochs of the doubling scheme (a code sketch of the epoch logic appears at the end of Section 4.2). In each epoch we restart the algorithm with new parameters and continually monitor the termination condition in equation (6). The learning rate within epoch $m$ is set to $\eta_m = \frac{1}{4e\tau_m}$, where $\tau_m$ is the threshold parameter of the epoch. Theorem 3 provides a regret bound for epoch $m$, denoted by
$$\mathrm{Bound}_m(\tau_m) := |S_m^{\tau_m}| + 4e\tau_m\ln K + \frac{T_m eK/2 + D_m^{\tau_m}}{4e\tau_m}, \quad (4)$$
where $T_m$ denotes the length of epoch $m$, and $|S_m^{\tau_m}|$ and $D_m^{\tau_m}$ are, respectively, the number of skipped rounds and the experienced delay within epoch $m$. Let $B_m = 2^m$. In epoch $m$ we set
$$\tau_m = \frac{B_m}{4e\ln K} \quad (5)$$
and we stay in epoch $m$ as long as the following condition holds:
$$\max\left\{\frac{|S_m^{\tau_m}|}{2},\ \sqrt{\left(\frac{eKT_m}{2} + D_m^{\tau_m}\right)\ln K}\right\} \le B_m. \quad (6)$$

Since $d_t$ is observed at the beginning of round $t$, we are able to evaluate condition (6) and start a new epoch before making the selection of $A_t$. This provides the desired tuning of $\tau_m$ for all rounds without the need of a separate treatment of epoch transition points. While more elaborate, this doubling scheme maintains the intuition of standard approaches. First, the doubling condition (6) ensures that the regret bound in each epoch is optimized by explicitly balancing the contribution of each term in equation (4). Second, the geometric progression of the tuning (5), and thus of the resulting regret bounds, means that the total regret bound summed over the epochs can be bounded in terms of the bound in the final completed epoch. In the following we refer to the doubling scheme defined by (5) and (6) as Doubling.

4.2 Results

The following results show that the proposed doubling scheme works as well as oracle tuning of $\tau$ when the learning rate is fixed at $\eta = 1/(4e\tau)$. We first compare our performance to the optimal tuning in a single epoch, where we let
$$\bar\tau_m = \operatorname*{argmin}_{\tau_m}\ \mathrm{Bound}_m(\tau_m) \quad (7)$$
be the minimizer of (4).

Lemma 5. The regret bound (4) for any non-final epoch $m$, with the epochs and $\tau_m$ controlled by Doubling, satisfies
$$\mathrm{Bound}_m(\tau_m) \le 3B_m \le 3\,\mathrm{Bound}_m(\bar\tau_m) + 2e^2K\ln K + 1. \quad (8)$$

The lemma is the main machinery of the analysis of Doubling, and its proof is provided in Appendix C. Applying it to Skipper($\tau$, DEW($\eta$, $\tau$)) leads to the following main result.

Theorem 6. The expected regret of Skipper($\tau$, DEW($\eta$, $\tau$)) tuned by Doubling satisfies, for any $T$,
$$R_T \le 15\min_\tau\left\{|S^\tau| + 4e\tau\ln K + \frac{KT + D^\tau}{4e\tau}\right\} + 10e^2K\ln K + 5.$$

The proof of Theorem 6 is based on Lemma 5 and is provided in Appendix C.

Corollary 7. The expected regret of Skipper($\tau$, DEW($\eta$, $\tau$)) tuned by Doubling can be relaxed, for any $T$, to
$$R_T \le 30\sqrt{\left(\frac{KTe}{2} + (1 + 4e)D\right)\ln K} + 10e^2K\ln K + 5. \quad (9)$$

Proof. The first term in the bound of Theorem 6 can be directly bounded using Corollary 4.

Note that both Theorem 6 and Corollary 7 require no knowledge of $T$ and $D$.
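The epoch logic of Doubling can be stated compactly in code. Below is a minimal sketch under our reading of the tuning (5) as $\tau_m = B_m/(4e\ln K)$: it only tracks the epoch statistics and tests condition (6) before each action; the restart of Skipper($\tau_m$, DEW($\eta_m$, $\tau_m$)) is left abstract, and all names are illustrative.

```python
import math

def doubling_epochs(K, delays):
    """Sketch of the Doubling scheme, equations (5)-(6); names are ours.

    `delays` yields d_1, d_2, ... (observed at action time). Yields the
    epoch index m and threshold tau_m in force when each round is played.
    """
    m = 1
    B = 2.0 ** m
    tau = B / (4 * math.e * math.log(K))      # tuning (5); eta_m = 1 / (4 e tau_m)
    skipped, length, D_exp = 0, 0, 0.0        # |S_m^{tau_m}|, T_m, D_m^{tau_m}
    for d_t in delays:
        # Account for round t, then test condition (6) before acting.
        length += 1
        if d_t >= tau:
            skipped += 1
        else:
            D_exp += d_t
        while max(skipped / 2,
                  math.sqrt((math.e * K * length / 2 + D_exp) * math.log(K))) > B:
            m += 1                            # condition (6) violated: open epoch m+1
            B = 2.0 ** m
            tau = B / (4 * math.e * math.log(K))
            # round t is re-accounted as the first round of the new epoch
            length = 1
            skipped = 1 if d_t >= tau else 0
            D_exp = 0.0 if d_t >= tau else d_t
        yield m, tau
```

Because $d_t$ is available before $A_t$ is drawn, the test runs ahead of the action, matching the requirement that the epoch switch happens before the selection of $A_t$.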
4.3 Comparison of the oracle and explicit bounds

We finish the section with a comparison of the oracle bound in Theorem 6 and the explicit bound in Corollary 7. Ignoring the constant and additive terms, the bounds are
$$\text{explicit: } O\Big(\sqrt{(KT + D)\ln K}\Big), \qquad \text{oracle: } O\Big(\min_\tau\Big\{|S^\tau| + \tau\ln K + \frac{KT + D^\tau}{\tau}\Big\}\Big).$$
Note that the oracle bound is always as strong as the explicit bound. There are, however, cases where it is much tighter. Consider the following example.

Example 8. For $t \le \sqrt{KT/\ln K}$ let $d_t = T - t$, and for $t > \sqrt{KT/\ln K}$ let $d_t = 0$. Take $\tau = \sqrt{KT/\ln K}$. Then $D = \Theta\big(T\sqrt{KT/\ln K}\big)$, but $D^\tau = 0$ (assuming that $T \ge K\ln K$) and $|S| \le \sqrt{KT/\ln K}$. The corresponding regret bounds are
$$\text{explicit: } O\Big(\sqrt{KT\ln K + T\sqrt{KT\ln K}}\Big) = O\big(T^{3/4}\big), \qquad \text{oracle: } O\Big(\sqrt{KT\ln K}\Big) = O\big(T^{1/2}\big).$$
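A quick numerical check of Example 8 illustrates the gap; the concrete values of $K$ and $T$ below are ours, and constants are ignored exactly as in the example.

```python
import math

K, T = 10, 10**6
cut = math.sqrt(K * T / math.log(K))   # rounds t <= cut have d_t = T - t; later rounds have d_t = 0
tau = cut                              # the oracle threshold used in Example 8

D = sum(T - t for t in range(1, int(cut) + 1))   # full delay, Theta(T * sqrt(KT / ln K))
S = int(cut)                                     # every delayed round is skipped, so D^tau = 0

explicit = math.sqrt((K * T + D) * math.log(K))  # O(sqrt((KT + D) ln K)) ~ T^(3/4)
oracle = S + tau * math.log(K) + K * T / tau     # |S| + tau ln K + (KT + D^tau)/tau ~ T^(1/2)
print(f"D = {D:.2e}  explicit ~ {explicit:.2e}  oracle ~ {oracle:.2e}")
# prints roughly: D = 2.08e+09  explicit ~ 6.9e+04  oracle ~ 1.2e+04
```

Even at this modest horizon the oracle quantity is several times smaller, and the ratio grows as $T^{1/4}$.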
5 Analysis of Algorithm 1

This section contains the main points of the analysis of Algorithm 1 leading to the proof of Theorem 1, which were postponed from Section 3. Full proofs are found in Appendix B. The analysis is a generalization of the analysis of delayed Exp3 in Cesa-Bianchi et al. [2016], and consists of a general regret analysis and two stability lemmas.

5.1 Additional notation

We let $N_t = |\{s : s + d_s \in [t, t + d_t)\}|$ denote the stability-span of $t$, which is the amount of feedback that arrives between playing action $A_t$ and observing its feedback. Note that, letting $N = \max_t N_t$, we have $N \le 2\max_t d_t \le 2d_{\max}$, since this may include feedback from up to $\max_s d_s$ rounds prior to round $t$ and up to $d_t$ rounds after round $t$.

We introduce $Z = (z_1, \dots, z_T)$, a permutation of $[T] = \{1, \dots, T\}$ sorted in ascending order according to the value of $z + d_z$ with ties broken randomly, and let $\sigma_i = (z_1, \dots, z_i)$ be its first $i$ elements. Similarly, we introduce $Z'_t = (z'_1, \dots, z'_{N_t})$ as an enumeration of $\{s : s + d_s \in [t, t + d_t)\}$. For a subset $C$ of the integers, corresponding to time steps, we also introduce
$$q^a(C) = \frac{\exp\left(-\eta' \sum_{s \in C} \hat\ell_s^a\right)}{\sum_b \exp\left(-\eta' \sum_{s \in C} \hat\ell_s^b\right)}. \quad (10)$$
The numerator and denominator in the above expression will also be denoted by $w^a(C)$ and $W(C)$, in correspondence with the definition of $p_t^a$. By finally letting $C_{t-1} = \{s : s + d_s < t\}$, we have $p_t^a = q^a(C_{t-1})$.

5.2 Analysis of delayed exponential weights

The starting point is the following modification of the basic lemma of the Exp3 analysis, which takes care of delayed updates of the weights.

Lemma 9. Algorithm 1 satisfies
$$\sum_{t=1}^T \sum_{a=1}^K p_{t+d_t}^a \hat\ell_t^a - \min_{a \in [K]} \sum_t \hat\ell_t^a \le \frac{\ln K}{\eta'} + \frac{\eta'}{2} \sum_{t=1}^T \sum_{a=1}^K p_{t+d_t}^a \big(\hat\ell_t^a\big)^2. \quad (11)$$

To make use of Lemma 9, we need to figure out the relationship between $p_{t+d_t}^a$ and $p_t^a$. This is achieved by the following two lemmas, which are generalizations and refinements of Lemmas 1 and 2 in Cesa-Bianchi et al. [2016].

Lemma 10. When using Algorithm 1, the resulting probabilities fulfil, for every $t$ and $a$,
$$p_{t+d_t}^a \ge p_t^a - \eta' \sum_{i=1}^{N_t} q^a\big(C_{t-1} \cup \{z'_j : j \le i\}\big)\, \hat\ell_{z'_i}^a, \quad (12)$$
where $(z'_j)$ is an enumeration of $\{s : s + d_s \in [t, t + d_t)\}$.

The above lemma allows us to bound $p_{t+d_t}^a$ from below in terms of $p_t^a$. We similarly need to be able to upper bound the probability drift, which is captured in the second probability drift lemma.

Lemma 11. The probabilities defined by (10) satisfy, for any $i$,
$$q^a(\sigma_i) \le \left(1 + \frac{1}{2N - 1}\right) q^a(\sigma_{i-1}). \quad (13)$$

5.3 Proof sketch of Theorem 1

By using Lemma 10 to bound the left-hand side of (11), we have
$$\sum_t \sum_a p_t^a \hat\ell_t^a - \min_a \sum_t \hat\ell_t^a \le \frac{\ln K}{\eta'} + \frac{\eta'}{2} \sum_{t=1}^T \sum_{a=1}^K p_{t+d_t}^a \big(\hat\ell_t^a\big)^2 + \eta' \sum_t \sum_a \hat\ell_t^a \sum_{i=1}^{N_t} q^a\big(C_{t-1} \cup \{z'_j : j \le i\}\big)\, \hat\ell_{z'_i}^a.$$
Repeated use of Lemma 11 bounds the second term on the right-hand side by $\eta' TKe/2$ in expectation. The third term on the right-hand side can be bounded by $\eta' D$. Taking the maximum over the two possible values of the truncated learning rate finishes the proof.

6 Discussion

We have presented an algorithm for multiarmed bandits with variably delayed feedback, which achieves the $O\big(\sqrt{(KT + D)\ln K}\big)$ regret bound conjectured by Cesa-Bianchi et al. [2016]. The algorithm is based on a procedure for skipping rounds with excessively large delays and a refined analysis of the exponential weights algorithm with delayed observations. At the moment the skipping procedure requires prior knowledge of $T$ and $D$ for tuning the skipping threshold. However, if the delay information is available at action time, as in the examples described in the introduction, we provide a sophisticated doubling scheme for tuning the skipping threshold that requires no prior knowledge of $T$ and $D$. Furthermore, the refined tuning also leads to a refined regret bound of order $O\big(\min_\tau\big\{|S^\tau| + \tau\ln K + \frac{KT + D^\tau}{\tau}\big\}\big)$, which is polynomially tighter when $D^\tau \ll D$. We provide an example of such a problem in the paper.

Our work leads to a number of interesting research questions. The main one is whether the two regret bounds are achievable when the delays are available only at observation time, without prior knowledge of $D$ and $T$. Alternatively, is it possible to derive lower bounds demonstrating the impossibility of further relaxation of the assumptions? More generally, it would be interesting to have refined lower bounds for problems with variably delayed feedback. Another interesting direction is the design of anytime algorithms, which do not rely on the doubling trick. Such algorithms can be used, for example, for achieving simultaneous optimality in stochastic and adversarial setups [Zimmert and Seldin, 2019a]. While a variety of anytime algorithms is available for non-delayed bandits, the extension to delayed feedback does not seem trivial. Some of these questions are addressed in a follow-up work by Zimmert and Seldin [2019b].

Acknowledgments

Nicolò Cesa-Bianchi gratefully acknowledges partial support by the Google Focused Award "Algorithms and Learning for AI" (ALL4AI) and by the MIUR PRIN grant "Algorithms, Games, and Digital Markets" (ALGADIMAR). Yevgeny Seldin acknowledges partial support by the Independent Research Fund Denmark, grant number 9040-00361B.

References

Sakshi Arya and Yuhong Yang. Randomized allocation with nonparametric estimation for contextual multi-armed bandits with delayed rewards. arXiv preprint, arXiv:1902.00819, 2019.

Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 49, 2016.

Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Nonstochastic bandits with composite anonymous feedback. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2018.

Olivier Chapelle. Modeling delayed feedback in display advertising. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), 2014.

Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NeurIPS), 2011.

Miroslav Dudík, Daniel J. Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2011.

Siddhant Garg and Aditya Kumar Akash. Stochastic bandits with delayed composite anonymous feedback. arXiv preprint, arXiv:1910.01161, 2019.
Scott Garrabrant, Nate Soares, and Jessica Taylor. Asymptotic convergence in online learning with unbounded delays. arXiv preprint, arXiv:1604.05280, 2016.

Avishek Ghosh and Kannan Ramchandran. Online scoring with delayed information: A convex optimization viewpoint. In Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2018.

Pooria Joulani, András György, and Csaba Szepesvári. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Bingcong Li, Tianyi Chen, and Georgios B. Giannakis. Bandit online learning with unknown delays. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

Travis Mandel, Yun-En Liu, Emma Brunskill, and Zoran Popović. The queue method: Handling delay, heuristics, prior data, and evaluation in bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

Timothy A. Mann, Sven Gowal, Ray Jiang, Huiyi Hu, Balaji Lakshminarayanan, and András György. Learning from delayed outcomes with intermediate observations.
