
Beyond Confidence Regions: Tight Bayesian Ambiguity Sets for Robust MDPs

Reazul Hasan Russel
Department of Computer Science
University of New Hampshire

Marek Petrik
Department of Computer Science
University of New Hampshire

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Robust MDPs (RMDPs) can be used to compute policies with provable worst-case guarantees in reinforcement learning. The quality and robustness of an RMDP solution are determined by the ambiguity set (the set of plausible transition probabilities), which is usually constructed as a multi-dimensional confidence region. Existing methods construct ambiguity sets as confidence regions using concentration inequalities, which leads to overly conservative solutions. This paper proposes a new paradigm that can achieve better solutions with the same robustness guarantees without using confidence regions as ambiguity sets. To incorporate prior knowledge, our algorithms optimize the size and position of ambiguity sets using Bayesian inference. Our theoretical analysis shows the safety of the proposed method, and the empirical results demonstrate its practical promise.

1 Introduction

Markov decision processes (MDPs) provide a versatile framework for modeling reinforcement learning problems [4, 33, 38]. However, they assume that transition probabilities and rewards are known exactly, which is rarely the case. Limited data sets, modeling errors, value function approximation, and noisy data are common sources of error in transition probabilities [16, 30, 45]. The result is policies that are brittle and fail when implemented. This is particularly true in the case of batch reinforcement learning [18, 20, 23, 32, 42].

A promising framework for computing robust policies is based on Robust MDPs (RMDPs). RMDPs relax the need for precisely known transition probabilities. Instead, transition probabilities can take on any value from a so-called ambiguity set, which represents a set of plausible transition probabilities [9, 14, 24, 29, 32, 40, 46, 47]. RMDPs are also reminiscent of dynamic zero-sum games: the decision maker chooses the best actions, while the adversarial nature chooses the worst transition probabilities from the ambiguity set.

The practical utility of RMDPs has been hindered by the lack of good ways of constructing ambiguity sets that lead to solutions that are robust without being too conservative. The standard approach of constructing ambiguity sets from concentration inequalities [1, 32, 42, 44] provides theoretical guarantees but yields solutions that are hopelessly conservative. Many problem-specific methods have been proposed as well, but they are hard to use and typically lack finite-sample guarantees [3, 5, 16, 28].

The main contribution of this work is a new method for constructing ambiguity sets that are both significantly less conservative than existing ones [21, 32, 42] and provide strong finite-sample guarantees. Similarly to some prior work on robust reinforcement learning and optimization, we make Bayesian assumptions to take advantage of domain knowledge, which is often available [7, 8, 13, 47]. Our main innovation is to realize that the natural approach of building ambiguity sets as confidence intervals is unnecessarily conservative. Surprisingly, in the Bayesian setting, using a 95% confidence region for the transition probabilities is unnecessarily conservative for achieving 95% confidence in the robustness of the solution. We also derive new L1 concentration inequalities of possible independent interest.

The remainder of the paper is organized as follows. Section 2 formally describes the framework and goals of the paper. Section 3 describes our main contribution, RSVF, a new method for constructing tight ambiguity sets from Bayesian models that are adapted to the optimal policy. We provide theoretical justification for the robustness of RSVF, but a detailed theoretical analysis of its performance guarantees is beyond the scope of this work. Section 4 then surveys related work and outlines methods that build ambiguity sets as frequentist confidence regions or Bayesian credible sets. Finally, Section 5 presents empirical results on several problem domains.

2 Problem Statement: Data-driven RMDPs

This section formalizes our goals and reviews relevant results for robust Markov decision processes (RMDPs). Throughout the paper, we use the symbol Δ^S to denote the probability simplex in R^S_+. The symbols 1 and 0 denote vectors of all ones and all zeros, respectively, of an appropriate size. The symbol I represents the identity matrix.

2.1 Safe Return Estimate: VaR

The underlying reinforcement learning problem is a Markov decision process with states S = {1, ..., S} and actions A = {1, ..., A}.

The rewards r : S × A → R are known, but the true transition probabilities P* : S × A → Δ^S are unknown. The transition probability vector for a state s and an action a is denoted by p*_{s,a}. As this is a batch reinforcement learning setting, a fixed dataset D of transition samples is provided: D = (s_i ∈ S, a_i ∈ A, s'_i ∈ S), i = 1, ..., m. The only assumption about D is that the state s' in (s, a, s') is distributed according to the true transition probabilities, s' ~ P*(s, a, ·); no assumptions are made about the sampling policy. Note that in the Bayesian approach, P* is a random variable, and we assume that a prior distribution over it is available.

The objective is to maximize the standard γ-discounted infinite-horizon return [33]. Because this paper analyzes the impact of using different transition probabilities, we use a subscript to indicate which ones are used. The optimal value function for some transition probabilities P is, therefore, denoted as v*_P : S → R, and the value function for a deterministic policy π : S → A is denoted as v^π_P. The set of all deterministic stationary policies is denoted by Π. The total return ρ(π, P) of a policy π under transition probabilities P is
  ρ(π, P) = p_0^T v^π_P ,
where p_0 is the initial distribution.

Ideally, we would compute a policy π : S → A that maximizes the return ρ(π, P*), but P* is unknown, and ignoring the uncertainty in P* completely leads to brittle policies. Instead, a common objective in robust reinforcement learning is to maximize a plausible lower bound on the return. Having a safe return estimate is important because it can inform the stakeholder that the policy may not be good enough when deployed. The objective of computing a policy π that maximizes a high-confidence lower bound on the return can be expressed as [8, 21, 31, 42]:
  max_{π ∈ Π} VaR^δ_{P*} [ρ(π, P*)] ,   (1)
where VaR^δ is the popular value-at-risk measure at a risk level δ [35]. This objective is also sometimes known as percentile optimization [8]. It is important to note that the risk metric is applied over possible values of the uncertain parameter and not over the distribution of returns. For example, if VaR^{0.05}_{P*} [ρ(π, P*)] = 1, then for 5% of the uncertain transition probabilities P*, the return is 1 or smaller.

Because solving the optimization problem in (1) is NP-hard [8], we instead maximize a lower bound ρ̃(π). We call this lower bound a safe return estimate, defined as follows.

Definition 2.1 (Safe Return Estimate). The estimate ρ̃ : Π → R of the return is called safe for a policy π with probability 1 − δ if ρ̃(π) ≤ VaR^δ_{P*} [ρ(π, P*)], or, in other words, if it satisfies:
  P_{P*} [ ρ̃(π) ≤ ρ(π, P*) | D ] ≥ 1 − δ .

Recall that under Bayesian assumptions, P* is a random variable and the guarantees are conditional on the dataset D. This is different from the frequentist approach, in which the random variable is D and the guarantees are conditional on P*.
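Objective (1) applies the value-at-risk over the posterior distribution of the model parameters, which can be approximated by sampling transition models and evaluating the policy under each one. The sketch below is our own illustration (the two-state MDP, rewards, and Dirichlet posterior counts are invented for the example, not taken from the paper):

```python
import numpy as np

def policy_return(P_pi, r_pi, p0, gamma=0.95):
    """rho(pi, P) = p0^T v_pi with v_pi = (I - gamma * P_pi)^{-1} r_pi."""
    v = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
    return p0 @ v

rng = np.random.default_rng(0)
delta = 0.05
r_pi = np.array([1.0, 0.0])          # rewards of the fixed policy in each state
p0 = np.array([1.0, 0.0])            # initial distribution
# hypothetical Dirichlet posteriors over the rows of the policy's transition matrix
alphas = [np.array([8.0, 2.0]), np.array([3.0, 7.0])]

returns = []
for _ in range(5000):
    P_pi = np.vstack([rng.dirichlet(a) for a in alphas])   # one sample of P*
    returns.append(policy_return(P_pi, r_pi, p0))

# VaR^delta over the parameter uncertainty: 5% of the sampled models do worse
var_delta = np.quantile(returns, delta)
print(var_delta)
```

A safe return estimate ρ̃ must lower-bound this quantile with posterior probability 1 − δ; the RMDP machinery described below is a tractable way to obtain such a bound.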

The relative merits of Bayesian versus frequentist approaches to robust optimization have been discussed in earlier work [8, 47]; we emphasize that each approach presents a different set of advantages. An insightful discussion of the differences between the two approaches can be found, for example, in Sections 5.2.2 and 6.1.1 of Murphy (2012). The following example will be used throughout the paper to demonstrate the proposed methods and to visualize simple ambiguity sets.

Example 2.1. Consider an MDP with 3 states, s1, s2, s3, and a single action a1. Assume that the true, but unknown, transition probability is P*(s1, a1, ·) = (0.3, 0.2, 0.5). The known prior distribution over p*_{s1,a1} is Dirichlet with concentration parameters α = (1, 1, 1). The dataset D comprises 3 occurrences of the transition (s1, a1, s1), 2 of the transition (s1, a1, s2), and 5 of the transition (s1, a1, s3). The posterior distribution over p*_{s1,a1} is then also Dirichlet, with α = (4, 3, 6). Note that this is a probability distribution over transition probability distributions. Fig. 1 depicts the posterior distribution projected onto the probability simplex, along with a 90% confidence region centered on the posterior mean.
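The posterior in Example 2.1 follows from Dirichlet-multinomial conjugacy: the observed transition counts are simply added to the prior concentration parameters. A minimal sketch with the numbers from the example:

```python
import numpy as np

prior_alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior over p*_{s1,a1}
counts = np.array([3, 2, 5])              # transitions from (s1, a1) observed in D

posterior_alpha = prior_alpha + counts    # conjugacy: Dirichlet(4, 3, 6)
posterior_mean = posterior_alpha / posterior_alpha.sum()

print(posterior_alpha)    # [4. 3. 6.]
print(posterior_mean)     # (4/13, 3/13, 6/13), roughly (0.308, 0.231, 0.462)

# samples from the posterior are themselves transition probability vectors
rng = np.random.default_rng(1)
p_samples = rng.dirichlet(posterior_alpha, size=1000)
assert np.allclose(p_samples.sum(axis=1), 1.0)
```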

2.2 Robust MDPs

Robust Markov decision processes (RMDPs) are a convenient and tractable model that generalizes MDPs. We use RMDPs to maximize a tractable lower bound on the VaR objective in (1) and to compute a safe return estimate. Our RMDP model has the same states S, actions A, and rewards r_{s,a} as the MDP. The transition probabilities for each state s and action a, denoted p_{s,a} ∈ Δ^S, are assumed to be chosen adversarially from an ambiguity set P_{s,a}. We use P to refer cumulatively to P_{s,a} for all states s and actions a. We restrict our attention to sa-rectangular ambiguity sets, which allow the adversarial nature to choose the worst transition probability independently for each state and action [22, 45]. The limitations of rectangular ambiguity sets are well known [12, 25, 43], but they represent a simple, tractable, and practical model.

A convenient way of defining ambiguity sets is by a norm-distance from a given nominal transition probability p̄_{s,a}:
  P_{s,a} = { p ∈ Δ^S : ||p − p̄_{s,a}||_1 ≤ ψ_{s,a} } ,   (2)
for a given ψ_{s,a} ≥ 0 and a nominal point p̄_{s,a}. We focus on ambiguity sets defined by the L1 norm because they give rise to RMDPs that can be solved very efficiently [15].
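For L1 sets of the form (2), the inner minimization min{ p^T v : p ∈ Δ^S, ||p − p̄_{s,a}||_1 ≤ ψ_{s,a} } has a simple greedy solution: move up to ψ/2 probability mass from the highest-value states onto the lowest-value state. This fast solvability is what reference [15] exploits; the implementation below is our own sketch of the standard procedure, using the nominal point from Example 2.1:

```python
import numpy as np

def worst_case_l1(p_bar, v, psi):
    """min p^T v over {p in simplex : ||p - p_bar||_1 <= psi}, by shifting
    at most psi/2 probability mass onto the state with the smallest value."""
    p = np.asarray(p_bar, dtype=float).copy()
    lo = np.argmin(v)
    budget = min(psi / 2.0, 1.0 - p[lo])   # mass we can add to the cheapest state
    p[lo] += budget
    for i in np.argsort(v)[::-1]:          # drain the most expensive states first
        if budget <= 0:
            break
        if i != lo:
            take = min(p[i], budget)
            p[i] -= take
            budget -= take
    return p

p_bar = np.array([0.3, 0.2, 0.5])   # nominal point from Example 2.1
v = np.array([1.0, 0.0, 0.0])
p = worst_case_l1(p_bar, v, psi=0.4)
print(p, p @ v)    # worst-case p = [0.1, 0.4, 0.5]; p^T v drops from 0.3 to 0.1
```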

RMDPs have properties that are similar to those of regular MDPs (see, for example, [2, 19, 22, 28, 45]). The robust Bellman operator T̂_P for an ambiguity set P computes, for a state s, the best action with respect to the worst-case realization of the transition probabilities:
  (T̂_P v)(s) := max_{a ∈ A} min_{p ∈ P_{s,a}} ( r_{s,a} + γ p^T v ) .   (3)
The symbol T̂^π_P denotes the robust Bellman update for a given stationary policy π. The optimal robust value function v̂* and the robust value function v̂^π for a policy π are unique and must, similarly to MDPs, satisfy v̂* = T̂_P v̂* and v̂^π = T̂^π_P v̂^π. In general, we use a hat to denote quantities of the RMDP and omit it for the MDP. When the ambiguity set P is not obvious from the context, we use it as a subscript, as in v̂*_P. The robust return ρ̂ is defined as [16]:
  ρ̂(π, P) = min_{P ∈ P} ρ(π, P) = p_0^T v̂^π_P ,
where p_0 ∈ Δ^S is the initial distribution. In the remainder of the paper, we describe methods that construct P from D in order to guarantee that ρ̂ is a tight lower bound on the VaR^δ of the return.
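The robust Bellman operator (3) can be applied by repeated sweeps, with the inner minimization solved by the greedy L1 worst-case response. A self-contained robust value iteration sketch follows; the two-state, two-action RMDP (rewards, nominal transitions, and budgets) is invented for the illustration, not taken from the paper:

```python
import numpy as np

def worst_case_l1(p_bar, v, psi):
    """min p^T v over {p in simplex : ||p - p_bar||_1 <= psi} (greedy mass shift)."""
    p = np.asarray(p_bar, dtype=float).copy()
    lo = np.argmin(v)
    budget = min(psi / 2.0, 1.0 - p[lo])
    p[lo] += budget
    for i in np.argsort(v)[::-1]:
        if budget <= 0:
            break
        if i != lo:
            take = min(p[i], budget)
            p[i] -= take
            budget -= take
    return p

def robust_value_iteration(p_bar, r, psi, gamma=0.9, iters=500):
    """Repeatedly apply the robust Bellman operator of eq. (3)."""
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                p = worst_case_l1(p_bar[s, a], v, psi[s, a])  # adversarial response
                q[s, a] = r[s, a] + gamma * (p @ v)
        v = q.max(axis=1)   # maximize over actions, as in eq. (3)
    return v

# hypothetical two-state, two-action RMDP
p_bar = np.array([[[0.9, 0.1], [0.5, 0.5]],
                  [[0.2, 0.8], [0.7, 0.3]]])
r = np.array([[1.0, 0.8],
              [0.0, 0.2]])
v_robust = robust_value_iteration(p_bar, r, psi=np.full((2, 2), 0.2))
v_nominal = robust_value_iteration(p_bar, r, psi=np.zeros((2, 2)))
print(v_robust, v_nominal)   # robust values are never larger than nominal ones
```

With zero budgets the operator reduces to the ordinary Bellman operator, so the comparison above shows how the adversarial choice lowers the value function.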

3 Optimized Bayesian Ambiguity Sets

In this section, we describe the new algorithm for constructing Bayesian ambiguity sets that can compute less conservative lower bounds on the return. RSVF (robustification with sensible value functions) is a Bayesian method that uses samples from the posterior distribution over P* to construct tight ambiguity sets.

[Figure 1: Contours of the posterior distribution and the 90%-confidence region.]

[Figure 2: Optimal Bayesian ambiguity set (red) for a value function v = (0, 0, 1).]

[Figure 3: Sets K_{s1,a1}(v_i) (dashed red) for i = 1, 2, and L_{s1,a1}(v_1, v_2) (black).]

Before describing the algorithm, we use the setting of Example 2.1 to motivate our approach. To minimize distractions by technicalities, assume that the goal is to compute the return for a single time step starting from state s1.

Assume also that the value function v = (1, 0, 0) is known, that all rewards from s1 are 0, and that γ = 1. Recall that our goal is to construct a safe return estimate ρ̃(π) of VaR^{0.1}_{P*} [ρ(π, P*)] at the 90% level. When the value function is known, it is possible to construct the optimal ambiguity set P* such that ρ̃(π) = min_{p ∈ P*} p^T v = VaR^{0.1}_{P*} [ρ(π, P*)]:
  P* = { p ∈ Δ^3 : p^T v ≥ VaR^{0.1}_{P*} [ρ(π, P*)] } .
It can be shown readily that this ambiguity set is optimal in the sense that any set for which ρ̃(π) is exact must be a subset of P* [13]. Fig. 2 depicts the optimal ambiguity set along with an arrow that indicates the direction along which v increases.

The optimal ambiguity set described above cannot be used directly, unfortunately, because the value function is unknown. It would be tempting to construct the ambiguity set as the intersection of the optimal sets for all possible value functions; a polyhedral approximation of this set is shown in blue in Fig. 2. Unfortunately, this approach is not (usually) correct and does not lead to a safe return estimate. This follows from the fact that support functions of convex sets are convex, while VaR is neither convex nor concave [6, 34]; see Gupta (2015) for a more detailed discussion.

Since it is not possible, in general, to simply intersect the optimal ambiguity sets for all possible value functions, we instead approximate the optimal ambiguity set for a few reasonable value functions. For this purpose, we use a set K_{s,a}(v) defined as follows:
  K_{s,a}(v) = { p ∈ Δ^S : p^T v ≤ VaR^{δ'}_{P*} [ (p*_{s,a})^T v ] } ,  where δ' = δ / (S · A) .
The bottom dashed set in Fig. 3 depicts this set K for v = (0, 0, 1) in Example 2.1. The intuition behind this construction is as follows: if an ambiguity set P_{s,a} intersects K_{s,a}(v̂^π_P) for each state s and action a, then the value function v̂^π_P is safe, since any p in the intersection satisfies p^T v ≤ VaR^{δ'}_{P*} [ (p*_{s,a})^T v ]. See Lemma B.2 for the formal statement.

The set K_{s,a}(v) is sufficient when the value function is known, but we need to generalize the approach to unknown value functions. The set L_{s,a}(V) provides such a guarantee for a set of possible value functions (POV) V. Its center is chosen to minimize its size while intersecting K_{s,a}(v) for each v in V; it is constructed as follows:
  L_{s,a}(V) = { p ∈ Δ^S : ||p − ω_{s,a}(V)||_1 ≤ ψ_{s,a}(V) } ,
  ψ_{s,a}(V) = min_{p ∈ Δ^S} f(p) ,  ω_{s,a}(V) ∈ argmin_{p ∈ Δ^S} f(p) ,  f(p) = max_{v ∈ V} min_{q ∈ K_{s,a}(v)} ||q − p||_1 .   (4)
The optimization in (4) can be represented and solved as a linear program. Fig. 3 shows the set L in solid black; it is the smallest L1-constrained set that intersects the two K sets for the value functions v_1 = (0, 0, 1) and v_2 = (2, 1, 0) in Example 2.1.
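The VaR threshold that defines K_{s,a}(v) can be estimated from posterior samples, with the risk level split as δ' = δ/(S · A) across state-action pairs. A sketch using the posterior from Example 2.1 (the Monte Carlo estimation and sample size are our own illustrative choices, not the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, delta = 3, 1, 0.1
delta_prime = delta / (S * A)                  # risk budget per state-action pair

posterior_alpha = np.array([4.0, 3.0, 6.0])    # Dirichlet posterior from Example 2.1
v = np.array([0.0, 0.0, 1.0])                  # candidate value function

# estimate VaR^{delta'}[(p*_{s,a})^T v] by sampling transition vectors
p_samples = rng.dirichlet(posterior_alpha, size=20000)
threshold = np.quantile(p_samples @ v, delta_prime)

# K_{s,a}(v) = {p in simplex : p^T v <= threshold}; membership test:
def in_K(p):
    return p @ v <= threshold

print(threshold)                         # delta'-quantile of p_3 under Dir(4, 3, 6)
print(in_K(np.array([1.0, 0.0, 0.0])))   # p^T v = 0, below the threshold
```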

We are now ready to describe RSVF, which is outlined in Algorithm 1. RSVF takes an optimistic approach to approximating the optimal ambiguity set. It starts with a small set of potential optimal value functions (POV) and constructs an ambiguity set that is safe for these value functions. It keeps increasing the POV set until v̂* is in the set and the policy is safe.

Algorithm 1: RSVF: Adapted Ambiguity Sets
  Input: Confidence 1 − δ and posterior P[P* | D]
  Output: Policy π̂ and lower bound ρ̃(π̂)
  1: k ← 0;
  2: Pick some initial value function v_0;
  3: Initialize POV: V_0 ← ∅;
  4: repeat
  5:   Augment POV: V_{k+1} ← V_k ∪ {v_k};
  6:   For all s, a update P^{k+1}_{s,a} ← L_{s,a}(V_{k+1});
  7:   Solve v_{k+1} ← v̂*_{P^{k+1}} and π_{k+1} ← π̂*_{P^{k+1}};
  8:   k ← k + 1;
  9: until safe for all s, a: K_{s,a}(v_k) ∩ P^k_{s,a} ≠ ∅;
  10: return (π_k, p_0^T v_k)

To simplify the presentation, Algorithm 1 is stated without implementation details. It is not guaranteed to terminate in finite time; the actual implementation switches to BCI, described in Section 4.2, after 100 iterations, which guarantees its termination.
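The control flow of Algorithm 1 can be sketched generically, with the problem-specific constructions passed in as callables. The names `L_set`, `solve_rmdp`, and `K_intersects` are hypothetical placeholders for eq. (4), robust policy optimization, and the termination test of step 9; the mock instantiation below only exercises the loop:

```python
def rsvf(v0, L_set, solve_rmdp, K_intersects, max_iter=100):
    """Sketch of Algorithm 1 (RSVF): grow the POV set until the safety
    test K_{s,a}(v_k) intersects P^k_{s,a} for every state-action pair."""
    pov, v, k = [], v0, 0
    while k < max_iter:
        pov = pov + [v]             # step 5: augment POV
        P = L_set(pov)              # step 6: adapted ambiguity sets L_{s,a}(V)
        v, pi = solve_rmdp(P)       # step 7: robust value function and policy
        k += 1
        if K_intersects(v, P):      # step 9: safety test
            return pi, v, k
    raise RuntimeError("no termination; switch to BCI (Section 4.2)")

# mock components that become safe on the third iteration:
L_mock = lambda pov: len(pov)                 # "ambiguity set" = POV size
solve_mock = lambda P: (P, "policy-%d" % P)   # "value" grows with the POV set
safe_mock = lambda v, P: v >= 3
pi, v, k = rsvf(1, L_mock, solve_mock, safe_mock)
print(pi, v, k)   # policy-3 3 3
```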

The following theorem states that Algorithm 1 produces a safe estimate of the true return.

Theorem 3.1. Suppose that Algorithm 1 terminates with a policy π_k and a value function v_k in iteration k. Then the return estimate ρ̃(π_k) = p_0^T v_k is safe:
  P_{P*} [ p_0^T v_k ≤ p_0^T v^{π_k}_{P*} | D ] ≥ 1 − δ .

Before discussing the proof of Theorem 3.1, it is important to mention its limitations. This result shows only that the return estimate is safe; it does not show that it is good. There are, of course, naive safe estimates, such as ρ̃(π) = (1 − γ)^{-1} min_{s,a} r_{s,a}. Since RSVF tightly approximates the optimal ambiguity sets, we expect it to perform significantly better. A theoretical analysis of the approximation error of ρ̃ is beyond the scope of this work; we present empirical evidence in Section 5 instead.

All proofs can be found in Appendix B. The proof is technical but conceptually simple and is based on two main properties. The first is the construction of optimal ambiguity sets for a known value function, as outlined above. The second is the fact that the ambiguity set needs to be robust only with respect to the robust value function v̂^π and not the optimal value function v*. This is subtle but crucial, since v̂^π is a constant while v* is a random variable in the Bayesian setting. The RSVF approach, therefore, does not work when frequentist guarantees are required. Confidence regions, described in Section 4, are designed for situations in which robustness is required with respect to a random variable, and they are therefore overly conservative in our setting. See Appendix E for a more in-depth discussion.

4 Ambiguity Sets as Confidence Regions

In this section, we describe the standard approach of constructing ambiguity sets as multidimensional confidence regions and propose its extension to the Bayesian setting. Confidence regions derived from concentration inequalities have been used previously to compute bounds on the true return in off-policy policy evaluation [41, 42]. These methods, unfortunately, do not readily generalize to the policy optimization setting, which we target. Other work has focused on reducing variance rather than on high-probability bounds [18, 23, 26]. Methods for exploration in reinforcement learning, such as MBIE or UCRL2, also construct ambiguity sets using concentration inequalities [10, 17, 37, 39].
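For contrast with the Bayesian sets above, the standard frequentist construction sets the L1 radius of each ambiguity set from a concentration inequality such as the Weissman et al. bound, with a union bound splitting the failure probability across all S · A state-action pairs. The constants below follow one common variant and should be treated as an illustrative assumption (references differ in the exact form):

```python
import numpy as np

def l1_confidence_radius(n_sa, S, A, delta):
    """L1 ambiguity-set radius around the empirical distribution p_bar_{s,a}
    built from n_sa samples: psi = sqrt((2 / n) * log(S * A * 2^S / delta)).
    One common variant of the Weissman-style bound; constants vary by reference."""
    return np.sqrt((2.0 / n_sa) * (np.log(S * A) + S * np.log(2.0) - np.log(delta)))

# radii shrink as O(1/sqrt(n)) but grow with the number of states S
for n in (10, 100, 1000):
    print(n, l1_confidence_radius(n, S=3, A=1, delta=0.1))
```

Even on the tiny Example 2.1, the radius for n = 10 samples exceeds 1 (the L1 diameter of the simplex is 2), which illustrates why ambiguity sets built this way tend to be hopelessly conservative.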
