VIREL: A Variational Inference Framework for Reinforcement Learning

Matthew Fellows  Anuj Mahajan  Tim G. J. Rudner  Shimon Whiteson
Department of Computer Science, University of Oxford

Abstract

Applying probabilistic models to reinforcement learning (RL) enables the use of powerful optimisation tools such as variational inference in RL. However, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, for example, the lack of mode capturing behaviour in pseudo-likelihood methods, difficulties learning deterministic policies in maximum entropy RL based approaches, and a lack of analysis when function approximators are used. We propose VIREL, a theoretically grounded inference framework for RL that utilises a parametrised action-value function to summarise future dynamics of the underlying MDP, generalising existing approaches. VIREL also benefits from a mode-seeking form of KL divergence, the ability to learn deterministic optimal policies naturally from inference, and the ability to optimise value functions and policies in separate, iterative steps. Applying variational expectation-maximisation to VIREL, we show that the actor-critic algorithm can be reduced to expectation-maximisation, with policy improvement equivalent to an E-step and policy evaluation to an M-step. We derive a family of actor-critic methods from VIREL, including a scheme for adaptive exploration, and demonstrate that our algorithms outperform state-of-the-art methods based on soft value functions in several domains.

1 Introduction

Efforts to combine reinforcement learning (RL) and probabilistic inference have a long history, spanning diverse fields such as control, robotics, and RL [64, 62, 46, 47, 27, 74, 75, 73, 36]. Formalising RL as probabilistic inference enables the application of many approximate inference tools to reinforcement learning, extending models in flexible and powerful ways [35]. However, existing methods at the intersection of RL and inference suffer from several deficiencies. Methods that derive from the pseudo-likelihood inference framework [12, 64, 46, 26, 44, 1] and use expectation-maximisation (EM) favour risk-seeking policies [34], which can be suboptimal. Yet another approach, the MERL inference framework [35] (which we refer to as MERLIN), derives from maximum entropy reinforcement learning (MERL) [33, 74, 75, 73]. While MERLIN does not suffer from the issues of the pseudo-likelihood inference framework, it presents different practical difficulties. These methods do not naturally learn deterministic optimal policies, and constraining the variational policies to be deterministic renders inference intractable [47]. As we show by way of counterexample in Section 2.2, an optimal policy under the reinforcement learning objective is not guaranteed from the optimal MERL objective. Moreover, these methods rely on soft value functions which are sensitive to a pre-defined temperature hyperparameter. Additionally, no existing framework formally accounts for replacing exact value functions with function approximators in the objective; learning function approximators is carried out independently of the inference problem and no analysis of convergence is given for the corresponding algorithms.

Equal contribution. Correspondence to matthew.fellows@cs.ox.ac.uk and anuj.mahajan@cs.ox.ac.uk.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
This paper addresses these deficiencies. We introduce VIREL, an inference framework that translates the problem of finding an optimal policy into an inference problem. Given this framework, we demonstrate that applying EM induces a family of actor-critic algorithms, where the E-step corresponds exactly to policy improvement and the M-step exactly to policy evaluation. Using a variational EM algorithm, we derive analytic updates for both the model and variational policy parameters, giving a unified approach to learning parametrised value functions and optimal policies. We extensively evaluate two algorithms derived from our framework against DDPG [38] and an existing state-of-the-art actor-critic algorithm, soft actor-critic (SAC) [25], on a variety of OpenAI Gym domains [9]. While our algorithms perform similarly to SAC and DDPG on simple low dimensional tasks, they outperform them substantially on complex, high dimensional tasks. The main contributions of this work are: 1) an exact reduction of entropy regularised RL to probabilistic inference using value function estimators; 2) the introduction of a theoretically justified general framework for developing inference-style algorithms for RL that incorporate the uncertainty in the optimality of the action-value function, $Q_\omega(h)$, to drive exploration, but that can also learn optimal deterministic policies; and 3) a family of practical algorithms arising from our framework that adaptively balances exploration-driving entropy with the RL objective and outperforms the current state-of-the-art SAC, reconciling existing advanced actor-critic methods like A3C [43], MPO [1] and EPG [10] into a broader theoretical approach.
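To make the E-step/M-step correspondence concrete, the following minimal sketch (an illustration constructed for this text, not the authors' implementation) alternates the two steps on a small tabular MDP: the M-step fits $Q$ to Bellman targets under the current variational policy (policy evaluation), and the E-step moves the variational policy towards a Boltzmann distribution over the fitted $Q$ (policy improvement). The random MDP, the fixed temperature eps, and all names are assumptions made for this example; the full framework also adapts the temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 4, 3, 0.9, 0.5           # small random MDP; eps is a fixed temperature (an assumption)
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a, s']: transition probabilities
r = rng.uniform(size=(S, A))                # r[s, a]: rewards

Q = np.zeros((S, A))
pi = np.full((S, A), 1.0 / A)               # variational policy, initially uniform

for _ in range(100):
    # M-step (policy evaluation): fit Q to the Bellman targets under the current policy.
    for _ in range(50):
        V = (pi * Q).sum(axis=1)            # V(s') = E_{a'~pi}[Q(s', a')]
        Q = r + gamma * P @ V
    # E-step (policy improvement): move pi towards the Boltzmann policy induced by Q.
    logits = Q / eps
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

print("greedy actions per state:", Q.argmax(axis=1))  # deterministic policy read off at the end
```

With a tabular policy class the E-step can be solved in closed form, which is what the softmax assignment above does; with a parametric policy it would instead be a gradient step on the same objective.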
2 Background

We assume familiarity with probabilistic inference [30] and provide a review in Appendix A.

2.1 Reinforcement Learning

Formally, an RL problem is modelled as a Markov decision process (MDP) defined by the tuple $\langle \mathcal{S}, \mathcal{A}, r, p, p_0, \gamma \rangle$ [54, 59], where $\mathcal{S}$ is the set of states and $\mathcal{A} \subseteq \mathbb{R}^n$ the set of available actions. An agent in state $s \in \mathcal{S}$ chooses an action $a \in \mathcal{A}$ according to the policy $a \sim \pi(\cdot|s)$, forming a state-action pair $h \in \mathcal{H}$, $h := \langle s, a \rangle$. This pair induces a scalar reward according to the reward function $r_t := r(h_t) \in \mathbb{R}$, and the agent transitions to a new state $s' \sim p(\cdot|h)$. The initial state distribution for the agent is given by $s_0 \sim p_0$. We denote a sampled state-action pair at timestep $t$ as $h_t := \langle s_t, a_t \rangle$. As the agent interacts with the environment using $\pi$, it gathers a trajectory $\tau = (h_0, r_0, h_1, r_1, \ldots)$. The value function is the expected, discounted reward for a trajectory, starting in state $s$. The action-value function or $Q$-function is the expected, discounted reward for each trajectory, starting in $h$: $Q^\pi(h) := \mathbb{E}_{p_\pi(\tau|h)}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]$, where $p_\pi(\tau|h) := p(s_1|h_0 = h)\prod_{t'=1}^{\infty}p(s_{t'+1}|h_{t'})\pi(a_{t'}|s_{t'})$. Any $Q$-function satisfies a Bellman equation $\mathcal{T}^\pi Q^\pi(\cdot) = Q^\pi(\cdot)$, where $\mathcal{T}^\pi\cdot := r(h) + \gamma\mathbb{E}_{h'\sim p(s'|h)\pi(a'|s')}[\cdot]$ is the Bellman operator. We consider infinite horizon problems with a discount factor $\gamma \in [0, 1)$. The agent seeks an optimal policy $\pi^* \in \mathrm{argmax}_\pi J^\pi$, where

$J^\pi = \mathbb{E}_{h \sim p_0(s)\pi(a|s)}[Q^\pi(h)]$.    (1)

We denote optimal $Q$-functions as $Q^*(\cdot) := Q^{\pi^*}(\cdot)$ and the set of optimal policies $\Pi^* := \mathrm{argmax}_\pi J^\pi$. The optimal Bellman operator is $\mathcal{T}^*\cdot := r(h) + \gamma\mathbb{E}_{s'\sim p(s'|h)}[\max_{a'}(\cdot)]$.
2.2 Maximum Entropy RL

The MERL objective supplements each reward in the RL objective with an entropy term [61, 74, 75, 73], $J^\pi_{merl} := \mathbb{E}_{p_\pi(\tau)}\left[\sum_{t=0}^{T-1}(r_t - c\log\pi(a_t|s_t))\right]$. The standard RL, undiscounted objective is recovered for $c \to 0$ and we assume $c = 1$ without loss of generality. The MERL objective is often used to motivate the MERL inference framework (which we call MERLIN) [34], mapping the problem of finding the optimal policy, $\pi^*_{merl}(a|s) = \mathrm{argmax}_\pi J^\pi_{merl}$, to an equivalent inference problem. A full exposition of this framework is given by Levine [35] and we discuss the graphical model of MERLIN in comparison to VIREL in Section 3.3. The inference problem is often solved using a message passing algorithm, where the log backward messages are called soft value functions due to their similarity to classic (hard) value functions [63, 48, 25, 24, 35]. The soft $Q$-function is defined as $Q^\pi_{soft}(h) := \mathbb{E}_{q_\pi(\tau|h)}\left[r_0 + \sum_{t=1}^{T-1}(r_t - \log\pi(a_t|s_t))\right]$, where $q_\pi(\tau|h) := p(s_0|h)\prod_{t=0}^{T-1}p(s_{t+1}|h_t)\pi(a_t|s_t)$. The corresponding soft Bellman operator is $\mathcal{T}^\pi_{soft}\cdot := r(h) + \mathbb{E}_{h'\sim p(s'|h)\pi(a'|s')}[\,\cdot - \log\pi(a'|s')]$. Several algorithms have been developed that mirror existing RL algorithms using soft Bellman equations, including maximum entropy policy gradients [35], soft $Q$-learning [24], and soft actor-critic (SAC) [25]. MERL is also compatible with methods that use recall traces [21].
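To contrast the hard and soft operators, the following toy sketch (ours; a discount factor is added so the iteration converges, and $c = 1$ as assumed above) applies $\mathcal{T}^\pi$ and $\mathcal{T}^\pi_{soft}$ to the same tabular MDP under a fixed stochastic policy; the gap between the two fixed points is the discounted entropy collected along future steps.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))      # P[s, a, s']
r = rng.uniform(size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)          # fixed stochastic policy pi(a|s)

Q_hard = np.zeros((S, A))
Q_soft = np.zeros((S, A))
for _ in range(1000):
    # Hard Bellman operator: r(h) + gamma * E_{s'~p, a'~pi}[Q(h')]
    Q_hard = r + gamma * P @ (pi * Q_hard).sum(axis=1)
    # Soft Bellman operator: r(h) + gamma * E_{s'~p, a'~pi}[Q(h') - log pi(a'|s')]
    Q_soft = r + gamma * P @ (pi * (Q_soft - np.log(pi))).sum(axis=1)

# The elementwise gap is non-negative: it is the discounted entropy bonus from s' onwards.
print(np.round(Q_soft - Q_hard, 3))
```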
Figure 1: A discrete MDP counterexample for optimal policy under maximum entropy.

We now outline key drawbacks of MERLIN. It is well-understood that optimal policies under regularised Bellman operators are more stochastic than under their equivalent unregularised operators [20]. While this can lead to improved exploration, the optimal policy under these operators will still be stochastic, meaning optimal deterministic policies are not learnt naturally. This leads to two difficulties: 1) a deterministic policy can be constructed by taking the action $a^* = \mathrm{argmax}_a\,\pi^*_{merl}(a|s)$, corresponding to the maximum a posteriori (MAP) policy; however, in continuous domains, finding the MAP policy requires optimising the $Q$-function approximator for actions, which is often a deep neural network. A common approximation is to use the mean of a variational policy instead; 2) even if we obtain a good approximation, as we show below by way of counterexample, the deterministic MAP policy is not guaranteed to be the optimal policy under $J^\pi$. Constraining the variational policies to the set of Dirac-delta distributions does not solve this problem either, since it renders the inference procedure intractable [47, 48].

Next, we demonstrate that the optimal policy under $J^\pi$ cannot always be recovered from the MAP policy under $J^\pi_{merl}$. Consider the discrete state MDP shown in Fig. 1, with action set $\mathcal{A} = \{a_1, a_2, a_1^1, \ldots, a_1^{k_1}\}$ and state set $\mathcal{S} = \{s_0, s_1, s_2, s_3, s_4, s_1^1, \ldots, s_1^{k_1}, s_5, \ldots, s_{5+k_2}\}$. All state transitions are deterministic, with $p(s_1|s_0, a_1) = p(s_2|s_0, a_2) = p(s_1^i|s_1, a_1^i) = 1$. All other state transitions are deterministic and independent of the action taken, that is, $p(s_j|\cdot, s_{j-1}) = 1\ \forall j > 2$ and $p(s_5|\cdot, s_1^i) = 1$. The reward function is $r(s_0, a_2) = 1$ and zero otherwise. Clearly the optimal policy under $J^\pi$ has $\pi^*(a_2|s_0) = 1$. Define a maximum entropy reinforcement learning policy as $\pi_{merl}$ with $\pi_{merl}(a_1|s_0) = p_1$, $\pi_{merl}(a_2|s_0) = (1 - p_1)$ and $\pi_{merl}(a_1^i|s_1) = p_1^i$. For $\pi_{merl}$ and $k_2 \geq 5$, we can evaluate $J^\pi_{merl}$ for any scaling constant $c$ and discount factor $\gamma$ as:

$J^\pi_{merl} = (1 - p_1)(1 - c\log(1 - p_1)) - p_1\left(c\log p_1 + \gamma c\sum_{i=1}^{k_1} p_1^i\log p_1^i\right)$.    (2)

We now find the optimal MERL policy. Note that $p_1^i = \frac{1}{k_1}$ maximises the final term in Eq. (2). Substituting for $p_1^i = \frac{1}{k_1}$, then taking derivatives of Eq. (2) with respect to $p_1$, and setting to zero, we find $p_1^* = \pi^*_{merl}(a_1|s_0)$ as:

$1 - c\log(1 - p_1^*) = \gamma c\log(k_1) - c\log p_1^* \implies p_1^* = \frac{1}{k_1^{-\gamma}\exp\left(\frac{1}{c}\right) + 1}$,

hence, for any $k_1^\gamma \geq \exp\left(\frac{1}{c}\right)$, $p_1^* \geq \frac{1}{2}$ and so $\pi^*$ cannot be recovered from $\pi^*_{merl}$, even using the mode action $a_1 = \mathrm{argmax}_a\,\pi^*_{merl}(a|s_0)$.
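The counterexample can be checked numerically. The snippet below (an added illustration; the constants $c = 1$, $\gamma = 0.99$, $k_1 = 10$ are arbitrary choices satisfying $k_1^\gamma \geq \exp(1/c)$) evaluates Eq. (2) with $p_1^i = 1/k_1$ on a grid over $p_1$ and compares the maximiser with the closed form above; the MERL-optimal probability of the zero-reward action $a_1$ exceeds $\frac{1}{2}$, so the MAP action at $s_0$ disagrees with the RL-optimal action $a_2$.

```python
import numpy as np

c, gamma, k1 = 1.0, 0.99, 10            # illustrative constants; k1^gamma > exp(1/c) here

def J_merl(p1):
    """Eq. (2) with the inner policy fixed to its maximiser p_1^i = 1/k1."""
    return (1 - p1) * (1 - c * np.log(1 - p1)) - p1 * (c * np.log(p1) - gamma * c * np.log(k1))

p1_grid = np.linspace(1e-6, 1 - 1e-6, 100001)
p1_hat = p1_grid[np.argmax(J_merl(p1_grid))]            # numerical maximiser of Eq. (2)
p1_closed = 1.0 / (k1 ** (-gamma) * np.exp(1 / c) + 1)  # closed form derived above

print(f"numerical p1* = {p1_hat:.4f}, closed form = {p1_closed:.4f}")
print("MERL MAP action at s0 is a1 (suboptimal):", p1_hat > 0.5)
```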
The degree to which the MAP policy varies from the optimal unregularised policy depends on both the value of $c$ and $k_1$, the latter controlling the number of states with sub-optimal reward. Our counterexample illustrates that when there are large regions of the state-space with sub-optimal reward, the temperature must be comparatively small to compensate, hence algorithms derived from MERLIN become very sensitive to temperature. As we discuss in Section 3.3, this problem stems from the fact that MERL policies optimise for expected reward and long-term expected entropy. While initially beneficial for exploration, this can lead to sub-optimal policies being learnt in complex domains, as there is often too little a priori knowledge about the MDP to make it possible to choose an appropriate value or schedule for $c$. Finally, a minor issue with MERLIN is that many existing models are defined for finite-horizon problems [35, 48]. While it is possible to discount and extend MERLIN to infinite-horizon problems, doing so is often nontrivial and can alter the objective [60, 25].

2.3 Pseudo-Likelihood Methods

A related but distinct approach is to apply Jensen's inequality directly to the RL objective $J^\pi$. Firstly, we rewrite Eq. (1) as an expectation over $\tau$ to obtain $J^\pi = \mathbb{E}_{h\sim p_0(s)\pi(a|s)}[Q^\pi(h)] = \mathbb{E}_{p_\pi(\tau)}[R(\tau)]$, where $R(\tau) = \sum_{t=0}^{T-1}\gamma^t r_t$ and $p_\pi(\tau) = p_0(s_0)\pi(a_0|s_0)\prod_{t=0}^{T-1}p(h_{t+1}|h_t)$. We then treat $p_\pi(R, \tau) = R(\tau)p_\pi(\tau)$ as a joint distribution, and if rewards are positive and bounded, Jensen's inequality can be applied, enabling the derivation of an evidence lower bound (ELBO). Inference algorithms such as EM can then be employed to find a policy that optimises the pseudo-likelihood objective [12, 64, 46, 26, 44, 1]. Pseudo-likelihood methods can also be extended to a model-based setting by defining a prior over the environment's transition dynamics.
Similarly, we derive an off-policy operator based on a Boltzmann distribution with a diminishing temperature in Appendix F.2 that is a member of $\mathcal{T}$. Observe that soft Bellman operators are not members of $\mathcal{T}$ as the optimal policy under $J^\pi_{merl}$ is not deterministic, hence algorithms such as SAC cannot be derived from the VIREL framework. One problem remains: calculating the normalisation constant to sample directly from the Boltzmann distribution in Eq. (3) is intractable for many MDPs and function approximators. As such, we look to variational inference to learn an approximate variational policy $\pi_\theta(a|s) \approx \pi_\omega(a|s)$, parametrised by $\theta \in \Theta$ with finite variance and the same support as $\pi_\omega(a|s)$. This suggests optimising a new objective that penalises $\pi_\theta(a|s)$ when $\pi_\theta(a|s) \neq \pi_\omega(a|s)$ but still has a global maximum at $\varepsilon_\omega = 0$. A tractable objective that meets these requirements is the evidence lower bound (ELBO) on the unnormalised potential of the Boltzmann distribution, defined as $\omega^*, \theta^* \in \mathrm{argmax}_{\omega,\theta}\mathcal{L}(\omega, \theta)$,

$\mathcal{L}(\omega, \theta) := \mathbb{E}_{s\sim d(s)}\left[\mathbb{E}_{a\sim\pi_\theta(a|s)}\left[\frac{Q_\omega(h)}{\varepsilon_\omega}\right] + \mathcal{H}(\pi_\theta(a|s))\right]$,    (4)

where $q_\theta(h) := d(s)\pi_\theta(a|s)$ is a variational distribution, $\mathcal{H}(\cdot)$ denotes the differential entropy of a distribution, and $d(s)$ is any arbitrary sampling distribution with support over $\mathcal{S}$. From Eq. (4), maximising our objective with respect to $\omega$ is achieved when $\varepsilon_\omega \to 0$ and hence $\mathcal{L}(\omega, \theta)$ satisfies 1 and 2. As we show in Lemma 1, $\mathcal{H}(\cdot)$ in Eq. (4) causes $\mathcal{L}(\omega, \theta) \to -\infty$ whenever $\pi_\theta(a|s)$ is a Dirac-delta distribution for all $\varepsilon_\omega > 0$. This means our objective heavily penalises premature convergence of our variational policy to greedy Dirac-delta policies except under optimality. We discuss a probabilistic interpretation of our framework in Appendix B, where it can be shown that $\pi_\omega(a|s)$ characterises our model's uncertainty in the optimality of $Q_\omega(h)$. We now motivate $\mathcal{L}(\omega, \theta)$ from an inference perspective: in Appendix D.1, we write $\mathcal{L}(\omega, \theta)$ in terms of the log-normalisation constant of the Boltzmann distribution and the KL divergence between the action-state normalised Boltzmann distribution, $p_\omega(h)$, and the variational distribution, $q_\theta(h)$:

$\mathcal{L}(\omega, \theta) = \ell(\omega) - \mathrm{KL}(q_\theta(h)\,\|\,p_\omega(h)) - \mathcal{H}(d(s))$,    (5)

where $\ell(\omega) := \log\int_{\mathcal{H}}\exp\left(\frac{Q_\omega(h)}{\varepsilon_\omega}\right)dh$ and $p_\omega(h) := \frac{\exp\left(\frac{Q_\omega(h)}{\varepsilon_\omega}\right)}{\int_{\mathcal{H}}\exp\left(\frac{Q_\omega(h)}{\varepsilon_\omega}\right)dh}$.

As the KL divergence in Eq. (5) is always positive and the final entropy term has no dependence on $\omega$ or $\theta$, maximising our objective for $\theta$ always reduces the KL divergence between $\pi_\omega(a|s)$ and $\pi_\theta(a|s)$ for any $\varepsilon_\omega > 0$, with $\pi_\theta(a|s) = \pi_\omega(a|s)$ achieved under exact representability (see Theorem 3). This yields a tractable way to estimate $\pi_\omega(a|s)$ at any point during our optimisation procedure by maximising $\mathcal{L}(\omega, \theta)$ for $\theta$. From Eq. (5), we see that our objective satisfies 3, as we minimise the mode-seeking direction of the KL divergence, $\mathrm{KL}(q_\theta(h)\,\|\,p_\omega(h))$, and our objective is an ELBO, which is the starting point for inference algorithms [30, 4, 17]. When the RL problem is solved and $\varepsilon_\omega = 0$, our objective tends towards infinity for any variational distribution that is non-deterministic (see Lemma 1). This is of little consequence, however, as whenever $\varepsilon_\omega = 0$, our approximator is the optimal value function, $Q_\omega(h) = Q^*(h)$ (Theorem 2), and hence, $\pi_\omega(a|s)$ can be inferred exactly by finding $\max_{a'}Q_\omega(a', s)$ or by using the policy gradient $\nabla_\theta\mathbb{E}_{d(s)\pi_\theta(a|s)}[Q_\omega(h)]$ (see Section 4.2).
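For intuition, the sketch below (a finite, discrete-action construction added for illustration; the framework itself is stated for general $\mathcal{H}$ and function approximators) evaluates Eq. (4) for a random tabular $Q_\omega$, checks numerically that it agrees with the decomposition in Eq. (5), and confirms that choosing $\pi_\theta(\cdot|s) \propto \exp(Q_\omega(s,\cdot)/\varepsilon_\omega)$ attains the maximum over the policy.

```python
import numpy as np
from scipy.special import softmax, logsumexp

rng = np.random.default_rng(3)
S, A, eps = 4, 3, 0.7                       # finite state-action space, eps = eps_omega > 0
Q = rng.normal(size=(S, A))                 # a random Q_omega table
d = rng.dirichlet(np.ones(S))               # sampling distribution d(s)

def elbo(pi):
    """Eq. (4): E_d[ E_pi[Q/eps] + H(pi(.|s)) ] for a tabular policy pi[s, a]."""
    return d @ (pi * (Q / eps - np.log(pi))).sum(axis=1)

def elbo_eq5(pi):
    """Right-hand side of Eq. (5): l(omega) - KL(q_theta || p_omega) - H(d)."""
    l_omega = logsumexp(Q / eps)            # log-normalisation constant over all (s, a)
    q = d[:, None] * pi                     # q_theta(h) = d(s) pi(a|s)
    p = softmax(Q / eps)                    # joint Boltzmann distribution p_omega(h)
    kl = (q * (np.log(q) - np.log(p))).sum()
    return l_omega - kl + (d * np.log(d)).sum()   # last term is -H(d)

pi_boltz = softmax(Q / eps, axis=1)         # per-state Boltzmann policy pi_omega(a|s)
pi_other = rng.dirichlet(np.ones(A), size=S)

print(np.isclose(elbo(pi_boltz), elbo_eq5(pi_boltz)))   # Eq. (4) == Eq. (5)
print(np.isclose(elbo(pi_other), elbo_eq5(pi_other)))
print(elbo(pi_boltz) > elbo(pi_other))                  # Boltzmann policy maximises L over theta
```

The first two checks print True, confirming the identity in Eq. (5); the last prints True because, per state, the Boltzmann policy maximises $\mathbb{E}_\pi[Q_\omega/\varepsilon_\omega] + \mathcal{H}(\pi)$.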
53、n the limit of! 0 . To the authors knowledge, this is the fi rst rigorous proof of this result. Theorem 2 shows that fi nding an optimal policy that maximises the RL objective in Eq. (1) reduces to fi nding the Boltzmann distribution associated with the parameters!2 argmax!L(!,) . The existence of s
54、uch a distribution is a suffi cient condition for the policy to be optimal. Theorem 3 shows that whenever! 0, maximising our objective foralways reduces the KL divergence between!(a|s)and(a|s), providing a tractable method to infer the current Boltzmann policy. Theorem 1(Convergence of Boltzmann Dis
55、tribution to Dirac Delta).Letp: X ! 0,1be a Boltzmann distribution with temperature 2 R?0,p(x) = exp(f(x) ) R X exp(f(x) )dx, wheref : X ! Yis a function that satisfi es Defi nition 1. In the limit ! 0, p(x) ! ?(x = supx0f(x0). Proof. See Appendix D.2 Lemma 1(Lower and Upper limits ofL(!,).i) For an
56、y! 0and(a|s) = ?(a), we have L(!,) = ?1. ii) For Q!(h) 0 and any non-deterministic (a|s), lim!0L(!,) = 1. Proof. See Appendix D.3. Theorem 2(Optimal Boltzmann Distributions as Optimal Policies).For!that maximisesL(!,) defi ned in Eq. (4), the corresponding Boltzmann policy induced must be optimal, i
57、.e.,!, 2 argmax!,L(!,) =) !(a|s) 2 . Proof. See Appendix D.3. Theorem3(Maximising the ELBO for).For any!0,maxL(!,)= Ed(s)minKL(a|s) k !(a|s) with !(a|s) = (a|s) under exact representability. Proof. See Appendix D.4. 3.3ComparingVIRELandMERLINFrameworks Figure 2: Graphical models forMERLINand VIREL(v
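Theorem 1 can be visualised with a short numerical illustration (added here; a discrete grid stands in for $\mathcal{X}$): as the temperature shrinks, the Boltzmann distribution over a fixed $f$ concentrates its mass on the maximiser.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 601)              # a discrete grid standing in for X
f = -(x - 1.0) ** 2                          # a bounded f with a unique maximiser at x = 1

for temp in [5.0, 1.0, 0.1, 0.001]:
    logits = f / temp
    p = np.exp(logits - logits.max())        # Boltzmann weights, shifted for stability
    p /= p.sum()
    # Mass within +/- 0.1 of the maximiser approaches 1 as the temperature -> 0.
    mass_near_max = p[np.abs(x - 1.0) < 0.1].sum()
    print(f"temperature {temp:6.3f}: mass near argmax = {mass_near_max:.3f}")
```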
3.3 Comparing VIREL and MERLIN Frameworks

Figure 2: Graphical models for MERLIN and VIREL (variational approximations are dashed).

To compare MERLIN and VIREL, we consider the probabilistic interpretation of the two models discussed in Appendix B; introducing a binary variable $\mathcal{O} \in \{0, 1\}$ defines a graphical model for our inference problem whenever $\varepsilon_\omega > 0$. Comparing the graphs in Fig. 2, observe that MERLIN models exponential cumulative rewards over entire trajectories. By contrast, VIREL's variational policy models a single step and a function approximator is used to model future expe