Variance Reduced Policy Evaluation with Smooth Function Approximation

Hoi-To Wai, The Chinese University of Hong Kong, Shatin, Hong Kong
Mingyi Hong, University of Minnesota, Minneapolis, MN, USA
Zhuoran Yang, Princeton University, Princeton, NJ, USA
Zhaoran Wang, Northwestern University, Evanston, IL, USA
Kexin Tang, University of Minnesota, Minneapolis, MN, USA

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Policy evaluation with smooth and nonlinear function approximation has shown great potential for reinforcement learning. Compared to linear function approximation, it allows for using a richer class of approximation functions such as neural networks. Traditional algorithms are based on two-timescale stochastic approximation, whose convergence rate is often slow. This paper focuses on an offline setting where a trajectory of $m$ state-action pairs is observed. We formulate the policy evaluation problem as a non-convex primal-dual, finite-sum optimization problem, whose primal sub-problem is non-convex and whose dual sub-problem is strongly concave. We suggest a single-timescale primal-dual gradient algorithm with variance reduction, and show that it converges to an $\epsilon$-stationary point using $O(m/\epsilon)$ calls (in expectation) to a gradient oracle.

1 Introduction

In reinforcement learning (RL) [39], policy evaluation aims to estimate the value function that corresponds to a given policy. It serves as a crucial step in policy optimization algorithms [19, 17, 34, 35] for solving RL tasks. Perhaps the most popular family of methods is temporal-difference (TD) learning [9], which estimates the value function by minimizing loss functions that are based on the Bellman equation. These methods can readily incorporate function approximation and have enjoyed huge empirical success, e.g., when the value functions are parametrized by deep neural networks [26, 36]. In contrast to the wide application of policy evaluation with nonlinear function approximation, most analytical results on policy evaluation focus on the linear setting [41, 40, 23, 14, 42, 45, 3, 37, 8]. However, when it comes to nonlinear function approximation, TD methods can be divergent [2, 43]. As a remedy, Bhatnagar et al. [4] proposed an online algorithm for minimizing a generalized mean-squared projected Bellman error (MSPBE) with smooth and nonlinear value functions. Asymptotic convergence of this algorithm is established based on two-timescale stochastic approximation [5, 18] with diminishing step sizes. In a similar vein, Chung et al. [7] established the convergence of TD learning with neural networks utilizing different step sizes for the top layer and the lower layers. However, non-asymptotic convergence results for nonlinear policy evaluation remain an open problem, illustrating a clear gap between theory and practice.

In this work, we make the first attempt to bridge this gap by studying policy evaluation with smooth and nonlinear function approximation. We focus on the offline setting where we are provided with $m$ consecutive transitions from the policy to be evaluated, which is an important RL regime [20] and is closely related to the technique of experience replay [21]. Our contributions are two-fold: We recast

the MSPBE minimization problem as a primal-dual optimization via Fenchel's duality. Here, the objective function is a finite sum, and is non-convex in the primal and strongly concave in the dual, constituting a one-sided non-convex primal-dual optimization problem. A variance reduced algorithm (cf. the nPD-VR algorithm in Algorithm 1) is developed and applied to tackle the nonlinear policy evaluation problem. The algorithm performs primal-dual updates based on a single transition and has low per-iteration computational complexity. Unlike existing algorithms, the proposed algorithm uses a fixed set of step sizes, which is easier to tune, and requires only a single for-loop to implement. We analyze the non-asymptotic performance of the algorithm and show that it converges to an $\epsilon$-stationary point of the MSPBE within $O(m/\epsilon)$ calls to a gradient oracle, in expectation. Note that the optimization problem that arises is strictly more challenging than the plain non-convex minimization problems of recent interest, e.g., [30, 1, 15]. For instance, naive gradient-based updates for this problem might exhibit bizarre behaviors such as cycling [10]. To the best of our knowledge, the result in this paper constitutes the first convergence rate analysis for variance reduced policy evaluation with smooth and nonlinear function approximation.

Related Work. Our work extends the research on policy evaluation with linear function approximation [41, 40, 23, 38, 14, 42, 8, 45, 3, 44, 6, 37, 13]; see [9] for a comprehensive review. Among these works, ours is closely related to [14, 44, 6], which study single- and multi-agent policy evaluation in the offline setting. Besides, they utilize Fenchel's duality to obtain primal-dual optimization problems with a finite-sum structure, for which they provide variance-reduced optimization algorithms. Thanks to the linear function approximation, their objectives are strongly convex-concave, which enables a linear rate of convergence. Furthermore, [4, 7] seem to be the only convergent policy evaluation results with nonlinear function approximation. Both of their algorithms

utilize two-timescale step sizes, which may yield slow convergence. Moreover, their convergence results depend on two-timescale stochastic approximation [5, 18], which uses the trajectory of an ODE to approximate that of a stochastic process. When specialized to an offline setting similar to ours, [4] can be viewed as the primal-dual stochastic gradient algorithm for our problem.

From the optimization point of view, the non-convex primal-dual optimization (a.k.a. non-convex min-max) problem that arises in the above nonlinear policy evaluation setting is difficult to tackle. Although recent works have focused on non-convex minimization problems [30, 1, 15], only a few have focused on non-convex min-max problems. Recently, Daskalakis and Panageas [10] and Daskalakis et al. [11] studied the convergence of vanilla gradient descent/ascent (GDA), where the authors focused on bilinear problems (thus without the non-convex component). An optimistic mirror descent algorithm is proposed in [25], and its convergence to a saddle point is established under a certain strong coherence assumption. In [28], algorithms for robust machine learning problems have been proposed, where the problem is linear on one side and non-convex on the other. In [29], a proximally guided stochastic mirror descent method (PG-SMD) is proposed, which updates the variables simultaneously, while adopting a double-loop update rule in which the variables are updated in "stages". These algorithms yield convergence rates in the order of $O(1/\sqrt{K})$ and $O(1/K^{1/4})$, respectively. Recently, an oracle-based non-convex stochastic gradient descent method for generative adversarial networks was proposed in [31, 32], where the algorithm solves the maximization subproblem up to some small error. It was shown in [24] that a deterministic gradient descent/ascent min-max algorithm has an $O(1/K)$ convergence rate. In [27], the $O(1/K)$ convergence rate was proved for nonconvex-nonconcave min-max problems under Polyak-Łojasiewicz conditions.

Organization. In §2 we describe the setup for the policy evaluation problem with smooth (possibly nonlinear) function approximation. In §3 we describe the variance reduced method for policy evaluation and present the results from a preliminary numerical experiment. In §4 we provide the main convergence result of this paper for the proposed variance reduced method, in which a few key lemmas and a proof outline are presented.

2 Markov Decision Process and Nonlinear Function Approximation

Consider a Markov Decision Process (MDP) defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. We have denoted $\mathcal{S}$ as the state space and $\mathcal{A}$ as the action space; note that both $\mathcal{S}, \mathcal{A}$ can be infinite. Let $s \in \mathcal{S}$, $a \in \mathcal{A}$ be a state and an action, respectively. For each $a \in \mathcal{A}$, the operator $P^a$ is a Markov kernel describing the state transition upon taking action $a$. For any measurable function $f$ on $\mathcal{S}$, we have

\[ (P^a f)(s) = \int_{\mathcal{S}} P^a(s, s')\, f(s')\, \mathrm{d}s', \quad \forall\, s \in \mathcal{S}. \tag{1} \]

Lastly, the reward function $R(s,a)$ is the reward received after taking action $a$ in state $s$, and $\gamma \in (0,1)$ is the discount factor. A policy is defined through the conditional probability $\pi(a|s)$ of taking action $a$ given the current state $s$. Given a policy $\pi$, the expected instantaneous reward at state $s$ is defined as

\[ R^\pi(s) := \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ R(s,a) \big], \quad \forall\, s \in \mathcal{S}. \tag{2} \]

In the policy evaluation problem, we are interested in the value function $V^\pi : \mathcal{S} \to \mathbb{R}$, defined as the discounted total reward over an infinite horizon with the initial state fixed at $s \in \mathcal{S}$:

\[ V^\pi(s) := \mathbb{E}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, s_0 = s,\ a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim P^{a_t}(s_t, \cdot) \Big]. \tag{3} \]

Let $\mathcal{M}(\mathcal{S})$ be the manifold of value functions given the state space $\mathcal{S}$. We define the Bellman operator $T^\pi : \mathcal{M}(\mathcal{S}) \to \mathcal{M}(\mathcal{S})$ as

\[ (T^\pi f)(s) := \mathbb{E}\big[ R(s,a) + \gamma f(s') \,\big|\, a \sim \pi(\cdot|s),\ s' \sim P^a(s, \cdot) \big], \quad \forall\, s \in \mathcal{S}, \tag{4} \]

where $f$ is any measurable function defined on $\mathcal{S}$. Denote by $V^\pi(s)$ (resp. $R^\pi(s)$) the value (resp. expected instantaneous reward) of the policy $\pi$ when initialized at a state $s \in \mathcal{S}$. The Bellman equation [39] shows that the value function $V^\pi : \mathcal{S} \to \mathbb{R}$ satisfies

\[ V^\pi(s) = R^\pi(s) + \gamma (P^\pi V^\pi)(s) = (T^\pi V^\pi)(s), \quad \forall\, s \in \mathcal{S}, \tag{5} \]

where we have defined the operator $P^\pi(\cdot,\cdot)$ as the expected Markov kernel of the policy $\pi$:

\[ P^\pi(s, s') := \int_{\mathcal{A}} P^a(s, s')\, \pi(a|s)\, \mathrm{d}a, \quad \forall\, (s, s') \in \mathcal{S} \times \mathcal{S}. \tag{6} \]

In the above sense, the policy evaluation problem refers to solving for $V^\pi : \mathcal{S} \to \mathbb{R}$ which satisfies (5).

2.1 Nonlinear Function Approximation

Solving for

the function $V^\pi : \mathcal{S} \to \mathbb{R}$ in (5) is a non-trivial task, since the state space $\mathcal{S}$ is large (or even infinite) and the expected Markov kernel $P^\pi(\cdot,\cdot)$ is unknown. To address the first issue, a common approach is to approximate $V^\pi(s)$ by a parametric family of functions. This paper considers approximating $V^\pi : \mathcal{S} \to \mathbb{R}$ from the family of parametric and smooth functions given by $\mathcal{F} = \{ V_\theta : \theta \in \Theta \}$, where $\theta$ is a $d$-dimensional parameter vector and $\Theta$ is a compact, convex subset of $\mathbb{R}^d$. Note that $\mathcal{F}$ forms a differentiable manifold. For each $\theta$, $V_\theta$ is a map from $\mathcal{S}$ to $\mathbb{R}$, and the function is non-linear w.r.t. $\theta$. As we consider the family of smooth functions, the gradient and Hessian of $V_\theta(s)$ w.r.t. $\theta$ exist, and they are denoted as

\[ g_\theta(s) := (\nabla_\theta V_\theta)(s) \in \mathbb{R}^d, \qquad H_\theta(s) := (\nabla^2_\theta V_\theta)(s) \in \mathbb{R}^{d \times d}, \tag{7} \]

for each $s \in \mathcal{S}$ and $\theta \in \Theta$. We define $G_\theta := \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ g_\theta(s)\, g_\theta(s)^\top \big] \in \mathbb{R}^{d \times d}$, where $p^\pi(\cdot)$ is the stationary distribution of the MDP under policy $\pi$. Throughout this paper, we assume that $G_\theta$ is a positive definite matrix for all $\theta \in \Theta$.

To find the best parameter $\theta^\star$ such that $V_{\theta^\star} : \mathcal{S} \to \mathbb{R}$ is the closest approximation to a value function $V^\pi$ that satisfies (5), Bhatnagar et al. [4] proposed to minimize the mean squared projected Bellman error (MSPBE) defined as follows:

\[ J(\theta) := \tfrac{1}{2}\, \big\| \Pi (T^\pi V_\theta) - V_\theta \big\|^2_{p^\pi(\cdot)}, \tag{8} \]

where the weighted norm $\| V \|^2_{p^\pi(\cdot)} = \int_{\mathcal{S}} p^\pi(s)\, |V(s)|^2\, \mathrm{d}s$ is defined with the stationary distribution $p^\pi(s)$, and $\Pi$ is a projection onto the space of nonlinear functions $\mathcal{F}$ w.r.t. the metric $\|\cdot\|_{p^\pi(\cdot)}$, i.e., for any $f : \mathcal{S} \to \mathbb{R}$, we have $\Pi f = \operatorname{argmin}_{V \in \mathcal{F}} \| f - V \|^2_{p^\pi(\cdot)}$. The following identities are shown in [4]:

\[
J(\theta)
= \tfrac{1}{2}\, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (T^\pi V_\theta(s) - V_\theta(s))\, g_\theta(s) \big]^\top G_\theta^{-1}\, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (T^\pi V_\theta(s) - V_\theta(s))\, g_\theta(s) \big]
= \tfrac{1}{2}\, \big\| \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (T^\pi V_\theta(s) - V_\theta(s))\, g_\theta(s) \big] \big\|^2_{G_\theta^{-1}}
= \max_{w \in \mathbb{R}^d} \Big\{ -\tfrac{1}{2}\, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (w^\top g_\theta(s))^2 \big] + \big\langle w,\ \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (T^\pi V_\theta(s) - V_\theta(s))\, g_\theta(s) \big] \big\rangle \Big\},
\tag{9}
\]

where the last equality is due to Fenchel's duality. With the above equivalence, the MSPBE minimization problem can be reformulated as a primal-dual optimization problem:

\[ \min_{\theta \in \Theta}\ \max_{w \in \mathbb{R}^d}\ L(\theta, w), \quad \text{where} \tag{10} \]

\[ L(\theta, w) := \big\langle w,\ \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (T^\pi V_\theta(s) - V_\theta(s))\, g_\theta(s) \big] \big\rangle - \tfrac{1}{2}\, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (w^\top g_\theta(s))^2 \big]. \tag{11} \]

For convenience, we call $\theta$ the primal variable and $w$ the dual variable. For any fixed $\theta \in \Theta$, the function $L(\theta, w)$ is strongly concave in $w$ since $G_\theta$ is positive definite. Moreover, the primal and dual gradients are given respectively by

\[
\nabla_\theta L(\theta, w) = \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ \big( T^\pi V_\theta(s) - V_\theta(s) - g_\theta(s)^\top w \big)\, H_\theta(s)\, w \big] + \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (g_\theta(s)^\top w)\, \big( \gamma\, \mathbb{E}_{s' \sim P^\pi(\cdot|s)}[ g_\theta(s') ] - g_\theta(s) \big) \big],
\]
\[
\nabla_w L(\theta, w) = \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ (T^\pi V_\theta(s) - V_\theta(s))\, g_\theta(s) \big] - \mathbb{E}_{s \sim p^\pi(\cdot)}\big[ g_\theta(s)\, g_\theta(s)^\top \big]\, w.
\tag{12}
\]

The above follows from the gradient of the temporal difference error:

\[ \nabla_\theta \big( T^\pi V_\theta(s) - V_\theta(s) \big) = \gamma\, \mathbb{E}\big[ g_\theta(s') \,\big|\, s' \sim P(\cdot|s,a),\ a \sim \pi(\cdot|s) \big] - g_\theta(s), \tag{13} \]

and we have denoted $\mathbb{E}_{s' \sim P^\pi(\cdot|s)}[ g_\theta(s') ]$ as the expectation considered in the above. The primal-dual projected gradient algorithm proceeds as

\[
\theta^{(k+1)} = P_\Theta\big[ \theta^{(k)} - \alpha_{k+1}\, \nabla_\theta L(\theta^{(k)}, w^{(k)}) \big], \qquad
w^{(k+1)} = w^{(k)} + \beta_{k+1}\, \nabla_w L(\theta^{(k)}, w^{(k)}),
\tag{14}
\]

where $P_\Theta$ denotes the Euclidean projection onto the set $\Theta$. Applying the primal-dual gradient algorithm (14) is difficult, as evaluating the gradients $\nabla_\theta L(\theta^{(k)}, w^{(k)})$, $\nabla_w L(\theta^{(k)}, w^{(k)})$ requires computing the expectations in (12) (and may require computing the second-order moments of the quantities). In addition, while problem (10) is strongly concave in $w$, it is potentially non-convex in $\theta$, as the function $V_\theta(\cdot)$ is non-linear with respect to $\theta \in \Theta$. It is unknown whether the primal-dual gradient algorithm will converge to a stationary (or saddle) point solution, and if it converges, the rate of convergence is unknown.

3 Variance Reduced Policy Evaluation with Nonlinear Approximation

We tackle the policy evaluation

problem with smooth function approximation by focusing on a sampled average version of problem (10). To fix ideas, we observe a trajectory of state-action pairs $\{s_1, a_1, s_2, a_2, \ldots, s_m, a_m, s_{m+1}\}$ generated from the policy $\pi$ that we wish to evaluate, and consider a sample average approximation of the stochastic objective function (11):

\[
L(\theta, w) := \frac{1}{m} \sum_{i=1}^{m} L_i(\theta, w), \quad \text{where} \quad
L_i(\theta, w) := \Big\langle w,\ \big( R(s_i, a_i) + \gamma V_\theta(s_{i+1}) - V_\theta(s_i) \big)\, g_\theta(s_i) \Big\rangle - \tfrac{1}{2} \big( w^\top g_\theta(s_i) \big)^2. \tag{15}
\]

Our goal is to evaluate the stationary point (to be defined later) of the finite-sum, non-convex, primal-dual problem:

\[ \min_{\theta \in \Theta}\ \max_{w \in W}\ L(\theta, w) = \frac{1}{m} \sum_{i=1}^{m} L_i(\theta, w). \tag{16} \]
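To make the finite-sum objective (15)-(16) concrete, the sketch below evaluates the per-sample objective $L_i(\theta, w)$ and the sample average $L(\theta, w)$ on a synthetic trajectory. The smooth approximator $V_\theta(s) = \tanh(\theta^\top s)$, the random data, and all function names are illustrative assumptions for this sketch, not the paper's experimental setup.

```python
import numpy as np

# Assumed smooth, nonlinear approximator (NOT the paper's model):
# V_theta(s) = tanh(theta . s), hence g_theta(s) = (1 - tanh(theta . s)^2) * s.
def V(theta, s):
    return np.tanh(theta @ s)

def g(theta, s):
    return (1.0 - np.tanh(theta @ s) ** 2) * s

def L_i(theta, w, s, r, s_next, gamma=0.9):
    """Per-sample objective of Eq. (15):
    <w, delta_i * g_theta(s_i)> - 0.5 * (w . g_theta(s_i))^2,
    where delta_i = R(s_i, a_i) + gamma * V_theta(s_{i+1}) - V_theta(s_i)."""
    delta = r + gamma * V(theta, s_next) - V(theta, s)
    gs = g(theta, s)
    return w @ (delta * gs) - 0.5 * (w @ gs) ** 2

def L(theta, w, transitions, gamma=0.9):
    """Finite-sum objective of Eq. (16): average of L_i over the trajectory."""
    return np.mean([L_i(theta, w, s, r, sn, gamma) for (s, r, sn) in transitions])

# Synthetic trajectory of m transitions (s_i, R(s_i, a_i), s_{i+1}).
rng = np.random.default_rng(0)
d, m = 3, 5
theta, w = rng.normal(size=d), rng.normal(size=d)
traj = [(rng.normal(size=d), float(rng.normal()), rng.normal(size=d)) for _ in range(m)]
value = L(theta, w, traj)
print(value)
```

Note that for a fixed $\theta$, the objective is a concave quadratic in $w$, which is exactly the structure the dual ascent step exploits.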

Algorithm 1 Nonconvex Primal-Dual Gradient with Variance Reduction (nPD-VR) Algorithm.

1: Input: a trajectory of state-action pairs $\{s_1, a_1, s_2, a_2, \ldots, s_m, a_m, s_{m+1}\}$ generated from a given policy $\pi$; step sizes $\alpha, \beta > 0$; initialization points $\theta^{(0)} \in \Theta$, $w^{(0)} \in \mathbb{R}^d$.
2: Compute the initial averaged gradients as

\[ G^{(0)}_\theta = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L_i(\theta^{(0)}, w^{(0)}), \qquad G^{(0)}_w = \frac{1}{m} \sum_{i=1}^{m} \nabla_w L_i(\theta^{(0)}, w^{(0)}). \tag{18} \]

3: for $k = 0, 1, 2, \ldots, K-1$ do
4: Select two indices $i_k, j_k$ independently and uniformly from $\{1, \ldots, m\}$.
5: Perform the primal-dual updates:

\[
\theta^{(k+1)} = P_\Theta\Big\{ \theta^{(k)} - \alpha\, \big( G^{(k)}_\theta + \nabla_\theta L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}) \big) \Big\},
\]
\[
w^{(k+1)} = w^{(k)} + \beta\, \big( G^{(k)}_w + \nabla_w L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}) \big),
\tag{19}
\]

where the gradients are given by (17).
6: Update the stored variables as

\[
\theta^{(k+1)}_i = \begin{cases} \theta^{(k)} & \text{if } i = j_k \\ \theta^{(k)}_i & \text{if } i \neq j_k \end{cases}, \qquad
w^{(k+1)}_i = \begin{cases} w^{(k)} & \text{if } i = j_k \\ w^{(k)}_i & \text{if } i \neq j_k \end{cases}, \tag{20}
\]
\[
G^{(k+1)}_\theta = G^{(k)}_\theta + \frac{1}{m} \big( \nabla_\theta L_{j_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{j_k}(\theta^{(k)}_{j_k}, w^{(k)}_{j_k}) \big), \qquad
G^{(k+1)}_w = G^{(k)}_w + \frac{1}{m} \big( \nabla_w L_{j_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{j_k}(\theta^{(k)}_{j_k}, w^{(k)}_{j_k}) \big). \tag{21}
\]

7: end for
8: Return: $(\theta^{(\bar K)}, w^{(\bar K)})$, where $\bar K$ is independently and uniformly picked from $\{1, \ldots, K\}$, an approximate stationary point of (16).

Observe that if $m$ is sufficiently large and since $G_\theta$ is positive definite, the primal-dual objective function is strongly concave in $w$ but is possibly

non-convex in $\theta$ due to the non-linearity. The above problem is hence a one-sided non-convex problem, which remains challenging to tackle. An exact primal-dual gradient (PDG) algorithm following (14), but replacing the gradients of the population objective (11) by those of the sample average (15), may be applied to (16). In fact, by exploiting the one-sided non-convexity, Lu et al. [24] showed that an algorithm similar to the PDG algorithm indeed converges sublinearly to a stationary point of (16). However, for large $m \gg 1$, implementing the PDG algorithm involves a high per-iteration complexity, since evaluating the full gradient requires $\Theta(m)$ FLOPS. Our idea is to derive a fast stochastic algorithm for function approximation by borrowing techniques from variance reduction methods [16, 12, 33, 30]. To fix notation, let $i \in \{1, \ldots, m\}$ and define the primal-dual gradient of the $i$th sample:

\[
\begin{pmatrix} \nabla_\theta L_i(\theta, w) \\ \nabla_w L_i(\theta, w) \end{pmatrix}
= \begin{pmatrix}
\big( \delta_i(\theta) - g_\theta(s_i)^\top w \big)\, H_\theta(s_i)\, w + \big( g_\theta(s_i)^\top w \big) \big( \gamma\, g_\theta(s_{i+1}) - g_\theta(s_i) \big) \\
\delta_i(\theta)\, g_\theta(s_i) - \big( g_\theta(s_i)^\top w \big)\, g_\theta(s_i)
\end{pmatrix},
\tag{17}
\]

where $\delta_i(\theta) := R(s_i, a_i) + \gamma V_\theta(s_{i+1}) - V_\theta(s_i)$ is the $i$th sampled temporal difference. We propose the Nonconvex Primal-Dual Gradient with Variance Reduction (nPD-VR) algorithm for (16) in Algorithm 1. The algorithm is a natural extension of the non-convex SAGA algorithm introduced by [30] to the primal-dual, finite-sum setting of interest. Specifically, line 5 performs the primal-dual gradient update through an unbiased estimate of the gradient: denoting

\[
\widetilde G^{(k)}_\theta := G^{(k)}_\theta + \nabla_\theta L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}), \qquad
\widetilde G^{(k)}_w := G^{(k)}_w + \nabla_w L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}),
\tag{22}
\]

since $i_k$ is uniformly picked from $\{1, \ldots, m\}$, the expected values of the quantities $\widetilde G^{(k)}_\theta, \widetilde G^{(k)}_w$ (when conditioned on the past random variables generated up to iteration $k$) are the primal-dual gradients $\nabla_\theta L(\theta^{(k)}, w^{(k)})$, $\nabla_w L(\theta^{(k)}, w^{(k)})$, respectively. Meanwhile, the updates in line 6 keep refreshing the stored variables in memory. We remark that these updates are based on the index $j_k$, which is independent of the $i_k$ used in line 5. As we shall see in the analysis, this subtle detail of the algorithm allows us to prove that the variance of the gradient estimate is reduced, cf.

Lemma 3. As the nPD-VR algorithm employs an incremental update rule similar to the SAGA method, it is suitable for the big-data setting where $m \gg 1$. In particular, the cost of the updates in lines 5 and 6 is independent of $m$. Moreover, the proposed algorithm utilizes a fixed step size rule, which allows for adaptation to more dynamic data. We remark that the existing approaches [4, 7] have studied two-timescale stochastic approximation algorithms for tackling the stochastic problem (10); and in a similar vein, a recent related work [22] proposed a double-loop algorithm that requires solving the dual problem (nearly) optimally. In contrast, the nPD-VR algorithm runs
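The steps of Algorithm 1 can be sketched in code as follows. This is a hedged illustration rather than the authors' implementation: the approximator $V_\theta(s) = \tanh(\theta^\top s)$, the Euclidean-ball choice of $\Theta$ (the `radius` parameter), and the step sizes `alpha`, `beta` are assumptions introduced for the sake of a runnable example; the per-sample gradients follow (17), and the SAGA-style tables, the two independent indices $i_k, j_k$, and the updates follow (18)-(21).

```python
import numpy as np

# Assumed smooth approximator V_theta(s) = tanh(theta . s) (not the paper's model).
def v(theta, s):
    return np.tanh(theta @ s)

def g(theta, s):                       # gradient of V_theta(s) w.r.t. theta
    t = np.tanh(theta @ s)
    return (1 - t ** 2) * s

def hess(theta, s):                    # Hessian of V_theta(s) w.r.t. theta
    t = np.tanh(theta @ s)
    return -2 * t * (1 - t ** 2) * np.outer(s, s)

def grads_i(theta, w, s, r, s_next, gamma):
    """Per-sample primal/dual gradients of L_i, following Eq. (17)."""
    gs, gsn = g(theta, s), g(theta, s_next)
    delta = r + gamma * v(theta, s_next) - v(theta, s)
    gw = gs @ w
    d_theta = (delta - gw) * (hess(theta, s) @ w) + gw * (gamma * gsn - gs)
    d_w = delta * gs - gw * gs
    return d_theta, d_w

def npd_vr(transitions, theta0, w0, alpha, beta, K, gamma=0.9, radius=10.0, seed=0):
    """Sketch of Algorithm 1 (nPD-VR). Stores past iterates (theta_i, w_i) and
    recomputes their gradients on demand (same math as storing gradients)."""
    rng = np.random.default_rng(seed)
    m = len(transitions)
    theta, w = theta0.astype(float).copy(), w0.astype(float).copy()
    thetas = [theta.copy() for _ in range(m)]       # stored iterates, line 6
    ws = [w.copy() for _ in range(m)]
    init = [grads_i(theta, w, *tr, gamma=gamma) for tr in transitions]
    G_theta = np.mean([gt for gt, _ in init], axis=0)   # Eq. (18)
    G_w = np.mean([gw_ for _, gw_ in init], axis=0)
    history = []
    for _ in range(K):
        ik, jk = int(rng.integers(m)), int(rng.integers(m))  # independent draws
        # SAGA-corrected gradient estimates at (theta, w), Eq. (19)/(22)
        gt_new, gw_new = grads_i(theta, w, *transitions[ik], gamma=gamma)
        gt_old, gw_old = grads_i(thetas[ik], ws[ik], *transitions[ik], gamma=gamma)
        theta_next = theta - alpha * (G_theta + gt_new - gt_old)
        nrm = np.linalg.norm(theta_next)
        if nrm > radius:                     # projection onto the assumed ball Theta
            theta_next *= radius / nrm
        w_next = w + beta * (G_w + gw_new - gw_old)
        # refresh memory and running averages at the independent index jk, (20)-(21)
        gt_cur, gw_cur = grads_i(theta, w, *transitions[jk], gamma=gamma)
        gt_mem, gw_mem = grads_i(thetas[jk], ws[jk], *transitions[jk], gamma=gamma)
        G_theta = G_theta + (gt_cur - gt_mem) / m
        G_w = G_w + (gw_cur - gw_mem) / m
        thetas[jk], ws[jk] = theta.copy(), w.copy()
        theta, w = theta_next, w_next
        history.append((theta.copy(), w.copy()))
    return history[int(rng.integers(K))]     # uniformly sampled iterate, line 8

# Tiny synthetic run: transitions are (s_i, R(s_i, a_i), s_{i+1}) tuples.
rng_data = np.random.default_rng(1)
d, m = 3, 6
transitions = [(rng_data.normal(size=d), float(rng_data.normal()),
                rng_data.normal(size=d)) for _ in range(m)]
theta_hat, w_hat = npd_vr(transitions, np.zeros(d), np.zeros(d),
                          alpha=0.02, beta=0.02, K=100)
print(theta_hat, w_hat)
```

Note how each iteration touches only the two sampled transitions $i_k$ and $j_k$, so the per-iteration cost is independent of $m$, as claimed above.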
