Variance Reduced Policy Evaluation with Smooth Function Approximation

Hoi-To Wai (The Chinese University of Hong Kong, Shatin, Hong Kong)
Mingyi Hong (University of Minnesota, Minneapolis, MN, USA)
Zhuoran Yang (Princeton University, Princeton, NJ, USA)
Zhaoran Wang (Northwestern University, Evanston, IL, USA)
Kexin Tang (University of Minnesota, Minneapolis, MN, USA)

Abstract

Policy evaluation with smooth and nonlinear function approximation has shown great potential for reinforcement learning. Compared to linear function approximation, it allows for using a richer class of approximation functions such as neural networks. Traditional algorithms are based on two-timescale stochastic approximation, whose convergence rate is often slow. This paper focuses on an offline setting where a trajectory of $m$ state-action pairs is observed. We formulate the policy evaluation problem as a non-convex primal-dual, finite-sum optimization problem, whose primal sub-problem is non-convex and dual sub-problem is strongly concave. We suggest a single-timescale primal-dual gradient algorithm with variance reduction, and show that it
converges to an $\epsilon$-stationary point using $O(m/\epsilon)$ calls (in expectation) to a gradient oracle.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

In reinforcement learning (RL) [39], policy evaluation aims to estimate the value function that corresponds to a given policy. It serves as a crucial step in policy optimization algorithms [19, 17, 34, 35] for solving RL tasks. Perhaps the most popular family of methods is temporal-difference (TD) learning [9], which estimates the value function by minimizing loss functions based on the Bellman equation. These methods can readily incorporate function approximation and have enjoyed huge empirical success, e.g., when the value functions are parametrized by deep neural networks [26, 36]. In contrast to the wide application of policy evaluation with nonlinear function approximation, most analytical results on policy evaluation focus on the linear setting [41, 40, 23, 14, 42, 45, 3, 37, 8]. However, when it comes to nonlinear function approximation, TD methods can be divergent [2, 43]. As a remedy, Bhatnagar et al. [4] proposed an online algorithm for minimizing a generalized mean-squared projected Bellman error (MSPBE) with smooth and nonlinear value functions. Asymptotic convergence of this algorithm is established based on two-timescale stochastic approximation [5, 18] with diminishing step sizes. In a similar vein, Chung et al. [7] established the convergence of TD learning with neural networks utilizing different step sizes for the top layer and the lower layers. However, non-asymptotic convergence results for nonlinear policy evaluation remain an open problem, illustrating a clear gap between theory and practice. In this work, we make the first attempt to bridge this gap by studying policy evaluation with smooth and nonlinear function approximation. We focus on the offline setting where we are provided with $m$ consecutive transitions from the policy to be evaluated, which is an important RL regime [20] and is closely related to the technique of experience replay [21]. Our contributions are two-fold: We recast
the MSPBE minimization problem as a primal-dual optimization problem via Fenchel duality. Here, the objective function is a finite sum that is non-convex in the primal variable and strongly concave in the dual variable, constituting a one-sided non-convex primal-dual optimization problem. A variance reduced algorithm (cf. the nPD-VR algorithm in Algorithm 1) is developed and applied to tackle the nonlinear policy evaluation problem. The algorithm performs primal-dual updates based on a single transition and has low per-iteration computational complexity. Unlike existing algorithms, the proposed algorithm uses a fixed
set of step sizes, which is easier to tune, and requires only a single for-loop to implement. We analyze the non-asymptotic performance of the algorithm and show that it converges to an $\epsilon$-stationary point of the MSPBE within $O(m/\epsilon)$ calls to a gradient oracle, in expectation. Note that the optimization problem that arises is strictly more challenging than the plain non-convex minimization problems of recent interest, e.g., [30, 1, 15]. For instance, naive gradient-based updates for this problem might exhibit bizarre behaviors such as cycling [10]. To the best of our knowledge, the result in this paper constitutes the first convergence rate analysis for variance reduced policy evaluation with smooth and nonlinear function approximation.

Related Work. Our work extends the research on policy evaluation with linear function approximation [41, 40, 23, 38, 14, 42, 8, 45, 3, 44, 6, 37, 13]; see [9] for a comprehensive
review. Among these works, ours is closely related to [14, 44, 6], which study single- and multi-agent policy evaluation in the offline setting. They also utilize Fenchel duality to obtain primal-dual optimization problems with a finite-sum structure, for which they provide variance-reduced optimization algorithms. Thanks to the linear function approximation, their objectives are strongly convex-concave, which enables a linear rate of convergence. Furthermore, [4, 7] seem to be the only convergent policy evaluation results with nonlinear function approximation. Both of their algorithms
utilize two-timescale step sizes, which may yield slow convergence. Moreover, their convergence results depend on two-timescale stochastic approximation [5, 18], which uses the trajectory of an ODE to approximate that of a stochastic process. When specialized to an offline setting similar to ours, [4] can be viewed as the primal-dual stochastic gradient algorithm for our problem. From the optimization point of view, the non-convex primal-dual optimization (a.k.a. non-convex min-max problem) that arises in the above nonlinear policy evaluation setting is difficult to tackle. Although recent works have focused
on non-convex minimization problems [30, 1, 15], only a few have focused on non-convex min-max problems. Recently, Daskalakis and Panageas [10] and Daskalakis et al. [11] studied the convergence of vanilla gradient descent/ascent (GDA), where the authors focused on bilinear problems (thus without the non-convex component). An optimistic mirror descent algorithm is proposed in [25], and its convergence to a saddle point is established under a certain strong coherence assumption. In [28], algorithms for robust machine learning problems are proposed, where the problem is linear on one side and non-convex on the other. In [29], a proximally guided stochastic mirror descent method (PG-SMD) is proposed, which updates the variables simultaneously while adopting a double-loop update rule in which the variables are updated in "stages". These algorithms yield convergence rates on the order of $O(1/\sqrt{K})$ and $O(1/K^{1/4})$, respectively. Recently, an oracle-based non-convex stochastic gradient descent method for generative adversarial networks was proposed in [31, 32], where the algorithm solves the maximization subproblem up to some small error. It was shown in [24] that a deterministic gradient descent/ascent min-max algorithm has an $O(1/K)$ convergence rate. In [27], an $O(1/K)$ convergence rate was proved for nonconvex-nonconcave min-max problems under Polyak-Łojasiewicz conditions.

Organization. In Section 2 we describe the setup for the policy evaluation problem with smooth (possibly nonlinear) function approximation. In Section 3 we describe the variance reduced method for policy evaluation and present the results from a preliminary numerical experiment. In Section 4 we provide the main convergence result of this paper for the proposed variance reduced method, in which a few key lemmas and a proof outline are presented.

2 Markov Decision Process and Nonlinear Function Approximation

Consider a Markov Decision Process (MDP) defined by $(S, A, P, R, \gamma)$. We have denoted $S$ as the state space and $A$ as the action space; notice that both $S, A$ can be infinite. Let $s \in S$, $a \in A$ be a state and an action, respectively. For each $a \in A$, the operator $P^a$ is
a Markov kernel describing the state transition upon taking action $a$. For any measurable function $f$ on $S$, we have
$$ (P^a f)(s) = \int_S P^a(s, s')\, f(s') \,\mathrm{d}s', \quad \forall\, s \in S. \tag{1} $$
Lastly, the reward function $R(s,a)$ is the reward received after taking action $a$ in state $s$, and $\gamma \in (0,1)$ is the discount factor. A policy $\pi$ is defined through the conditional probability $\pi(a|s)$ of taking action $a$ given the current state $s$. Given a policy $\pi$, the expected instantaneous reward at state $s$ is defined as
$$ R_\pi(s) := \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ R(s,a) \big], \quad \forall\, s \in S. \tag{2} $$
In the policy evaluation problem, we are interested in the value function $V^\pi : S \to \mathbb{R}$ that is defined
as the discounted total reward over an infinite horizon with the initial state fixed at $s \in S$:
$$ V^\pi(s) := \mathbb{E}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, s_0 = s,\; a_t \sim \pi(\cdot|s_t),\; s_{t+1} \sim P^{a_t}(s_t, \cdot) \Big]. \tag{3} $$
Let $M(S)$ be the manifold of value functions given the state space $S$; we define the Bellman operator $T^\pi : M(S) \to M(S)$ as
$$ (T^\pi f)(s) := \mathbb{E}\big[ R(s,a) + \gamma f(s') \,\big|\, a \sim \pi(\cdot|s),\; s' \sim P^a(s, \cdot) \big], \quad \forall\, s \in S, \tag{4} $$
where $f$ is any measurable function defined on $S$. Denote $V^\pi(s)$ (resp. $R_\pi(s)$) as the value (resp. average reward) for the policy $\pi$ when initialized at a state $s \in S$. The Bellman equation [39] shows that the value function $V^\pi : S \to \mathbb{R}$ satisfies
$$ V^\pi(s) = R_\pi(s) + \gamma \big( P^\pi V^\pi \big)(s) = \big( T^\pi V^\pi \big)(s), \quad \forall\, s \in S, \tag{5} $$
where we have defined the operator $P^\pi(\cdot,\cdot)$ as the expected Markov kernel of the policy $\pi$:
$$ P^\pi(s, s') := \int_A P^a(s, s')\, \pi(a|s) \,\mathrm{d}a, \quad \forall\, (s, s') \in S \times S. \tag{6} $$
In the above sense, the policy evaluation problem refers to solving for $V^\pi : S \to \mathbb{R}$ which satisfies (5).

2.1 Nonlinear Function Approximation

Solving for
the function $V^\pi : S \to \mathbb{R}$ in (5) is a non-trivial task, since the state space $S$ is large (or even infinite) and the expected Markov kernel $P^\pi(\cdot,\cdot)$ is unknown. To address the first issue, a common approach is to approximate $V^\pi(s)$ by a parametric family of functions. This paper considers approximating $V^\pi : S \to \mathbb{R}$ from the family of parametric and smooth functions given by $F = \{ V_\theta : \theta \in \Theta \}$, where $\theta$ is a $d$-dimensional parameter vector and $\Theta$ is a compact, convex subset of $\mathbb{R}^d$. Note that $F$ forms a differentiable manifold. For each $\theta$, $V_\theta$ is a map from $S$ to $\mathbb{R}$, and the function is non-linear w.r.t. $\theta$. As we consider the family of smooth functions,
the gradient and Hessian of $V_\theta(s)$ w.r.t. $\theta$ exist, and they are denoted as
$$ g_\theta(s) := \big( \nabla_\theta V_\theta \big)(s) \in \mathbb{R}^d, \qquad H_\theta(s) := \big( \nabla^2_\theta V_\theta \big)(s) \in \mathbb{R}^{d \times d}, \tag{7} $$
for each $s \in S$ and $\theta \in \Theta$. We define $G_\theta := \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ g_\theta(s) g_\theta(s)^\top \big] \in \mathbb{R}^{d \times d}$, where $p_\pi(\cdot)$ is the stationary distribution of the MDP under policy $\pi$. Throughout this paper, we assume that $G_\theta$ is a positive definite matrix for all $\theta \in \Theta$. To find the best parameter $\theta^\star$ such that $V_{\theta^\star} : S \to \mathbb{R}$ is the closest approximation to a value function $V^\pi$ that satisfies (5), Bhatnagar et al. [4] proposed to minimize the mean squared projected Bellman error (MSPBE), defined as follows:
$$ J(\theta) := \tfrac{1}{2} \big\| \Pi \big( T^\pi V_\theta \big) - V_\theta \big\|^2_{p_\pi(\cdot)}, \tag{8} $$
where the
weighted norm $\| V \|^2_{p_\pi(\cdot)} = \int_S p_\pi(s) |V(s)|^2 \,\mathrm{d}s$ is defined with the stationary distribution $p_\pi(s)$, and $\Pi$ is a projection onto the space of nonlinear functions $F$ w.r.t. the metric $\| \cdot \|_{p_\pi(\cdot)}$, i.e., for any $f : S \to \mathbb{R}$, we have $\Pi f = \arg\min_{V \in F} \| f - V \|^2_{p_\pi(\cdot)}$. The following identities are shown in [4]:
$$ \begin{aligned} J(\theta) &= \tfrac{1}{2}\, \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( T^\pi V_\theta(s) - V_\theta(s) \big) g_\theta(s) \big]^\top G_\theta^{-1}\, \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( T^\pi V_\theta(s) - V_\theta(s) \big) g_\theta(s) \big] \\ &= \tfrac{1}{2} \big\| \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( T^\pi V_\theta(s) - V_\theta(s) \big) g_\theta(s) \big] \big\|^2_{G_\theta^{-1}} \\ &= \max_{w \in \mathbb{R}^d} \Big\{ -\tfrac{1}{2}\, \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( w^\top g_\theta(s) \big)^2 \big] + \big\langle w,\, \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( T^\pi V_\theta(s) - V_\theta(s) \big) g_\theta(s) \big] \big\rangle \Big\}, \end{aligned} \tag{9} $$
where the last equality is due to Fenchel duality. With the above equivalence, the MSPBE minimization problem can be reformulated as a primal-dual
optimization problem:
$$ \min_{\theta \in \Theta}\; \max_{w \in \mathbb{R}^d}\; L(\theta, w), \tag{10} $$
where
$$ L(\theta, w) := \big\langle w,\, \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( T^\pi V_\theta(s) - V_\theta(s) \big) g_\theta(s) \big] \big\rangle - \tfrac{1}{2}\, \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( w^\top g_\theta(s) \big)^2 \big]. \tag{11} $$
For convenience, we call $\theta$ the primal variable and $w$ the dual variable. For any fixed $\theta \in \Theta$, the function $L(\theta, w)$ is strongly concave in $w$, since $G_\theta$ is positive definite. Moreover, the primal and dual gradients are given respectively by
$$ \begin{aligned} \nabla_\theta L(\theta, w) &= \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( T^\pi V_\theta(s) - V_\theta(s) - g_\theta(s)^\top w \big) H_\theta(s) w \big] + \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( g_\theta(s)^\top w \big) \big( \gamma\, \mathbb{E}_{s' \sim P^\pi(\cdot|s)}[ g_\theta(s') ] - g_\theta(s) \big) \big], \\ \nabla_w L(\theta, w) &= \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ \big( T^\pi V_\theta(s) - V_\theta(s) \big) g_\theta(s) \big] - \mathbb{E}_{s \sim p_\pi(\cdot)}\big[ g_\theta(s) g_\theta(s)^\top \big]\, w. \end{aligned} \tag{12} $$
The above follows from the gradient of the temporal difference error:
$$ \nabla_\theta \big( T^\pi V_\theta(s) - V_\theta(s) \big) = \gamma\, \mathbb{E}\big[ g_\theta(s') \,\big|\, s' \sim P^a(\cdot|s),\; a \sim \pi(\cdot|s) \big] - g_\theta(s), \tag{13} $$
and we have denoted $\mathbb{E}_{s' \sim P^\pi(\cdot|s)}[ g_\theta(s') ]$ as the expectation considered above. The primal-dual projected gradient algorithm proceeds as
$$ \theta^{(k+1)} = \mathcal{P}_\Theta\big( \theta^{(k)} - \beta_{k+1} \nabla_\theta L(\theta^{(k)}, w^{(k)}) \big), \qquad w^{(k+1)} = w^{(k)} + \alpha_{k+1} \nabla_w L(\theta^{(k)}, w^{(k)}), \tag{14} $$
where $\mathcal{P}_\Theta$ denotes the Euclidean projection onto the set $\Theta$ (the step-size symbols $\beta_{k+1}, \alpha_{k+1}$ are reconstructed, as the originals were lost in extraction). Applying the primal-dual gradient algorithm (14) is difficult, as evaluating the gradients $\nabla_\theta L(\theta^{(k)}, w^{(k)})$, $\nabla_w L(\theta^{(k)}, w^{(k)})$ requires computing the expectations in (12) (and may require computing second-order moments of the quantities involved). In addition, while problem (10) is strongly concave in $w$, it is potentially non-convex in $\theta$, as the function $V_\theta(\cdot)$ is non-linear with respect to $\theta \in \Theta$. It is unknown whether the primal-dual gradient algorithm will converge to a stationary (or saddle) point solution, and if it converges, the rate of convergence is unknown.

3 Variance Reduced Policy Evaluation with Nonlinear Approximation

We tackle the policy evaluation
problem with smooth function approximation by focusing on a sample-average version of problem (10). To fix ideas, we observe a trajectory of state-action pairs $s_1, a_1, s_2, a_2, \ldots, s_m, a_m, s_{m+1}$ generated from the policy $\pi$ that we wish to evaluate, and consider a sample-average approximation of the stochastic objective function (11) as $L(\theta, w) := \frac{1}{m} \sum_{i=1}^m L_i(\theta, w)$, where
$$ L_i(\theta, w) := \big\langle w,\, \big( R(s_i, a_i) + \gamma V_\theta(s_{i+1}) - V_\theta(s_i) \big)\, g_\theta(s_i) \big\rangle - \tfrac{1}{2} \big( w^\top g_\theta(s_i) \big)^2. \tag{15} $$
Our goal is to evaluate a stationary point (to be defined later) of the finite-sum, non-convex, primal-dual problem:
$$ \min_{\theta \in \Theta}\; \max_{w \in W}\; L(\theta, w) = \frac{1}{m} \sum_{i=1}^m L_i(\theta, w). \tag{16} $$
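To make the finite-sum structure of (15)-(16) concrete, here is a minimal NumPy sketch that evaluates the per-sample objective and the empirical objective on a synthetic trajectory. The toy nonlinear value function $V_\theta(s) = \tanh(\theta^\top \phi(s))$, the feature map $\phi$, and all data below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Toy instance of the finite-sum saddle-point objective (16):
#   L(theta, w) = (1/m) * sum_i [ <w, delta_i(theta) g_theta(s_i)>
#                                 - 0.5 * (w^T g_theta(s_i))^2 ].
# ASSUMPTIONS: V_theta(s) = tanh(theta . phi(s)) and random features/rewards.
rng = np.random.default_rng(0)
d, m, gamma = 5, 20, 0.9

phi = rng.normal(size=(m + 1, d))   # feature vectors phi(s_1), ..., phi(s_{m+1})
rewards = rng.normal(size=m)        # rewards R(s_i, a_i) along the trajectory

def V(theta, i):
    """Value approximation V_theta(s_i), deliberately nonlinear in theta."""
    return np.tanh(theta @ phi[i])

def g(theta, i):
    """Gradient g_theta(s_i) = d V_theta(s_i) / d theta (chain rule on tanh)."""
    return (1.0 - np.tanh(theta @ phi[i]) ** 2) * phi[i]

def L_i(theta, w, i):
    """Per-sample objective L_i(theta, w) from (15)."""
    delta = rewards[i] + gamma * V(theta, i + 1) - V(theta, i)  # sampled TD error
    gi = g(theta, i)
    return w @ (delta * gi) - 0.5 * (w @ gi) ** 2

def L_emp(theta, w):
    """Empirical objective L(theta, w) = (1/m) * sum_i L_i(theta, w)."""
    return sum(L_i(theta, w, i) for i in range(m)) / m

theta0 = rng.normal(size=d)
w0 = rng.normal(size=d)
```

Note that for any fixed $\theta$ this empirical objective is a concave quadratic in $w$ (its Hessian in $w$ is $-\frac{1}{m}\sum_i g_\theta(s_i) g_\theta(s_i)^\top$), which is exactly the one-sided structure the dual maximization in (16) exploits.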
Algorithm 1: Nonconvex Primal-Dual Gradient with Variance Reduction (nPD-VR) Algorithm.

1: Input: a trajectory of state-action pairs $s_1, a_1, s_2, a_2, \ldots, s_m, a_m, s_{m+1}$ generated from a given policy $\pi$; step sizes $\beta, \alpha > 0$; initialization points $\theta^{(0)} \in \Theta$, $w^{(0)} \in \mathbb{R}^d$.
2: Compute the initial averaged gradients as:
$$ G^{(0)}_\theta = \frac{1}{m} \sum_{i=1}^m \nabla_\theta L_i(\theta^{(0)}, w^{(0)}), \qquad G^{(0)}_w = \frac{1}{m} \sum_{i=1}^m \nabla_w L_i(\theta^{(0)}, w^{(0)}). \tag{18} $$
3: for $k = 0, 1, 2, \ldots, K - 1$ do
4: Select two indices $i_k, j_k$ independently and uniformly from $\{1, \ldots, m\}$.
5: Perform the primal-dual updates:
$$ \begin{aligned} \theta^{(k+1)} &= \mathcal{P}_\Theta\Big\{ \theta^{(k)} - \beta \big( G^{(k)}_\theta + \nabla_\theta L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}) \big) \Big\}, \\ w^{(k+1)} &= w^{(k)} + \alpha \big( G^{(k)}_w + \nabla_w L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}) \big), \end{aligned} \tag{19} $$
where the gradients can be computed via (17).
6: Update the stored variables as:
$$ \theta^{(k+1)}_i = \begin{cases} \theta^{(k)} & \text{if } i = j_k, \\ \theta^{(k)}_i & \text{if } i \neq j_k, \end{cases} \qquad w^{(k+1)}_i = \begin{cases} w^{(k)} & \text{if } i = j_k, \\ w^{(k)}_i & \text{if } i \neq j_k, \end{cases} \tag{20} $$
$$ \begin{aligned} G^{(k+1)}_\theta &= G^{(k)}_\theta + \frac{1}{m} \big( \nabla_\theta L_{j_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{j_k}(\theta^{(k)}_{j_k}, w^{(k)}_{j_k}) \big), \\ G^{(k+1)}_w &= G^{(k)}_w + \frac{1}{m} \big( \nabla_w L_{j_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{j_k}(\theta^{(k)}_{j_k}, w^{(k)}_{j_k}) \big). \end{aligned} \tag{21} $$
7: end for
8: Return: $(\theta^{(\tilde K)}, w^{(\tilde K)})$, where $\tilde K$ is independently and uniformly picked from $\{1, \ldots, K\}$, as an approximate stationary point of (16).

Observe that if $m$ is sufficiently large and since $G_\theta$ is positive definite, the primal-dual objective function is strongly concave in $w$ but is possibly
non-convex in $\theta$ due to non-linearity. The above problem is hence a one-sided non-convex problem, which remains challenging to tackle. An exact primal-dual gradient (PDG) algorithm following (14), but replacing the gradients of $L(\theta, w)$ by those of the empirical objective, may be applied to (15). In fact, by exploiting the one-sided non-convexity, Lu et al. [24] showed that an algorithm similar to the PDG algorithm indeed converges sublinearly to a stationary point of (16). However, for large $m \gg 1$, implementing the PDG algorithm involves a high per-iteration complexity, since evaluating the full gradient requires $O(m)$ FLOPS. Our idea is to derive a fast stochastic algorithm for function approximation by borrowing techniques from variance reduction methods [16, 12, 33, 30]. To fix notations, let $i \in \{1, \ldots, m\}$ and define the primal-dual gradients of the $i$th sample:
$$ \begin{pmatrix} \nabla_\theta L_i(\theta, w) \\ \nabla_w L_i(\theta, w) \end{pmatrix} = \begin{pmatrix} \big( \delta_i(\theta) - g_\theta(s_i)^\top w \big) H_\theta(s_i) w + \big( g_\theta(s_i)^\top w \big) \big( \gamma\, g_\theta(s_{i+1}) - g_\theta(s_i) \big) \\ \delta_i(\theta)\, g_\theta(s_i) - \big( g_\theta(s_i)^\top w \big)\, g_\theta(s_i) \end{pmatrix}, \tag{17} $$
where $\delta_i(\theta) := R(s_i, a_i) + \gamma V_\theta(s_{i+1}) - V_\theta(s_i)$ is the $i$th sampled temporal difference. We propose the Nonconvex Primal-Dual Gradient with Variance Reduction (nPD-VR) algorithm for (16) in Algorithm 1. The algorithm is a natural extension of the non-convex SAGA algorithm introduced by [30] to the primal-dual, finite-sum setting of interest. Specifically, line 5 performs the primal-dual gradient update through an unbiased estimate of the gradient: denoting
$$ \widetilde{G}^{(k)}_\theta := G^{(k)}_\theta + \nabla_\theta L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}), \qquad \widetilde{G}^{(k)}_w := G^{(k)}_w + \nabla_w L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}), \tag{22} $$
since $i_k$ is uniformly picked from $\{1, \ldots, m\}$, the expected values of $\widetilde{G}^{(k)}_\theta, \widetilde{G}^{(k)}_w$ (conditioned on the random variables generated up to iteration $k$) are the primal-dual gradients $\nabla_\theta L(\theta^{(k)}, w^{(k)})$, $\nabla_w L(\theta^{(k)}, w^{(k)})$, respectively. Meanwhile, the updates in line 6 keep refreshing the stored variables in memory. We remark that these updates are based on the index $j_k$, which is independent of the $i_k$ used in line 5. As we shall see in the analysis, this subtle detail in the algorithm allows for proving that the variance in the gradient is reduced; cf. Lemma 3. As the nPD-VR algorithm employs an incremental update rule similar to the SAGA method, it is suitable for the big-data setting where $m \gg 1$. In particular, the cost of the updates in lines 5 and 6 is independent of $m$. Moreover, the proposed algorithm utilizes a fixed step size rule, which allows for adaptation to more dynamic data. We remark that the existing approaches [4, 7] have studied two-timescale stochastic approximation algorithms for tackling the stochastic problem (10); in a similar vein, a recent related work [22] proposed a double-loop algorithm that requires solving the dual problem (nearly) optimally. In contrast, the nPD-VR algorithm runs