iccv2019论文全集9215-policy-continuation-with-hindsight-inverse-dynamics_第1页
iccv2019论文全集9215-policy-continuation-with-hindsight-inverse-dynamics_第2页
iccv2019论文全集9215-policy-continuation-with-hindsight-inverse-dynamics_第3页
iccv2019论文全集9215-policy-continuation-with-hindsight-inverse-dynamics_第4页
iccv2019论文全集9215-policy-continuation-with-hindsight-inverse-dynamics_第5页
已阅读5页,还剩6页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Policy Continuation with Hindsight Inverse Dynamics Hao Sun1, Zhizhong Li1, Xiaotong Liu2, Dahua Lin1, Bolei Zhou1 1The Chinese University of Hong Kong,2Peking University Abstract Solving goal-oriented tasks is an important but challenging problem in reinforce- ment learning (RL). For such tasks, th

2、e rewards are often sparse, making it diffi cult to learn a policy effectively. To tackle this diffi culty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This ap- proach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay. Enabl

3、ing the learning process in a self-imitated manner and thus can be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general, which can work in isolation or be combined with other on-policy and off-policy algorithms. O

4、n two multi-goal tasks GridWorld and FetchReach, PCHID signifi cantly improves the sample effi ciency as well as the fi nal performance1. 1Introduction Imagine you are given the task of Tower of Hanoi with ten disks, what would you probably do to solve this complex problem? This game looks daunting

5、at the fi rst glance. However through trials and errors, one may discover the key is to recursively relocate the disks on the top of the stack from one pod to another, assisted by an intermediate one. In this case, you are actually learning skills from easier sub-tasks and those skills help you to l

6、earn more. This case exemplifi es the procedure of self-imitated curriculum learning, which recursively develops the skills of solving more complex problems. Tower of Hanoi belongs to an important kind of challenging problems in Reinforcement Learning (RL), namely solving the goal-oriented tasks. In

7、 such tasks, rewards are usually very sparse. For example, in many goal-oriented tasks, a single binary reward is provided only when the task is completed 1,2,3 . Previous works attribute the diffi culty of the sparse reward problems to the low effi ciency in experience collection 4. Thus many appro

8、aches have been proposed to tackle this prob- lem, including automatic goal generation 5, self-imitation learning 6, hierarchical reinforcement learning 7, curiosity driven methods 8,9, curriculum learning 1,10, and Hindsight Experience Replay (HER) 11. Most of these works guide the agent by demonst

9、rating on successful choices based on suffi cient exploration to improve learning effi ciency. Differently, HER opens up a new way to learn more from failures, assigning hindsight credit to primal experiences. However, it is limited by only applicable when combined with off-policy algorithms3. In th

10、is paper we propose an approach of goal-oriented RL called Policy Continuation with Hindsight Inverse Dynamics (PCHID), which leverages the key idea of self-imitate learning. In contrast to HER, our method can work as an auxiliary module for both on-policy and off-policy algorithms, or as an isolate

11、d controller itself. Moreover, by learning to predict actions directly from back-propagation through self-imitation 12, instead of temporal difference 13 or policy gradient 14,15,16,17, the data effi ciency is greatly improved. The contributions of this work lie in three aspects:(1)We introduce the

12、state-goal space partition for multi-goal RL and thereon defi ne Policy Continuation (PC) as a new approach to such tasks. 1Code and related materials are available at 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. (2)We propose Hindsight Inverse Dynamics

13、 (HID), which extends the vanilla Inverse Dynamics method to the goal-oriented setting.(3)We further integrate PC and HID into PCHID, which can effectively leverage self-supervised learning to accelerate the process of reinforcement learning. Note that PCHID is a general method. Both on-policy and o

14、ff-policy algorithms can benefi t therefrom. We test this method on challenging RL problems, where it achieves considerably higher sample effi ciency and performance. 2Related Work Hindsight Experience ReplayLearning with sparse rewards in RL problems is always a leading challenge for the rewards ar

15、e usually uneasy to reach with random explorations. Hindsight Experience Replay (HER) which relabels the failed rollouts as successful ones is proposed by Andrychowicz et al. 11 as a method to deal with such problem. The agent in HER receives a reward when reaching either the original goal or the re

16、labeled goal in each episode by storing both original transition pairs st,g,at,rand relabeled transitionsst,g,at,rin the replay buffer. HER was later extended to work with demonstration data 4 and boosted with multi-processing training 3. The work of Rauber et al. 18 further extended the hindsight k

17、nowledge into policy gradient methods using importance sampling. Inverse DynamicsGiven a state transition pair(st,at,st+1), the inverse dynamics 19 takes (st,st+1)as the input and outputs the corresponding actionat. Previous works used inverse dynamics to perform feature extraction 20,9,21 for polic

18、y network optimization. The actions stored in such transition pairs are always collected with a random policy so that it can barely be used to optimize the policy network directly. In our work, we use hindsight experience to revise the original transition pairs in inverse dynamics, and we call this

19、approach Hindsight Inverse Dynamics. The details will be elucidated in the next section. Auxiliary Task and Curiosity Driven MethodMirowski et al. 22 propose to jointly learn the goal-driven reinforcement learning problems with an unsupervised depth prediction task and a self-supervised loop closure

20、 classifi cation task, achieving data effi ciency and task performance improvement. But their method requires extra supervision like depth input. Shelhamer et al. 21 introduce several self-supervised auxiliary tasks to perform feature extraction and adopt the learned features to reinforcement learni

21、ng, improving the data effi ciency and returns of end-to-end learning. Pathak et al. 20 propose to learn an intrinsic curiosity reward besides the normal extrinsic reward, formulated by prediction error of a visual feature space and improved the learning effi ciency. Both of the approaches belong to

22、 self-supervision and utilize inverse dynamics during training. Although our method can be used as an auxiliary task and trained in self-supervised way, we improve the vanilla inverse dynamics with hindsight, which enables direct joint training of policy networks with temporal difference and self-su

23、pervised learning. 3Policy Continuation with Hindsight Inverse Dynamics In this section we fi rst briefl y go through the preliminaries in Sec.3.1. In Sec.3.2 we retrospect a toy example introduced in HER as a motivating example. Sec.3.3 to 3.6 describe our method in detail. 3.1Preliminaries Markov

24、Decision ProcessWe consider a Markov Decision Process (MDP) denoted by a tu- ple(S,A,P,r,), whereS,A are the fi nite state and action space,Pdescribes the transition probability asS A S 0,1.r : S Ris the reward function and 0,1is the discount factor. : S A 0,1denotes a policy, and an optimal policy

25、satisfi es= argmaxEs,aP t=0 tr(st) whereat (at|st),st+1 P(st+1|at,st)and ans0is given as a start state. When transition and policy are deterministic,= argmaxEs0P t=0 tr(st) and at= (st),st+1= T (st,at), where : S Ais deterministic andTmodels the deterministic transition dynamics. The expectation is

26、over all the possible start states. Universal Value Function Approximators and Multi-Goal RLThe Universal Value Function Approximator (UVFA) 23 extends the state space of Deep Q-Networks (DQN) 24 to include goal 2 020406080100 Number of frames(1e4) 14 12 10 8 6 Reward Combine Inverse Dynamics with H

27、ER HER + 0.10ID HER + 0.05ID HER + 0.03ID HER + 0.01ID HER ID /./) /0!1!. 21 ! Figure 1: (a): Results in bit-fl ipping problem. (b): An illustration of fl at state space. (c): An example of the GridWorld domain, which is a non-fl at case. stateg Gas part of the input,i.e.,stis extended to(st,g) S G.

28、 And the policy becomes : S G at, which is pretty useful in the setting where there are multiple goals to achieve. Moreover, Schauletal.23 show that in such a setting, the learned policy can be generalized to previous unseen state-goal pairs. Our application of UVFA on Proximal Policy Optimization a

29、lgorithm (PPO) 25 is straightforward. In the following of this work, we will use state-goal pairs to denote the extended state space(s,g) S G,at= (st,g)and(st+1,g) = T (st,at). The goalg is fi xed within an episode, but changed across different episodes. 3.2Revisiting the Bit-Flipping Problem The bi

30、t-fl ipping problem was provided as a motivating example in HER 11, where there arenbits with the state spaceS = 0,1nand the action spaceA = 0,1,.0,n1. An actionacorresponds to turn thea-th bit of the state. Each episode starts with a randomly generated states0and a random goal stateg. Only when the

31、 goal stategis reached the agent will receive a reward. HER proposed to relabel the failed trajectories to receive more reward signals thus enable the policy to learn from failures. However, the method is based on temporal difference thus the effi ciency of data is limited. As we can learn from fail

32、ures, here comes the question that can we learn a policy by supervised learning where the data is generated using hindsight experience? Inspired by the self-imitate learning ability of human, we aim to employ self-imitation to learn how to get success in RL even when the original goal has not yet ac

33、hieved. A straightforward way to utilize self-imitate learning is to adopt the inverse dynamics. However, in most cases the actions stored in inverse dynamics are irrelevant to the goals. Specifi cally, transition tuples like(st,g),(st+1,g),at)are saved to learn the inverse dynamics of goal-oriented

34、 tasks. Then the learning process can be executed simply as classifi cation when action space is discrete or regression when action space is continuous. Given a neural network parameterized by , the objective of learning inverse dynamics is as follows, = argmin X st,st+1,at |f(st,g),(st+1,g) at|2.(1

35、) Due to the unawareness of the goals while the agent is taking actions, the goalsgin Eq.(1) are only placeholders. Thus, it will cost nothing to replacegwithg= m(st+1)but result in a more meaningful form,i.e., encoding the following state as a hindsight goal. That is to say, if the agent wants to r

36、eachgfromst, it should take the action ofat, thus the decision making process is aware of the hindsight goal. We adoptftrained from Eq.(1) as an additional module incorporating with HER in the Bit-fl ipping environment, by simply adding up their logit outputs. As shown in Fig.1(a), such an additiona

37、l module leads to signifi cant improvement. We attribute this success to the fl atness of the state space. Fig.1(b) illustrates such a fl atness case where an agent in a grid map is required to reach the goalg3starting froms0: if the agent has already known how to reachs1in the east, intuitively, it

38、 has no problem to extrapolate its policy to reach g3in the farther east. Nevertheless, success is not always within an effortless single step reach. Reaching the goals ofg1 andg2are relatively harder tasks, and navigating from the start point to goal point in the GridWorld domain shown in Fig.1(c)

39、is even more challenging. To further employ the self-imitate learning and overcome the single step limitation of inverse dynamics, we come up with a new approach called Policy Continuation with Hindsight Inverse Dynamics. 3 3.3Perspective of Policy Continuation on Multi-Goal RL Task Our approach is

40、mainly based on policy continuation over sub-policies, which can be viewed as an emendation of the spontaneous extrapolation in the bit-fl ipping case. Defi nition 1: Policy Continuation(PC)Suppose is a policy function defi ned on a non-empty sub-state-spaceSUof the state spaceS, i.e.,SU S. IfSVis a

41、 larger subset ofS, containingSU, i.e., SU SV and is a policy function defi ned on SVsuch that (s) = (s)s SU then we calla policy continuation of, or we can say the restriction oftoSUis the policy function . Denote the optimal policy as : (st,gt) at, we introduce the concept of k-step solvability: D

42、efi nition 2: k-Step SolvabilityGiven a state-goal pair(s,g)as a task of a certain system with deterministic dynamics, if reaching the goalgneeds at leastksteps under the optimal policy starting froms, i.e., starting froms0= sand executeai= (si,g)fori = 0,1,.,k 1, the statesk= T (sk1,ak1) satisfi es

43、m(sk) = g, we call the pair(s,g)hask-step solvability, or(s,g) is k-step solvable. Ideally the k-step solvability means the number of steps it should take fromstog, given the maximum permitted action value. In practice thek-step solvability is an evolving concept that can gradually change during the

44、 learning process, thus is defi ned as whether it can be solve withk1withink steps after the convergence of k1trained on (k-1)-step HIDs. We follow HER to assume a mappingm : S Gs.t.s Sthe reward functionr(s,m(s) = 1, thus, the information of a goalgis encoded in states. For the simplest case we hav

45、emas identical mapping and G = S where the goal g is considered as a certain state s of the system. Following the idea of recursion in curriculum learning, we can divide the fi nite state-goal space into T + 2 parts according to their k-step solvability, S G = (S G)0 (S G)1 . (S G)T (S G)U(2) where(

46、s,g) S G,T is a fi nite time-step horizon that we suppose the task should be solved within, and(S G)i,i 0,1,2,.Tdenotes the set ofi-step solvable state-goal pairs,(s,g) (S G)U denotes unsolvable state-goal pairs,i.e.,(s,g)is notk-step solvable fork 0,1,2,.,T, and (S G)0is the trivial caseg = m(s0).

47、As the optimal policy only aims to solve the solvable state-goal pairs, we can take(S G)U out of consideration. It is clear that we can defi ne a disjoint sub-state-goal space union for the solvable state-goal pairs Defi nition 3: Solvable State-Goal Space PartitionGiven a certain environment, any s

48、olvable state-goal pairs can be categorized into only one sub state-goal space by the following partition S G(S G)U= T j=0 (S G)j(3) Then, we defi ne a set of sub-policiesi,i 0,1,2,.,Ton solvable sub-state-goal space Si j=0(S G)j respectively, with the following defi nition Defi nition 4: Sub Policy

49、 on Sub Spacei is a sub-policy defi ned on the sub-state-goal space (S G)i. We say i is an optimal sub-policy if it is able to solve alli-step solvable state-goal pair tasks in i steps. Corollary 1:If iis restricted as a policy continuation ofi1fori 1,2,.k,i is able to solve anyi-step solvable probl

50、em fori k . By defi nition, the optimal policyis a policy continuation of the sub policy T, and T is already a substitute for the optimal policy . We can recursively approximateby expanding the domain of sub-state-goal space in policy continuation from an optimal sub-policy 0. While in practice, we

51、use neural networks to approximate 4 such sub-policies to do policy continuation. We propose to parameterize a policy function = f bywith neural networks and optimizefby self-supervised learning with the data collected by Hindsight Inverse Dynamics (HID) recursively and optimize iby joint optimizati

52、on. 3.4Hindsight Inverse Dynamics One-Step Hindsight Inverse DynamicsOne step HID data can be collected easily. Withn randomly rollout trajectories(s0,g),a0,r0,(s1,g),a1,.,(sT,g),aT,rTi,i 1,2,.,n, we can use a modifi ed inverse dynamics by substituting the original goalgwith hindsight goalg= m(st+1)

53、for everystand result in(s0,m(s1),a0,(s1,m(s2),a1,.,(sT1,m(sT),aT1i,i 1,2,.,n. We can then fi t f1by 1= argmin X st,st+1,at |f(st,m(st+1),(st+1,m(st+1) at|2(4) By collecting enough trajectories, we can optimizefimplemented by neural networks with stochastic gradient descent 26. Whenmis an identical

54、mapping, the functionf1is a good enough approx- imator for 1, which is guaranteed by the approximation ability of neural networks 27,28,29. Otherwise, weshouldadaptEq. 4as1= argmin P st,st+1,at|f(st,m(st+1),m(st+1)at| 2, i.e., we should omit the state information in future statest+1, to regardf1as a

55、 policy. And in practice it becomes 1= argmin P st,st+1,at|f(st,m(st+1) at| 2. Multi-Step Hindsight Inverse DynamicsOnce we havefk1, an approximator of k1, k- step HID is ready to get. We can collect validk-step HID data recursively by testing whether thek-step HID state-goal pairs indeed needksteps

56、 to solve, i.e., for anyk-step transitions (st,g),at,rt,.,(st+k,g),at+k,rt+k, if our policy k1 at hand can not provide with an- other solution from(st,m(st+k)to(st+k,m(st+k)in less thanksteps, the state-goal pair (st,m(st+k)must bek-step solvable, and this pair together with the actionatwill be mark

57、ed as(s(k) t ,m(s(k) t+k),a (k) t . Fig.2 illustrates this process. The testing process is based on a function TEST()and we will focus on the selection ofTESTin Sec.3.6. Transition pairs like this will be collected to optimizek. In practice, we leverage joint training to ensurefkto be a policy conti

58、nuation of i,i 1,.,k i.e., k= argmin X s(i) t ,s(i) t+i,a (i) t ,i1,.,k |f(st,m(st+i),(st+i,m(st+i) at|2(5) 3.5Dynamic Programming Formulation For most goal-oriented tasks, the learning objective is to fi nd a policy to reach the goal as soon as possible. In such circumstances, L(st,g) = L(st+1,g) + 1(6) whereL(s,g) here is defi ned as the number of steps to be executed fromstogwith policyand 1 is the additional L (s t,g) = L (s t+1

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

最新文档

评论

0/150

提交评论