版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Policy Continuation with Hindsight Inverse Dynamics Hao Sun1, Zhizhong Li1, Xiaotong Liu2, Dahua Lin1, Bolei Zhou1 1The Chinese University of Hong Kong,2Peking University Abstract Solving goal-oriented tasks is an important but challenging problem in reinforce- ment learning (RL). For such tasks, th
2、e rewards are often sparse, making it diffi cult to learn a policy effectively. To tackle this diffi culty, we propose a new approach called Policy Continuation with Hindsight Inverse Dynamics (PCHID). This ap- proach learns from Hindsight Inverse Dynamics based on Hindsight Experience Replay. Enabl
3、ing the learning process in a self-imitated manner and thus can be trained with supervised learning. This work also extends it to multi-step settings with Policy Continuation. The proposed method is general, which can work in isolation or be combined with other on-policy and off-policy algorithms. O
4、n two multi-goal tasks GridWorld and FetchReach, PCHID signifi cantly improves the sample effi ciency as well as the fi nal performance1. 1Introduction Imagine you are given the task of Tower of Hanoi with ten disks, what would you probably do to solve this complex problem? This game looks daunting
5、at the fi rst glance. However through trials and errors, one may discover the key is to recursively relocate the disks on the top of the stack from one pod to another, assisted by an intermediate one. In this case, you are actually learning skills from easier sub-tasks and those skills help you to l
6、earn more. This case exemplifi es the procedure of self-imitated curriculum learning, which recursively develops the skills of solving more complex problems. Tower of Hanoi belongs to an important kind of challenging problems in Reinforcement Learning (RL), namely solving the goal-oriented tasks. In
7、 such tasks, rewards are usually very sparse. For example, in many goal-oriented tasks, a single binary reward is provided only when the task is completed 1,2,3 . Previous works attribute the diffi culty of the sparse reward problems to the low effi ciency in experience collection 4. Thus many appro
8、aches have been proposed to tackle this prob- lem, including automatic goal generation 5, self-imitation learning 6, hierarchical reinforcement learning 7, curiosity driven methods 8,9, curriculum learning 1,10, and Hindsight Experience Replay (HER) 11. Most of these works guide the agent by demonst
9、rating on successful choices based on suffi cient exploration to improve learning effi ciency. Differently, HER opens up a new way to learn more from failures, assigning hindsight credit to primal experiences. However, it is limited by only applicable when combined with off-policy algorithms3. In th
10、is paper we propose an approach of goal-oriented RL called Policy Continuation with Hindsight Inverse Dynamics (PCHID), which leverages the key idea of self-imitate learning. In contrast to HER, our method can work as an auxiliary module for both on-policy and off-policy algorithms, or as an isolate
11、d controller itself. Moreover, by learning to predict actions directly from back-propagation through self-imitation 12, instead of temporal difference 13 or policy gradient 14,15,16,17, the data effi ciency is greatly improved. The contributions of this work lie in three aspects:(1)We introduce the
12、state-goal space partition for multi-goal RL and thereon defi ne Policy Continuation (PC) as a new approach to such tasks. 1Code and related materials are available at 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. (2)We propose Hindsight Inverse Dynamics
13、 (HID), which extends the vanilla Inverse Dynamics method to the goal-oriented setting.(3)We further integrate PC and HID into PCHID, which can effectively leverage self-supervised learning to accelerate the process of reinforcement learning. Note that PCHID is a general method. Both on-policy and o
14、ff-policy algorithms can benefi t therefrom. We test this method on challenging RL problems, where it achieves considerably higher sample effi ciency and performance. 2Related Work Hindsight Experience ReplayLearning with sparse rewards in RL problems is always a leading challenge for the rewards ar
15、e usually uneasy to reach with random explorations. Hindsight Experience Replay (HER) which relabels the failed rollouts as successful ones is proposed by Andrychowicz et al. 11 as a method to deal with such problem. The agent in HER receives a reward when reaching either the original goal or the re
16、labeled goal in each episode by storing both original transition pairs st,g,at,rand relabeled transitionsst,g,at,rin the replay buffer. HER was later extended to work with demonstration data 4 and boosted with multi-processing training 3. The work of Rauber et al. 18 further extended the hindsight k
17、nowledge into policy gradient methods using importance sampling. Inverse DynamicsGiven a state transition pair(st,at,st+1), the inverse dynamics 19 takes (st,st+1)as the input and outputs the corresponding actionat. Previous works used inverse dynamics to perform feature extraction 20,9,21 for polic
18、y network optimization. The actions stored in such transition pairs are always collected with a random policy so that it can barely be used to optimize the policy network directly. In our work, we use hindsight experience to revise the original transition pairs in inverse dynamics, and we call this
19、approach Hindsight Inverse Dynamics. The details will be elucidated in the next section. Auxiliary Task and Curiosity Driven MethodMirowski et al. 22 propose to jointly learn the goal-driven reinforcement learning problems with an unsupervised depth prediction task and a self-supervised loop closure
20、 classifi cation task, achieving data effi ciency and task performance improvement. But their method requires extra supervision like depth input. Shelhamer et al. 21 introduce several self-supervised auxiliary tasks to perform feature extraction and adopt the learned features to reinforcement learni
21、ng, improving the data effi ciency and returns of end-to-end learning. Pathak et al. 20 propose to learn an intrinsic curiosity reward besides the normal extrinsic reward, formulated by prediction error of a visual feature space and improved the learning effi ciency. Both of the approaches belong to
22、 self-supervision and utilize inverse dynamics during training. Although our method can be used as an auxiliary task and trained in self-supervised way, we improve the vanilla inverse dynamics with hindsight, which enables direct joint training of policy networks with temporal difference and self-su
23、pervised learning. 3Policy Continuation with Hindsight Inverse Dynamics In this section we fi rst briefl y go through the preliminaries in Sec.3.1. In Sec.3.2 we retrospect a toy example introduced in HER as a motivating example. Sec.3.3 to 3.6 describe our method in detail. 3.1Preliminaries Markov
24、Decision ProcessWe consider a Markov Decision Process (MDP) denoted by a tu- ple(S,A,P,r,), whereS,A are the fi nite state and action space,Pdescribes the transition probability asS A S 0,1.r : S Ris the reward function and 0,1is the discount factor. : S A 0,1denotes a policy, and an optimal policy
25、satisfi es= argmaxEs,aP t=0 tr(st) whereat (at|st),st+1 P(st+1|at,st)and ans0is given as a start state. When transition and policy are deterministic,= argmaxEs0P t=0 tr(st) and at= (st),st+1= T (st,at), where : S Ais deterministic andTmodels the deterministic transition dynamics. The expectation is
26、over all the possible start states. Universal Value Function Approximators and Multi-Goal RLThe Universal Value Function Approximator (UVFA) 23 extends the state space of Deep Q-Networks (DQN) 24 to include goal 2 020406080100 Number of frames(1e4) 14 12 10 8 6 Reward Combine Inverse Dynamics with H
27、ER HER + 0.10ID HER + 0.05ID HER + 0.03ID HER + 0.01ID HER ID /./) /0!1!. 21 ! Figure 1: (a): Results in bit-fl ipping problem. (b): An illustration of fl at state space. (c): An example of the GridWorld domain, which is a non-fl at case. stateg Gas part of the input,i.e.,stis extended to(st,g) S G.
28、 And the policy becomes : S G at, which is pretty useful in the setting where there are multiple goals to achieve. Moreover, Schauletal.23 show that in such a setting, the learned policy can be generalized to previous unseen state-goal pairs. Our application of UVFA on Proximal Policy Optimization a
29、lgorithm (PPO) 25 is straightforward. In the following of this work, we will use state-goal pairs to denote the extended state space(s,g) S G,at= (st,g)and(st+1,g) = T (st,at). The goalg is fi xed within an episode, but changed across different episodes. 3.2Revisiting the Bit-Flipping Problem The bi
30、t-fl ipping problem was provided as a motivating example in HER 11, where there arenbits with the state spaceS = 0,1nand the action spaceA = 0,1,.0,n1. An actionacorresponds to turn thea-th bit of the state. Each episode starts with a randomly generated states0and a random goal stateg. Only when the
31、 goal stategis reached the agent will receive a reward. HER proposed to relabel the failed trajectories to receive more reward signals thus enable the policy to learn from failures. However, the method is based on temporal difference thus the effi ciency of data is limited. As we can learn from fail
32、ures, here comes the question that can we learn a policy by supervised learning where the data is generated using hindsight experience? Inspired by the self-imitate learning ability of human, we aim to employ self-imitation to learn how to get success in RL even when the original goal has not yet ac
33、hieved. A straightforward way to utilize self-imitate learning is to adopt the inverse dynamics. However, in most cases the actions stored in inverse dynamics are irrelevant to the goals. Specifi cally, transition tuples like(st,g),(st+1,g),at)are saved to learn the inverse dynamics of goal-oriented
34、 tasks. Then the learning process can be executed simply as classifi cation when action space is discrete or regression when action space is continuous. Given a neural network parameterized by , the objective of learning inverse dynamics is as follows, = argmin X st,st+1,at |f(st,g),(st+1,g) at|2.(1
35、) Due to the unawareness of the goals while the agent is taking actions, the goalsgin Eq.(1) are only placeholders. Thus, it will cost nothing to replacegwithg= m(st+1)but result in a more meaningful form,i.e., encoding the following state as a hindsight goal. That is to say, if the agent wants to r
36、eachgfromst, it should take the action ofat, thus the decision making process is aware of the hindsight goal. We adoptftrained from Eq.(1) as an additional module incorporating with HER in the Bit-fl ipping environment, by simply adding up their logit outputs. As shown in Fig.1(a), such an additiona
37、l module leads to signifi cant improvement. We attribute this success to the fl atness of the state space. Fig.1(b) illustrates such a fl atness case where an agent in a grid map is required to reach the goalg3starting froms0: if the agent has already known how to reachs1in the east, intuitively, it
38、 has no problem to extrapolate its policy to reach g3in the farther east. Nevertheless, success is not always within an effortless single step reach. Reaching the goals ofg1 andg2are relatively harder tasks, and navigating from the start point to goal point in the GridWorld domain shown in Fig.1(c)
39、is even more challenging. To further employ the self-imitate learning and overcome the single step limitation of inverse dynamics, we come up with a new approach called Policy Continuation with Hindsight Inverse Dynamics. 3 3.3Perspective of Policy Continuation on Multi-Goal RL Task Our approach is
40、mainly based on policy continuation over sub-policies, which can be viewed as an emendation of the spontaneous extrapolation in the bit-fl ipping case. Defi nition 1: Policy Continuation(PC)Suppose is a policy function defi ned on a non-empty sub-state-spaceSUof the state spaceS, i.e.,SU S. IfSVis a
41、 larger subset ofS, containingSU, i.e., SU SV and is a policy function defi ned on SVsuch that (s) = (s)s SU then we calla policy continuation of, or we can say the restriction oftoSUis the policy function . Denote the optimal policy as : (st,gt) at, we introduce the concept of k-step solvability: D
42、efi nition 2: k-Step SolvabilityGiven a state-goal pair(s,g)as a task of a certain system with deterministic dynamics, if reaching the goalgneeds at leastksteps under the optimal policy starting froms, i.e., starting froms0= sand executeai= (si,g)fori = 0,1,.,k 1, the statesk= T (sk1,ak1) satisfi es
43、m(sk) = g, we call the pair(s,g)hask-step solvability, or(s,g) is k-step solvable. Ideally the k-step solvability means the number of steps it should take fromstog, given the maximum permitted action value. In practice thek-step solvability is an evolving concept that can gradually change during the
44、 learning process, thus is defi ned as whether it can be solve withk1withink steps after the convergence of k1trained on (k-1)-step HIDs. We follow HER to assume a mappingm : S Gs.t.s Sthe reward functionr(s,m(s) = 1, thus, the information of a goalgis encoded in states. For the simplest case we hav
45、emas identical mapping and G = S where the goal g is considered as a certain state s of the system. Following the idea of recursion in curriculum learning, we can divide the fi nite state-goal space into T + 2 parts according to their k-step solvability, S G = (S G)0 (S G)1 . (S G)T (S G)U(2) where(
46、s,g) S G,T is a fi nite time-step horizon that we suppose the task should be solved within, and(S G)i,i 0,1,2,.Tdenotes the set ofi-step solvable state-goal pairs,(s,g) (S G)U denotes unsolvable state-goal pairs,i.e.,(s,g)is notk-step solvable fork 0,1,2,.,T, and (S G)0is the trivial caseg = m(s0).
47、As the optimal policy only aims to solve the solvable state-goal pairs, we can take(S G)U out of consideration. It is clear that we can defi ne a disjoint sub-state-goal space union for the solvable state-goal pairs Defi nition 3: Solvable State-Goal Space PartitionGiven a certain environment, any s
48、olvable state-goal pairs can be categorized into only one sub state-goal space by the following partition S G(S G)U= T j=0 (S G)j(3) Then, we defi ne a set of sub-policiesi,i 0,1,2,.,Ton solvable sub-state-goal space Si j=0(S G)j respectively, with the following defi nition Defi nition 4: Sub Policy
49、 on Sub Spacei is a sub-policy defi ned on the sub-state-goal space (S G)i. We say i is an optimal sub-policy if it is able to solve alli-step solvable state-goal pair tasks in i steps. Corollary 1:If iis restricted as a policy continuation ofi1fori 1,2,.k,i is able to solve anyi-step solvable probl
50、em fori k . By defi nition, the optimal policyis a policy continuation of the sub policy T, and T is already a substitute for the optimal policy . We can recursively approximateby expanding the domain of sub-state-goal space in policy continuation from an optimal sub-policy 0. While in practice, we
51、use neural networks to approximate 4 such sub-policies to do policy continuation. We propose to parameterize a policy function = f bywith neural networks and optimizefby self-supervised learning with the data collected by Hindsight Inverse Dynamics (HID) recursively and optimize iby joint optimizati
52、on. 3.4Hindsight Inverse Dynamics One-Step Hindsight Inverse DynamicsOne step HID data can be collected easily. Withn randomly rollout trajectories(s0,g),a0,r0,(s1,g),a1,.,(sT,g),aT,rTi,i 1,2,.,n, we can use a modifi ed inverse dynamics by substituting the original goalgwith hindsight goalg= m(st+1)
53、for everystand result in(s0,m(s1),a0,(s1,m(s2),a1,.,(sT1,m(sT),aT1i,i 1,2,.,n. We can then fi t f1by 1= argmin X st,st+1,at |f(st,m(st+1),(st+1,m(st+1) at|2(4) By collecting enough trajectories, we can optimizefimplemented by neural networks with stochastic gradient descent 26. Whenmis an identical
54、mapping, the functionf1is a good enough approx- imator for 1, which is guaranteed by the approximation ability of neural networks 27,28,29. Otherwise, weshouldadaptEq. 4as1= argmin P st,st+1,at|f(st,m(st+1),m(st+1)at| 2, i.e., we should omit the state information in future statest+1, to regardf1as a
55、 policy. And in practice it becomes 1= argmin P st,st+1,at|f(st,m(st+1) at| 2. Multi-Step Hindsight Inverse DynamicsOnce we havefk1, an approximator of k1, k- step HID is ready to get. We can collect validk-step HID data recursively by testing whether thek-step HID state-goal pairs indeed needksteps
56、 to solve, i.e., for anyk-step transitions (st,g),at,rt,.,(st+k,g),at+k,rt+k, if our policy k1 at hand can not provide with an- other solution from(st,m(st+k)to(st+k,m(st+k)in less thanksteps, the state-goal pair (st,m(st+k)must bek-step solvable, and this pair together with the actionatwill be mark
57、ed as(s(k) t ,m(s(k) t+k),a (k) t . Fig.2 illustrates this process. The testing process is based on a function TEST()and we will focus on the selection ofTESTin Sec.3.6. Transition pairs like this will be collected to optimizek. In practice, we leverage joint training to ensurefkto be a policy conti
58、nuation of i,i 1,.,k i.e., k= argmin X s(i) t ,s(i) t+i,a (i) t ,i1,.,k |f(st,m(st+i),(st+i,m(st+i) at|2(5) 3.5Dynamic Programming Formulation For most goal-oriented tasks, the learning objective is to fi nd a policy to reach the goal as soon as possible. In such circumstances, L(st,g) = L(st+1,g) + 1(6) whereL(s,g) here is defi ned as the number of steps to be executed fromstogwith policyand 1 is the additional L (s t,g) = L (s t+1
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2026年浙江高考首考化学试卷真题解析及答案详解
- 盐业生产卤水净化调试技师(中级)考试试卷及答案
- 研磨细度检测工程师岗位招聘考试试卷及答案
- 压延成型辊距调试技师岗位招聘考试试卷及答案
- 2025年云南省蒙自市高二生物下册期末考试考试卷附参考答案(能力提升)
- 2026年河南省灵宝市高二生物下册期末考试检测卷附答案【A卷】
- 2026年云南省芒市高二生物下册期末考试测试卷及答案1套
- 2026年山东省高密市高二生物下册期末考试检测卷附答案(典型题)
- 2025年黑龙江省尚志市高二生物下册期末考试模拟卷带答案(综合题)
- 2026年江苏省丹阳市高二生物下册期末考试测试卷附参考答案【模拟题】
- 2026年全国高考语文(全国Ⅰ卷)真题及答案
- 2026年7月自考13996旅游接待业押题及答案
- 2026春西师大版小学数学四年级下册期末综合测试卷含答案
- IATF16949 五大核心工具综合培训(APQP-FMEA-SPC-MSA-PPAP)
- 2026年(春新版)道德与法治二年级下册1-4单元全套试卷
- 浙江省绍兴市柯桥区2024-2025学年七年级下学期期末数学试卷(含答案)
- 初中七年级道德与法治下册《让和声更美-集体生活中的个人与规则》教学设计
- DL-T-1878-2018燃煤电厂储煤场盘点导则
- JT-T-1202-2018城市公共汽电车场站配置规范
- 2025届河南省郑州市外国语高中物理高一第二学期期末统考试题含解析
- 文艺复兴经典名著选读智慧树知到期末考试答案章节答案2024年北京大学
评论
0/150
提交评论