




The Option Keyboard: Combining Skills in Reinforcement Learning

André Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, Doina Precup
{andrebarreto, borsa, shaobohou, gcomanici, eser, hamelphi, kenjitoyama, jjhunt, shibl, davidsilver, doinap}
DeepMind

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

The ability to combine known skills to create new ones may be crucial in the solution of complex reinforcement learning problems that unfold over extended periods. We argue that a robust way of combining skills is to define and manipulate them in the space of pseudo-rewards (or “cumulants”). Based on this premise, we propose a framework for combining skills using the formalism of options. We show that every deterministic option can be unambiguously represented as a cumulant defined in an extended domain. Building on this insight and on previous results on transfer learning, we show how to approximate options whose cumulants are linear combinations of the cumulants of known options. This means that, once we have learned options associated with a set of cumulants, we can instantaneously synthesise options induced by any linear combination of them, without any learning involved. We describe how this framework provides a hierarchical interface to the environment whose abstract actions correspond to combinations of basic skills. We demonstrate the practical benefits of our approach in a resource management problem and a navigation task involving a quadrupedal simulated robot.

1 Introduction

In reinforcement learning (RL) an agent takes actions in an environment in order to maximise the amount of reward received in the long run [25]. This textbook definition of RL treats actions as atomic decisions made by the agent at every time step.
Recently, Sutton [23] proposed a new view on action selection. In order to illustrate the potential benefits of his proposal Sutton resorts to the following analogy. Imagine that the interface between agent and environment is a piano keyboard, with each key corresponding to a possible action. Conventionally the agent plays one key at a time and each note lasts exactly one unit of time. If we expect our agents to do something akin to playing music, we must generalise this interface in two ways. First, we ought to allow notes to be arbitrarily long; that is, we must replace actions with skills. Second, we should be able to also play chords.

The argument in favour of temporally-extended courses of actions has repeatedly been made in the literature: in fact, the notion that agents should be able to reason at multiple temporal scales is one of the pillars of hierarchical RL [7, 18, 26, 8, 17]. The insight that the agent should have the ability to combine the resulting skills is a far less explored idea. This is the focus of the current work.

The possibility of combining skills replaces a monolithic action set with a combinatorial counterpart: by learning a small set of basic skills (“keys”) the agent should be able to perform a potentially very large number of combined skills (“chords”). For example, an agent that can both walk and grasp an object should be able to walk while grasping an object without having to learn a new skill. According to Sutton [23], this combinatorial action selection process “could be the key to generating behaviour with a good mix of preplanned coherence and sensitivity to the current situation.”
But how exactly should one combine skills? One possibility is to combine them in the space of policies: for example, if we look at policies as distributions over actions, a combination of skills can be defined as a mixture of the corresponding distributions. One can also combine parametric policies by manipulating the corresponding parameters. Although these are feasible solutions, they fail to capture possible intentions behind the skills. Suppose the agent is able to perform two skills that can be associated with the same objective: distinct ways of grasping an object, say. It is not difficult to see how combinations of the corresponding behaviours can completely fail to accomplish the common goal. We argue that a more robust way of combining skills is to do so directly in the goal space, using pseudo-rewards or cumulants [25]. If we associate each skill with a cumulant, we can combine the former by manipulating the latter. This allows us to go beyond the direct prescription of behaviours, working instead in the space of intentions.

Combining skills in the space of cumulants poses two challenges. First, we must establish a well-defined mapping between cumulants and skills. Second, once a combined cumulant is defined, we must be able to perform the associated skill without having to go through the slow process of learning it. We propose to tackle the former by adopting options as our formalism to define skills [26]. We show that there is a large subset of the space of options, composed of deterministic options, in which every element can be unambiguously represented as a cumulant defined in an extended domain. Building on this insight, we extend Barreto et al.'s [3, 4] previous results on transfer learning to show how to approximate options whose cumulants are linear combinations of the cumulants of known options. This means that, once the agent has learned options associated with a collection of cumulants, it can instantaneously synthesise options induced by any linear combination of them, without any learning involved. Thus, by learning a small set of options, the agent instantaneously has at its disposal a potentially enormous number of combined options. Since we are combining cumulants, and not policies, the resulting options will be truly novel, meaning that they cannot, in general, be directly implemented as a simple alternation of their constituents.

We describe how our framework provides a flexible interface with the environment whose abstract actions correspond to combinations of basic skills. As a reference to the motivating analogy described above, we call this interface the option keyboard. We discuss the merits of the option keyboard at the conceptual level and demonstrate its practical benefits in two experiments: a resource management problem and a realistic navigation task involving a quadrupedal robot simulated in MuJoCo [30, 21].
2 Background

As usual, we assume the interaction between agent and environment can be modelled as a Markov decision process (MDP) [19]. An MDP is a tuple $M \equiv (\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $p(\cdot \mid s,a)$ gives the next-state distribution upon taking action $a$ in $s$, $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto \mathbb{R}$ specifies the reward associated with the transition $s \xrightarrow{a} s'$, and $\gamma \in [0,1)$ is the discount factor. The objective of the agent is to find a policy $\pi : \mathcal{S} \mapsto \mathcal{A}$ that maximises the expected return $G_t \equiv \sum_{i=0}^{\infty} \gamma^i R_{t+i}$, where $R_t = r(S_t, A_t, S_{t+1})$. A principled way to address this problem is to use methods derived from dynamic programming, which usually compute the action-value function of a policy $\pi$ as $Q^\pi(s,a) \equiv E^\pi[G_t \mid S_t = s, A_t = a]$, where $E^\pi$ denotes expectation over the transitions induced by $\pi$ [19]. The computation of $Q^\pi(s,a)$ is called policy evaluation. Once $\pi$ has been evaluated, we can compute a greedy policy

$$\pi'(s) \in \operatorname*{argmax}_a Q^\pi(s,a) \;\text{ for all } s \in \mathcal{S}. \quad (1)$$

It can be shown that $Q^{\pi'}(s,a) \ge Q^\pi(s,a)$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, and hence the computation of $\pi'$ is referred to as policy improvement. The alternation between policy evaluation and policy improvement is at the core of many RL algorithms, which usually carry out these steps approximately. Here we will use a tilde over a symbol to indicate that the associated quantity is an approximation (e.g., $\tilde{Q}^\pi \approx Q^\pi$).
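To make the evaluation-improvement cycle concrete, here is a minimal tabular sketch (ours, not the paper's; it assumes the MDP is given as dense arrays P[s, a, s'] and R[s, a, s']): it evaluates a fixed deterministic policy by iterating the Bellman expectation backup and then extracts the greedy policy of equation (1).

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    """Tabular policy evaluation by iterating the Bellman expectation backup.

    P[s, a, s'] : transition probabilities, R[s, a, s'] : rewards,
    pi[s]       : deterministic policy (an action index per state).
    Returns Q[s, a], the action-value function of pi.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        V = Q[np.arange(n_states), pi]          # V(s') = Q(s', pi(s'))
        Q_new = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

def greedy_policy(Q):
    """Policy improvement, equation (1): act greedily with respect to Q."""
    return np.argmax(Q, axis=1)
```

Alternating these two functions is the policy-iteration loop that many RL algorithms carry out approximately.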
2.1 Generalising policy evaluation and policy improvement

Following Sutton and Barto [25], we call any signal defined as $c : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto \mathbb{R}$ a cumulant. Analogously to the conventional value function $Q^\pi$, we define $Q^\pi_c$ as the expected discounted sum of cumulant $c$ under policy $\pi$ [27]. Given a policy $\pi$ and a set of cumulants $\mathcal{C}$, we call the evaluation of $\pi$ under all $c \in \mathcal{C}$ generalised policy evaluation (GPE) [2]. Barreto et al. [3, 4] propose an efficient form of GPE based on successor features: they show that, given cumulants $c_1, c_2, \ldots, c_d$, for any $c = \sum_i w_i c_i$ with $w \in \mathbb{R}^d$,

$$Q^\pi_c(s,a) = E^\pi\!\left[\sum_{k=0}^{\infty} \gamma^k \sum_{i=1}^{d} w_i C_{i,t+k} \,\middle|\, S_t = s, A_t = a\right] = \sum_{i=1}^{d} w_i Q^\pi_{c_i}(s,a), \quad (2)$$

where $C_{i,t} \equiv c_i(S_t, A_t, S_{t+1})$. Thus, once we have computed $Q^\pi_{c_1}, Q^\pi_{c_2}, \ldots, Q^\pi_{c_d}$, we can instantaneously evaluate $\pi$ under any cumulant in the set $\mathcal{C} \equiv \{c = \sum_i w_i c_i \mid w \in \mathbb{R}^d\}$.
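Equation (2) is what makes skill synthesis instantaneous: evaluating $\pi$ under a new cumulant $c = \sum_i w_i c_i$ reduces to a weighted sum of precomputed value functions. A minimal sketch of this step (our own notation; it assumes the $Q^\pi_{c_i}$ have already been learned and are stacked in a single array):

```python
import numpy as np

def gpe(Q_basis, w):
    """Generalised policy evaluation via successor features, equation (2).

    Q_basis[i, s, a] : action-values of a fixed policy pi under cumulant c_i.
    w[i]             : weights of the combined cumulant c = sum_i w_i c_i.
    Returns Q_c[s, a] = sum_i w_i * Q_basis[i, s, a].
    """
    return np.tensordot(w, Q_basis, axes=1)
```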
Policy improvement can also be generalised. In Barreto et al.'s [3] generalised policy improvement (GPI) the improved policy is computed based on a set of value functions. Let $Q^{\pi_1}_c, Q^{\pi_2}_c, \ldots, Q^{\pi_n}_c$ be the action-value functions of $n$ policies $\pi_i$ under cumulant $c$, and let $Q^{\max}_c(s,a) = \max_i Q^{\pi_i}_c(s,a)$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$. If we define

$$\pi(s) \in \operatorname*{argmax}_a Q^{\max}_c(s,a) \;\text{ for all } s \in \mathcal{S}, \quad (3)$$

then $Q^\pi_c(s,a) \ge Q^{\max}_c(s,a)$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$. This is a strict generalisation of standard policy improvement (1). The guarantee extends to the case in which GPI uses approximations $\tilde{Q}^{\pi_i}_c$ [3].
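GPE and GPI compose naturally: given value functions for several policies under several base cumulants, an action for a new weight vector $w$ can be selected in a single pass. The sketch below is our own illustration with hypothetical array names; it performs the combination of equation (2) followed by the maximisation of equation (3) at a single state:

```python
import numpy as np

def gpe_gpi_action(Q_basis_s, w):
    """Pick an action at a single state s by GPE followed by GPI.

    Q_basis_s[j, i, a] : value of action a at state s under policy pi_j
                         and cumulant c_i, i.e. Q^{pi_j}_{c_i}(s, a).
    w[i]               : weights defining the combined cumulant.
    """
    # GPE (eq. 2): evaluate every policy under the combined cumulant.
    Q_w = np.tensordot(Q_basis_s, w, axes=([1], [0]))  # shape (n_policies, n_actions)
    # GPI (eq. 3): act greedily with respect to the pointwise maximum over policies.
    return int(np.argmax(Q_w.max(axis=0)))
```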
2.2 Temporal abstraction via options

As discussed in the introduction, one way to get temporal abstraction is through the concept of options [26]. Options are temporally-extended courses of actions. In their more general formulation, options can depend on the entire history between the time $t$ when they were initiated and the current time step $t+k$, $h_{t:t+k} \equiv s_t a_t s_{t+1} \ldots a_{t+k-1} s_{t+k}$. Let $\mathcal{H}$ be the space of all possible histories; a semi-Markov option is a tuple $o \equiv (\mathcal{I}_o, \pi_o, \beta_o)$ where $\mathcal{I}_o \subseteq \mathcal{S}$ is the set of states where the option can be initiated, $\pi_o : \mathcal{H} \mapsto \mathcal{A}$ is a policy over histories, and $\beta_o : \mathcal{H} \mapsto [0,1]$ gives the probability that the option terminates after history $h$ has been observed [26]. It is worth emphasising that semi-Markov options depend on the history since their initiation, but not before.
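The tuple $(\mathcal{I}_o, \pi_o, \beta_o)$ maps directly onto a small data structure. The sketch below is ours (with histories represented as plain Python lists), not code from the paper:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

History = List[Any]  # s_t, a_t, s_{t+1}, ..., a_{t+k-1}, s_{t+k}

@dataclass
class SemiMarkovOption:
    """A semi-Markov option o = (I_o, pi_o, beta_o)."""
    initiation: Callable[[Any], bool]        # I_o: can the option start in state s?
    policy: Callable[[History], Any]         # pi_o: action given the history since initiation
    termination: Callable[[History], float]  # beta_o: probability of terminating after history h
```

A Markov option is the special case in which the policy and the termination condition only look at the last state of the history.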
3 Combining options

In the previous section we discussed how several key concepts in RL can be generalised: rewards with cumulants, policy evaluation with GPE, policy improvement with GPI, and actions with options. In this section we discuss how these concepts can be used to combine skills.

3.1 The relation between options and cumulants

We start by showing that there is a subset of the space of options in which every option can be unequivocally represented as a cumulant defined in an extended domain. First we look at the relation between policies and cumulants. Given an MDP $(\mathcal{S}, \mathcal{A}, p, \gamma)$, we say that a cumulant $c : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \mapsto \mathbb{R}$ induces a policy $\pi : \mathcal{S} \mapsto \mathcal{A}$ if $\pi$ is optimal for the MDP $(\mathcal{S}, \mathcal{A}, p, c, \gamma)$. We can always define a cumulant $c$ that induces a given policy $\pi$. For instance, we can make

$$c(s, a, \cdot) = \begin{cases} 0 & \text{if } a = \pi(s); \\ z & \text{otherwise,} \end{cases} \quad (4)$$

where $z < 0$.
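Equation (4) can be written down directly: following $\pi$ accumulates zero cumulant, while any deviation is penalised, so $\pi$ is optimal for the induced MDP. A small sketch of this construction (our illustration; `pi` is assumed to be a callable mapping states to actions):

```python
def cumulant_inducing(pi, z=-1.0):
    """Build the cumulant of equation (4) for a given deterministic policy pi.

    The returned cumulant is 0 whenever the agent takes pi's action and z < 0
    otherwise, so pi is optimal for the MDP (S, A, p, c, gamma).
    """
    assert z < 0, "z must be negative for pi to be the induced (optimal) policy"
    def c(s, a, s_next=None):  # the cumulant ignores the next state
        return 0.0 if a == pi(s) else z
    return c
```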
[...]

Combined options associated with more general instantiations of $w$ would correspond to “chords”. We will thus call the layer of temporal abstraction between algorithm and environment the option keyboard (OK). We will also generically refer to the RL method interacting with OK as the “player”. Figure 1 shows how an RL agent can be broken into a player and an OK.

Algorithm 1 shows a generic implementation of OK. Given a set of value functions $Q^E$ and a vector of weights $w$, OK will execute the actions selected by GPE and GPI until the termination action is picked or a terminal state is reached. During this process OK keeps track of the discounted reward accumulated in the interaction with the environment (line 6), which will be returned to the player when the interaction terminates (line 10). As the options $e_i$ may depend on the entire trajectory since their initiation, OK uses an update function $u(h, a, s')$ that retains the parts of the history that are actually relevant for decision making (line 8). For example, if OK is based on Markov options only, one can simply use the update function $u(h, a, s') = s'$.
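The description of Algorithm 1 can be paraphrased as a short control loop. The sketch below is our own reading of that description, not the paper's pseudocode: `q_values(h, w)` stands in for the GPE+GPI computation over $Q^E$, `u` is the history-update function, and the environment interface is hypothetical.

```python
def run_ok(env, s, q_values, u, w, gamma, terminate_action):
    """Execute one abstract action w through the option keyboard (OK).

    q_values(h, w) -> dict mapping primitive action -> value (GPE + GPI over Q^E).
    u(h, a, s_next) -> the (possibly truncated) history kept for decision making.
    Returns the discounted reward accumulated while w was active, the compounded
    discount gamma^k, the state where control is handed back, and a done flag.
    """
    h, g, discount = s, 0.0, 1.0
    while True:
        q = q_values(h, w)
        a = max(q, key=q.get)              # GPE + GPI action selection
        if a == terminate_action:
            return g, discount, s, False   # the option chose to terminate
        s, r, done = env.step(a)           # hypothetical environment interface
        g += discount * r                  # accumulate discounted reward (cf. line 6)
        discount *= gamma
        if done:
            return g, discount, s, True    # terminal state reached
        h = u(h, a, s)                     # keep only the relevant history (cf. line 8)
```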
The set $Q^E$ defines a specific instantiation of OK; once an OK is in place, any conventional RL method can interact with it as if it were the environment. As an illustration, Algorithm 2 shows how a keyboard player that uses a finite set of combined options $\mathcal{W} \equiv \{w_1, w_2, \ldots, w_n\}$ can be implemented using standard Q-learning by simply replacing the environment with OK. It is worth pointing out that if we substitute any other set of weight vectors $\mathcal{W}'$ for $\mathcal{W}$ we can still use the same OK, without the need to relearn the value functions in $Q^E$. We can even use sets of abstract actions $\mathcal{W}$ that are infinite, as long as the OK player can deal with continuous action spaces [33, 24, 22]. Although th
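A keyboard player in the spirit of Algorithm 2 treats each weight vector in $\mathcal{W}$ as one abstract action and feeds the return reported by OK into a standard Q-learning update. The sketch below is our own minimal tabular version under those assumptions; `ok_step` wraps a call such as `run_ok` above, and the environment interface is hypothetical.

```python
import random
import numpy as np

def q_learning_player(env, ok_step, W, n_states, alpha=0.1, epsilon=0.1,
                      episodes=1000):
    """A keyboard player: tabular Q-learning over abstract actions w in W.

    ok_step(s, w) -> (g, discount, s_next, done): runs w through the OK and
    returns the discounted reward g and compounded discount gamma^k collected
    while w was active (see run_ok above).
    """
    Q = np.zeros((n_states, len(W)))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            j = (random.randrange(len(W)) if random.random() < epsilon
                 else int(np.argmax(Q[s])))          # epsilon-greedy over W
            g, discount, s_next, done = ok_step(s, W[j])
            # SMDP-style update: bootstrap with the compounded discount gamma^k.
            target = g + (0.0 if done else discount * np.max(Q[s_next]))
            Q[s, j] += alpha * (target - Q[s, j])
            s = s_next
    return Q
```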