版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Explicit Explore-Exploit Algorithms in Continuous State Spaces Mikael Henaff Microsoft Research Abstract We present a new model-based algorithm for reinforcement learning (RL) which consists of explicit exploration and exploitation phases, and is applicable in large or infi nite state spaces. The al
2、gorithm maintains a set of dynamics models consistent with current experience and explores by fi nding policies which induce high dis- agreement between their state predictions. It then exploits using the refi ned set of models or experience gathered during exploration. We show that under realizabil
3、ity and optimal planning assumptions, our algorithm provably fi nds a near-optimal policy with a number of samples that is polynomial in a structural complexity measure which we show to be low in several natural settings. We then give a practical approximation using neural networks and demonstrate i
4、ts performance and sample effi ciency in practice. 1Introduction Whatisagoodalgorithmforsystematicallyexploringanenvironmentforthepurposeofreinforcement learning? A good answer could make the application of deep RL to complex problems 31,30,28,20 much more sample effi cient. In tabular Markov Decisi
5、on Processes (MDPs) with a small number of discrete states, model-based algorithms which perform exploration in a provably sample-effi cient manner have existed for over a decade 24,5,45 . The fi rst of these, known as the Explicit Explore- Exploit (E3) algorithm 24, progressively builds a model of
6、the environments dynamics. At each step, the agent uses this model to plan, either to explore and reach an unknown state, or to exploit and maximize its reward within the states it knows well. By actively seeking out unknown states, the algorithm provably learns a near-optimal policy using a number
7、of samples which is at most polynomial in the number of states. Many problems of interest, however, have a set of states which is infi nite or extremely large (for example, all images represented with fi nite precision), and in these settings, tabular algorithms are no longer applicable. In this wor
8、k, we propose a newE3-style algorithm which operates in large or continuous state spaces. The algorithm maintains a set of dynamics models which are consistent with the agents current experience, and explores the environment by executing policies designed to induce high disagreement between their pr
9、edictions. We show that under realizability and optimal planning assumptions, our algorithm provably fi nds a near-optimal policy using a number of samples from the environment which is independent of the number of states, and is instead polynomial in the rank of the model misfi t matrix, a structur
10、al complexity measure which we show to be low in natural settings such as small tabular MDPs, large MDPs with factored transition dynamics 23 and (potentially infi nite) low rank MDPs. We then present a practical version of the algorithm using neural networks, and demonstrate its performance and sam
11、ple effi ciency empirically on several challenging exploration problems with large or continuous state spaces. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. Algorithm 1 (M,n,?,) 1:InputsInitial model setM, policy class, number of trajectoriesn, tolerance
12、?, model error. 2:M1 M 3:Initialize replay buffer R . 4:for t = 1,2,. do 5:t explore = argmax h vexplore(,Mt) i 6:if vexplore(t explore,Mt) ? |A| then 7:Collect dataset of n trajectories following t explore, add to replay buffer R 8:Mt+1 UpdateModelSet(Mt,R,) 9:else 10:Choose any M Mt 11:exploit= ar
13、gmax h vexploit(, M) i 12:Halt and return exploit 13:end if 14:end for 2Algorithm We consider an episodic, fi nite-horizon MDP setting defi ned by a tuple(S,A,M?,R?,H). Here S is a set of states (which could be large or infi nite),Ais a discrete set of actions,M?is the true (unknown) transition mode
14、l mapping state-action pairs to distributions over next states,R?is the true function mapping states to rewards in0,1, andHis the horizon length. For simplicity we assume rewards are part of the state and the agent has access toR?, so the task of predicting future rewards is included in that of pred
15、icting future states. A state s S at time step h will be denoted by sh. The general form of our algorithm is given by Algorithm 1. At each epocht, the algorithm maintains a set of dynamics modelsMtwhich are consistent with the experience accumulated so far, and searches for an exploration policy whi
16、ch will induce high disagreement between their predictions. If such a policy is found, it is executed and the set of models is updated to refl ect the new experience. Otherwise, the algorithm switches to its exploit phase and searches for a policy which will maximize its predicted future rewards. Le
17、tP ,h M ()denote the distribution over states at time stephinduced by sampling actions from policy and transitions from modelM, and letD(,M,M0,h) = (P ,h M (),P ,h M0 (), wheredenotes a distance measure between probability distributions such as KL divergence or total variation. The quantity which th
18、e exploration policy seeks to maximize at epoch t is given by: vexplore(,Mt) =max M,M0Mt H X h=1 D(,M,M0,h) Maximizing this quantity can be viewed as solving a fi ctitious exploration MDP, whose state space is the concatenation of|Mt|state vectors in the original MDP, whose transition matrix consist
19、s of a block-diagonal matrix whose blocks are the transition matrices of the models inMt, and whose reward function is the distance measured usingbetween the pairs of components of the state vector corresponding to different models. Importantly, searching for an exploration policy can be done intern
20、ally by the agent and does not require any environment interaction, which will be key to the algorithms sample effi ciency. Once the agent can no longer fi nd a policy which induces suffi cient disagreement between its candidate models in Mt, it chooses a model and computes an exploitation policy us
21、ing the models predicted reward: vexploit(,M) = H X h=1 X sh P ,h M (sh)R?(sh) 2 3Sample Complexity Analysis 3.1Algorithm Instantiation We fi rst give an instantiation of Algorithm 1, called DREEM (DisagReement-led Elimination of Environment Models), for which we will prove sample complexity results
22、. All proofs can be found in Appendix A. The algorithm starts with a large set of candidate modelsM, which is assumed to contain the true model, and iteratively eliminates models which are not consistent with the experience gathered through the exploration policy. We will show that the number of sam
23、ples needed to fi nd a near-optimal policy is independent of the number of states, and is instead polynomial in |A|,H,log|M|, and the rank of the model misfi t matrix , a quantity which we defi ne below and which is low in natural settings. DREEM is identical to Algorithm 1, with theUpdateModelSetsu
24、broutine instantiated as follows: Algorithm 2 UpdateModelSet(Mt,R,) 1:For eachM Mt,h H, compute f W(t explore,M,h)using data fromRcollected using the last exploration policy t explore. 2:Mt+1 M Mt: f W(t explore,M,h) for all h H 3:Return Mt+1 The quantityW(,M,h)can be thought of as the error of mode
25、lMin parts of the state space visited by at time step h, and is formally defi ned below: Defi nition 1. The misfi t of model M discovered by policy at time step h is given by: W(,M,h) = Esh1P,h1 M? ,ah1U(A) ?kP M(|sh1,ah1) PM?(|sh1,ah1)kTV ? The empirical misfi t estimated using a dataset collected
26、by following is denoted f W(,M,h). We will make use of the following two assumptions: Assumption 1. Mcontains the true modelM?andcontains optimal policies for all models inM. Assumption 2. The policy optimizations in Algorithm 1 are performed exactly. The fi rst is a standard realizability assumptio
27、n. The second assumes access to an optimal planner and has been used in several previous works 24,23,22,5. This does not mean that the planning problem is trivial, but is meant to separate the diffi culty of planning from that of exploration. We note that DREEM will not be computationally feasible f
28、or many problems since the setsMtwill often be large, and the algorithm requires iterating over them during both the elimination and planning steps. However, it distills the key ideas of Algorithm 1 and demonstrates its sample effi ciency when optimizations can be performed exactly. We will later gi
29、ve a practical instantiation of Algorithm 1 and demonstrate its sample effi ciency empirically. 3.2Structural Complexity Measure Since we are considering settings whereS is large or infi nite, it is not meaningful to give sample complexity results in terms of the number of states, as is often done f
30、or tabular algorithms. We instead use a structural complexity measure which is independent of the number of states, and depends on the maximum rank over a set of error matrices, which we defi ne next. 1 Defi nition 2. (Model Misfi t Matrices) LetMbe a model class and a policy class. Defi ne the set
31、of matrices A1,.,AH R|M|by Ah(,M) = W(,M,h) for all and M M. Using the ranks of error matrices as complexity measures of RL environments was proposed in 21,47 . Although the model misfi t matricesAhmay themselves be very large, we show next that their ranks are in fact small in several natural setti
32、ngs. 1We use a generalized notion of rank with a condition on the row norms of the factorization: for an m n matrixB, denoterank(B,)to be the smallest integerksuch thatB = UV withU Rmk,V Rnk and for every pair of rows ui,vjwe have kuik2 kvjk2 . appears in Lemma 3 and Theorem 1. 3 Proposition 1. Assu
33、me |S| is fi nite and let Ah be the matrix defi ned above. Then rank(Ah) |S|. Proposition 2.Letdenote the true transition matrix of size|S| |S A|, with(s0,(s,a) = PM?(s0|s,a). Assume that there exist two matrices1,2of sizes|S| KandK |S A|such that = 12. Then rank(Ah) K. The next proposition, which i
34、s a straightforward adaptation of a result from 47 2, shows that the ranks of the model misfi t matrices are also low in factored MDPs 23. Proposition 3.Consider a factored MDP setting where the state space is given byS = Odwhere d NandO is a small fi nite set, and the transition matrix has a factor
35、ed structure withLparameters. Then rank(Ah) L. 3.3Sample Complexity Now that we have defi ned our structural complexity measure, we prove sample complexity results for DREEM. We will use a slightly different defi nition ofD(,M,M0)than the one in Section 2, in that the last action is sampled uniforml
36、y: Defi nition 3. (Predicted Model Disagreement Induced by Policy) D(,M,M0,h) = X sh1 X ah1 X sh ? ? ?P M(sh|sh1,ah1)P ,h1 M (sh1)U(ah) PM0(sh|sh1,ah1)P ,h1 M0 (sh1)U(ah) ? ? ? We begin by proving a lemma which, intuitively, states that if a policy induces disagreement between two models of the envi
37、ronment, then it will also induce disagreement between at least one of these models and the true model. This means that by searching for and then executing a policy which induces disagreement between at least two models, the agent will collect experience from the environment which will enable it to
38、invalidate at least one of them. Lemma 1.LetMbe a set of models anda set of policies. If there existM,M0 M, and h Hsuch thatD(,M,M0,h) , then there existsh0 hsuch thatW(,M,h0) 4|A|H or W(,M0,h0) 4|A|H (or both). Next, we give a lemma which states that at any time step, the agent has either found an
39、exploration policy which will lead it to collect experience which allows it to reduce its set of candidate models, or has found an exploitation policy which is close to optimal. Herevis the value ofin the true MDP. Lemma 2.(Explore or Exploit) Suppose the true modelM?is never eliminated. At iteratio
40、nt, one of the following two conditions must hold: either there existsM Mt,ht Hsuch that W(t explore,M,ht) ? 4H2|A|2, or the algorithm returns exploit such that vexploit v? ?. The above two lemmas state that at any time step, the agent either reduces its set of candidate models or fi nds an exploita
41、tion policy which is close to optimal. However, since the initial set of candidate models may be very large, we need to ensure that many models are discarded at each exploration step. Our next lemma bounds the number of iterations of the algorithm by showing that the set of candidate models is reduc
42、ed by a constant factor at every step. Lemma 3.(Iteration Complexity) Letd = max1hHrank(Ah)and = ? 24H2|A|2d. Suppose that|fW(t explore,M,h) W( t explore,M,h)| holds for allt,h H andM M. Then the number of rounds of Algorithm 1 with theUpdateModelSetroutine given by Algorithm 2 is at most Hdlog( 2)/
43、log(5/3). The proof operates by representing each matrixAhin factored form, which induces an embedding of each model inMtin ad-dimensional space. Minimum volume ellipsoids are then constructed around these embeddings. A geometric argument shows that the volume of these ellipsoids shrinks 2Appendix E
44、.2, Proposition 2 4 by a constant factor from one iteration of the algorithm to the next, leading to a number of updates linear in d. Combining the previous lemmas with a concentration argument, we get our main result: Theorem 1.Assuming thatM?M, for any?,(0,1set= ? 24H2|A|2d and denoteT=Hdlog( 2)/l
45、og(5/3). Run Algorithm 1 with inputs(M,n,)wheren = (H4|A|4dlog(T|M|/)/?2), and theUpdateModelSetroutine is given by Algorithm 2. Then with probability at least1 , Algorithm 1 outputs a policyexploitsuch thatvexploit v? ?. The number of trajectories collected is at most O ? H5d2|A|4 ?2 log ? T|M| ? .
46、 Note that the above result requires knowledge ofdto set theandnparameters. If this quantity is unknown, it can be estimated using a doubling trick which does not affect the algorithms asymptotic sample complexity. Details can be found in Appendix A.2. 4Neural-E3: A Practical Instantiation The above
47、 analysis shows that Algorithm 1 is sample effi cient given an idealized instantiation, which may not be computationally practical for large model classes. Here we give a computationally effi cient instantiation called Neural-E3, which requires implementing theUpdateModelSetroutine and the planning
48、routines. 4.1Model Updates We representMtas an ensemble of action-conditional dynamics modelsM1,.,ME, parame- terized by neural networks, which are trained to model the next-state distributionPM?(sh+1|sh,a) using the data from the replay buffer R. The models are trained to minimize the following los
49、s: L(M,R) = E(sh+1,ah,sh)RlogPM(sh+1|sh,ah) The models inM1are initialized with random weights and the subroutineUpdateModelSetin Algorithm 1 takes as inputMt, performsNupdategradient updates to each of the models using different minibatches sampled fromR, and returns the updated set of modelsMt+1.
50、The dynamics models can be deterministic or stochastic (for example, Mixture Density Networks 4 or Variational Autoencoders 26). 4.2Planning The exploration and exploitation phases require computing a policy to optimizevexploreorvexploit and executing it in the environment. If the environment is det
51、erministic, policies can be represented as action sequences, in which case we use a generalized version of breadth-fi rst search applicable in continuous state spaces. This uses a priority queue, where expanded states are assigned a priority based on their minimum distance to other states in the cur
52、rently expanded search tree. Details can be found in Appendix B.2.1. For stochastic environments, we used implicit policies obtained using Monte-Carlo Tree Search (MCTS) 10, where each node in the tree consists of empirical distributions predicted by the different models conditioned on the action se
53、quence leading to the node. The agent only executes the fi rst action of the sequence returned by the planning procedure, and replans at every step to account for the stochasticity of the environment. See Appendix B.2.2 for details. 4.3Exploitation with Off-Policy RL For some problems with sparse re
54、wards, it may be computationally impractical to use planning during the exploitation phase, even with a perfect model. Note that much of the exploration phase, which uses model disagreement as a fi ctitious reward, can be seen as an MDP with dense rewards, while the rewards in the true MDP may be sp
55、arse. In these settings, we use an alternative approach where a parameterized value function such as a DQN 31 is trained using the experience collected in the replay buffer during exploration. This can be done offl ine without collecting additional samples from the environment. 5 4.4Relationship bet
56、ween Idealized and Practical Algorithms For both the idealized and practical algorithms,Mtrepresents a set of models with low error on the current replay buffer. In the idealized algorithm, models with high error are eliminated explicitly in Algorithm 2, while in the practical algorithm, models with
57、 high error are avoided by the optimization procedure. The main difference between the two algorithms is that the idealized version maintains all models in the model class which have low error (which includes the true model), whereas the practical version only maintains a subset due to time and memo
58、ry constraints. A potential failure mode of the practical algorithm would be if all the models wrongly agree in their predictions in some unexplored part of the state-action space which leads to high reward. However, in practice we found that using different initializations and minibatches was suffi
59、 cient to obtain a diverse set of models, and that using even a relatively small ensemble (4 to 8 models) led to successful exploration. 5Related Work Theoretical guarantees for a number of model-based RL algorithms exist in the tabular setting 24,5,45,43 and in the continuous setting when the dynamics are assumed to be linear 49,1,11. The Metric-E3algorithm 22 operates in general state spaces, but its sample complexity depends on the covering number which may be exponential in dimension. The algorithm of 29 addresses general mod
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2026年遗传实验预测试题及答案
- 2026年智力130测试题及答案
- 2026年与宿舍相关的测试题及答案
- 2026年考细节的测试题及答案
- 2026年合规意识测试题及答案
- 2026年初二透镜测试题及答案
- 2026年路虎大赛测试题及答案
- 2026年物业项目主管测试题及答案
- 第一章 问题研究:人类是否需要人造月亮 教学设计-高二上学期地理人教版(2019)选择性必修1
- 防暴指导员安全强化模拟考核试卷含答案
- 2026年机关单位档案管理应知应会知识测试题
- 2026年过程装备资产管理与完整性的结合
- 2026江苏苏州市健康养老产业发展集团有限公司下属子公司招聘44人(第一批)笔试历年典型考点题库附带答案详解
- 2026广东江门开平市招聘事业单位工作人员53人考试参考试题及答案解析
- 医药经销商现场审计制度
- 物业管理安全生产检查自查表样例
- 电力5G通信模组测试规范
- (2025版)微针点阵射频临床应用专家共识
- 2025年注册会计师公司战略与风险管理试题测试题及答案
- 船舶防污染监督制度
- 2026年高考物理上海卷含解析及答案
评论
0/150
提交评论