




Regularizing Trajectory Optimization with Denoising Autoencoders

Rinu Boney (Aalto University & Curious AI), Norman Di Palo (Sapienza University of Rome), Mathias Berglund (Curious AI), Alexander Ilin (Aalto University & Curious AI), Juho Kannala (Aalto University), Antti Rasmus (Curious AI), Harri Valpola (Curious AI)

Equal contribution, rest in alphabetical order. Preprint. Under review.

Abstract

Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency.

1 Introduction
State-of-the-art reinforcement learning (RL) often requires a large number of interactions with the environment to learn even relatively simple tasks [11]. It is generally believed that model-based RL can provide better sample efficiency [9, 2, 5], but showing this in practice has been challenging. In this paper, we propose a way to improve planning in model-based RL and show that it can lead to improved performance and better sample efficiency.

In model-based RL, planning is done by computing the expected result of a sequence of future actions using an explicit model of the environment. Model-based planning has been demonstrated to be efficient in many applications where the model (a simulator) can be built using first principles. For example, model-based control is widely used in robotics and has been used to solve challenging tasks such as human locomotion [34, 35] and dexterous in-hand manipulation [21].

In many applications, however, we often do not have the luxury of an accurate simulator of the environment. Firstly, building even an approximate simulator can be very costly even for processes whose dynamics are well understood. Secondly, it can be challenging to align the state of an existing simulator with the state of the observed process in order to plan. Thirdly, the environment is often non-stationary due to, for example, hardware failures in robotics, changes of the input feed, and deactivation of materials in industrial process control. Thus, learning the model of the environment is the only viable option in many applications, and learning needs to be done for a live system. And since many real-world systems are very complex, we are likely to need powerful function approximators, such as deep neural networks, to learn the dynamics of the environment.

However, planning using a learned (and therefore inaccurate) model of the environment is very difficult in practice. The process of optimizing the sequence of future actions to maximize the expected return (which we call trajectory optimization) can easily exploit the inaccuracies of the model and suggest a very unreasonable plan which produces highly over-optimistic predicted rewards. This optimization process works similarly to adversarial attacks [1, 13, 33, 7], where the input of a trained model is modified to achieve the desired output. In fact, a more efficient trajectory optimizer is more likely to fall into this trap. This can arguably be the reason why gradient-based optimization (which is very efficient at, for example, learning the models) has not been widely used for trajectory optimization.

In this paper, we study this adversarial effect of model-based planning in several environments and show that it poses a problem particularly in high-dimensional control spaces. We also propose to remedy this problem by regularizing trajectory optimization using a denoising autoencoder (DAE) [37]. The DAE is trained to denoise trajectories that appeared in the past experience, and in this way the DAE learns the distribution of the collected trajectories. During trajectory optimization, we use the denoising error of the DAE as a regularization term that is subtracted from the maximized objective function. The intuition is that the denoising error will be large for trajectories that are far from the training distribution, signaling that the dynamics model predictions will be less reliable as it has not been trained on such data. Thus, a good trajectory has to give a high predicted return and it can be only moderately novel in the light of past experience.

In the experiments, we demonstrate that the proposed regularization significantly diminishes the adversarial effect of trajectory optimization with learned models. We show that the proposed regularization works well with both gradient-free and gradient-based optimizers (experiments are done with the cross-entropy method [3] and Adam [14]) in both open-loop and closed-loop control. We demonstrate that improved trajectory optimization translates to excellent results in early parts of training in standard motor-control tasks and achieves competitive performance after a handful of interactions with the environment.

2 Model-Based Reinforcement Learning
In this section, we explain the basic setup of model-based RL and present the notation used. At every time step t, the environment is in state s_t, the agent performs action a_t, receives reward r_t = r(s_t, a_t), and the environment transitions to new state s_{t+1} = f(s_t, a_t). The agent acts based on the observations o_t = o(s_t), which are a function of the environment state. In a fully observable Markov decision process (MDP), the agent observes the full state, o_t = s_t. In a partially observable Markov decision process (POMDP), the observation o_t does not completely reveal s_t. The goal of the agent is to select actions a_0, a_1, ... so as to maximize the return, which is the expected cumulative reward E[\sum_t r(s_t, a_t)].

In the model-based approach, the agent builds a dynamics model of the environment (forward model). For a fully observable environment, the forward model can be a fully-connected neural network trained to predict the state transition from time t to t+1:

    s_{t+1} = f(s_t, a_t).    (1)

In partially observable environments, the forward model can be a recurrent neural network trained to directly predict the future observations based on past observations and actions:

    o_{t+1} = f(o_0, a_0, ..., o_t, a_t).    (2)

In this paper, we assume access to the reward function and that it can be computed from the agent observations, that is, r_t = r(o_t, a_t).
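To make equation (1) concrete, the following is a minimal sketch of such a fully-connected forward model in PyTorch. The architecture, layer sizes, and the names DynamicsModel and train_step are illustrative assumptions for this sketch, not the networks used in the paper.

import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Fully-connected forward model: predicts s_{t+1} from (s_t, a_t), cf. Eq. (1)."""

    def __init__(self, state_dim, action_dim, hidden_dim=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action):
        # Predict the change in state and add it to the current state
        # (a common parameterization; predicting the absolute next state also works).
        delta = self.net(torch.cat([state, action], dim=-1))
        return state + delta

def train_step(model, optimizer, s, a, s_next):
    # One supervised step on a batch of observed transitions (s_t, a_t, s_{t+1}).
    loss = nn.functional.mse_loss(model(s, a), s_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()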
At each time step t, the agent uses the learned forward model to plan a sequence of future actions a_t, ..., a_{t+H} so as to maximize the expected cumulative future reward:

    G(a_t, ..., a_{t+H}) = E[ \sum_{\tau=t}^{t+H} r(o_\tau, a_\tau) ]

    a_t, ..., a_{t+H} = \arg\max G(a_t, ..., a_{t+H}).

This process is called trajectory optimization. The agent uses the learned model of the environment to compute the objective function G(a_t, ..., a_{t+H}): the model (1) or (2) is unrolled H steps into the future using the current plan a_t, ..., a_{t+H}.

Algorithm 1: End-to-end model-based reinforcement learning
  Collect data D by a random policy.
  for each episode do
    Train dynamics model f using D.
    for time t until the episode is over do
      Optimize trajectory a_t, o_{t+1}, ..., a_{t+H}, o_{t+H+1}.
      Implement the first action a_t and get new observation o_t.
    end for
    Add data (s_1, a_1, ..., a_T, o_T) from the last episode to D.
  end for
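The following is a minimal sketch of the loop in Algorithm 1, assuming a classic gym-style environment interface. The helpers collect_random_episode, train_model, and optimize_trajectory are placeholders introduced for illustration; the trajectory optimizer in particular is treated as a black box here.

def model_based_rl(env, model, train_model, optimize_trajectory,
                   num_episodes, horizon):
    # D: dataset of (observation, action) trajectories.
    D = [collect_random_episode(env)]      # placeholder: initial data from a random policy
    for episode in range(num_episodes):
        train_model(model, D)               # fit the forward model on all data so far
        obs, done, trajectory = env.reset(), False, []   # classic gym API assumed
        while not done:
            # Plan H steps ahead with the learned model, then apply only
            # the first action and re-plan at the next step (MPC).
            plan = optimize_trajectory(model, obs, horizon)
            action = plan[0]
            trajectory.append((obs, action))
            obs, reward, done, info = env.step(action)
        D.append(trajectory)                # grow the dataset with the new episode
    return model, D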
The optimized sequence of actions from trajectory optimization can be directly applied to the environment (open-loop control). It can also be provided as suggestions to a human operator, with the possibility for the human to change the plan (human-in-the-loop). Open-loop control is challenging because the dynamics model has to be able to make accurate long-range predictions. An approach which works better in practice is to take only the first action of the optimized trajectory and then re-plan at each step (closed-loop control). Thus, in closed-loop control, we account for possible modeling errors and feedback from the environment. In the control literature, this flavor of model-based RL is called model-predictive control (MPC) [22, 30, 16, 24].

The typical sequence of steps performed in model-based RL is: 1) collect data, 2) train the forward model f, 3) interact with the environment using MPC (this involves trajectory optimization in every time step), 4) store the data collected during the last interaction and continue to step 2. The algorithm is outlined in Algorithm 1.

3 Regularized Trajectory Optimization

3.1 Problem with using learned models for planning
In this paper, we focus on the inner loop of model-based RL, which is trajectory optimization using a learned forward model f. Potential inaccuracies of the trained model cause substantial difficulties for the planning process. Rather than optimizing what really happens, planning can easily end up exploiting the weaknesses of the predictive model. Planning is effectively an adversarial attack against the agent's own forward model. This results in a wide gap between expectations based on the model and what actually happens.

We demonstrate this problem using a simple industrial process control benchmark from [28]. The problem is to control a continuous nonlinear reactor by manipulating three valves which control flows in two feeds and one output stream. Further details of the process and the control problem are given in Appendix A. The task considered in [28] is to change the product rate of the process from 100 to 130 kmol/h. Fig. 1a shows how this task can be performed using a set of PI controllers proposed in [28]. We trained a forward model of the process using a recurrent neural network (2) and the data collected by implementing the PI control strategy for a set of randomly generated targets. Then we optimized the trajectory for the considered task using gradient-based optimization, which produced the results in Fig. 1b. One can see that the proposed control signals change abruptly and the trajectory imagined by the model significantly deviates from reality. For example, the pressure constraint (of max 3000 kPa) is violated. This example demonstrates how planning can easily exploit the weaknesses of the predictive model.
3.2 Regularizing Trajectory Optimization with Denoising Autoencoders

We propose to regularize trajectory optimization with denoising autoencoders (DAE). The idea is that we want to reward familiar trajectories and penalize unfamiliar ones, because the model is likely to make larger errors for the unfamiliar ones. This can be achieved by adding a regularization term to the objective function:

    G_reg = G + \alpha \log p(o_t, a_t, ..., o_{t+H}, a_{t+H}),    (3)

where p(o_t, a_t, ..., o_{t+H}, a_{t+H}) is the probability of observing the given trajectory in the past experience and \alpha is a tuning hyperparameter.
Figure 1: Open-loop planning for a continuous nonlinear two-phase reactor from [28]. Panels: (a) Multiloop PI control, (b) No regularization, (c) DAE regularization. Three subplots in every subfigure show three measured variables (solid lines): product rate, pressure, and A in the purge. The black curves represent the model's imagination while the red curves represent the reality if those controls are applied in an open-loop mode. The targets for the variables are shown with dashed lines. The fourth (lower right) subplots show the three manipulated variables: valve for feed 1 (blue), valve for feed 2 (red), and valve for stream 3 (green).

Figure 2: Example: fragment of the computational graph used during trajectory optimization in an MDP. Here, the window size is w = 1, that is, the DAE penalty term is c_1 = ||g(s_1, a_1) - (s_1, a_1)||^2.
In practice, instead of using the joint probability of the whole trajectory, we use marginal probabilities over short windows of size w:

    G_reg = G + \alpha \sum_{\tau=t}^{t+H-w} \log p(x_\tau),    (4)

where x_\tau = (o_\tau, a_\tau, ..., o_{\tau+w}, a_{\tau+w}) is a short window of the optimized trajectory.

Suppose we want to find the optimal sequence of actions by maximizing (4) with a gradient-based optimization procedure. We can compute the gradients \partial G_reg / \partial a_i by backpropagation in a computational graph where the trained forward model is unrolled into the future (see Fig. 2). In such a backpropagation-through-time procedure, one needs to compute the gradient with respect to the actions a_i:

    \partial G_reg / \partial a_i = \partial G / \partial a_i + \alpha \sum_{\tau=i}^{i+w} (\partial x_\tau / \partial a_i) \nabla_{x_\tau} \log p(x_\tau),    (5)

where we denote by x_\tau the concatenated vector of observations o_\tau, ..., o_{\tau+w} and actions a_\tau, ..., a_{\tau+w} over a window of size w. Thus, to enable a regularized gradient-based optimization procedure, we need means to compute \nabla_x \log p(x).
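As an illustration of the windows x_\tau used in (4), the sketch below extracts overlapping windows from a planned trajectory of observations and actions. The tensor shapes, the PyTorch framework, and the name trajectory_windows are assumptions made for this sketch only.

import torch

def trajectory_windows(obs, actions, w):
    """Build windows x_tau = (o_tau, a_tau, ..., o_{tau+w}, a_{tau+w}).

    obs:     tensor of shape (H+1, obs_dim), predicted observations o_t, ..., o_{t+H}
    actions: tensor of shape (H+1, act_dim), planned actions a_t, ..., a_{t+H}
    Returns a tensor of shape (H+1-w, (w+1)*(obs_dim+act_dim)).
    """
    steps = torch.cat([obs, actions], dim=-1)        # one row per time step
    n = steps.shape[0] - w
    windows = [steps[i:i + w + 1].reshape(-1) for i in range(n)]
    return torch.stack(windows)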
In order to evaluate \log p(x) (or its derivative), one needs to train a separate model p(x) of the past experience, which is the task of unsupervised learning. In principle, any probabilistic model can be used for that. In this paper, we propose to regularize trajectory optimization with a denoising autoencoder (DAE), which does not build an explicit probabilistic model p(x) but rather learns to approximate the derivative of the log probability density. The theory of denoising [23, 27] states that the optimal denoising function g(\tilde{x}) (for zero-mean Gaussian corruption) is given by

    g(\tilde{x}) = \tilde{x} + \sigma_n^2 \nabla_{\tilde{x}} \log p(\tilde{x}),

where p(\tilde{x}) is the probability density function for data \tilde{x} corrupted with noise and \sigma_n is the standard deviation of the Gaussian corruption. Thus, the DAE-denoised signal minus the original gives the gradient of the log-probability of the data distribution convolved with a Gaussian distribution:

    \nabla_{\tilde{x}} \log p(\tilde{x}) \propto g(x) - x.
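A DAE with the property above can be trained by reconstructing Gaussian-corrupted trajectory windows. Below is a minimal training sketch under assumed PyTorch conventions; the architecture, noise level, and the names DenoisingAutoencoder and dae_train_step are illustrative choices, not the ones used in the paper.

import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """g(x~): maps a corrupted window x~ back towards the data distribution."""

    def __init__(self, window_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, window_dim),
        )

    def forward(self, x):
        return self.net(x)

def dae_train_step(dae, optimizer, x, sigma_n=0.1):
    # Corrupt windows x with zero-mean Gaussian noise and reconstruct the clean windows.
    x_noisy = x + sigma_n * torch.randn_like(x)
    loss = nn.functional.mse_loss(dae(x_noisy), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()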
Assuming \nabla_{\tilde{x}} \log p(\tilde{x}) \approx \nabla_x \log p(x) yields

    \partial G_reg / \partial a_i = \partial G / \partial a_i + \alpha \sum_{\tau=i}^{i+w} (\partial x_\tau / \partial a_i) (g(x_\tau) - x_\tau).    (6)

Using \nabla_{\tilde{x}} \log p(\tilde{x}) instead of \nabla_x \log p(x) can behave better in practice because it is similar to replacing p(x) with its Parzen window estimate [36]. In automatic differentiation software, this gradient can be computed by adding the penalty term ||g(x) - x||^2 to G and stopping the gradient propagation through g. In practice, stopping the gradient through g did not yield any benefits in our experiments compared to simply adding the penalty term ||g(x) - x||^2 to the cumulative reward, so we used the simple penalty term in our experiments.
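The sketch below illustrates how such a penalty can enter a gradient-based planner: an action sequence is optimized with Adam on the model's predicted return minus the DAE denoising error. It reuses the hypothetical DynamicsModel, DenoisingAutoencoder, and trajectory_windows helpers sketched above and assumes a known reward function reward_fn; the weight alpha, the horizon, and the number of iterations are illustrative.

import torch

def plan_with_dae_regularization(model, dae, reward_fn, o0, horizon,
                                 act_dim, w=1, alpha=1.0, iters=100, lr=0.01):
    """Gradient-based trajectory optimization of G_reg = G - alpha * ||g(x) - x||^2."""
    actions = torch.zeros(horizon + 1, act_dim, requires_grad=True)
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        obs, obs_seq = o0, [o0]
        for t in range(horizon):
            obs = model(obs, actions[t])       # unroll the learned forward model
            obs_seq.append(obs)
        obs_seq = torch.stack(obs_seq)         # (H+1, obs_dim)
        G = reward_fn(obs_seq, actions).sum()  # predicted cumulative reward
        x = trajectory_windows(obs_seq, actions, w)
        penalty = ((dae(x) - x) ** 2).sum()    # DAE denoising error of the plan
        loss = -(G - alpha * penalty)          # maximize the regularized objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return actions.detach()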
Also, this kind of regularization can easily be used with gradient-free optimization methods such as the cross-entropy method (CEM) [3].
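For the gradient-free case, a hypothetical CEM planner simply scores each sampled action sequence with the same regularized objective. The sketch below reuses the illustrative helpers above; the population sizes and iteration count are arbitrary.

import torch

def cem_plan_with_dae(model, dae, reward_fn, o0, horizon, act_dim,
                      w=1, alpha=1.0, iters=10, pop=500, elites=50):
    """Cross-entropy method over action sequences, scored by G - alpha * ||g(x) - x||^2."""
    mean = torch.zeros(horizon + 1, act_dim)
    std = torch.ones(horizon + 1, act_dim)
    with torch.no_grad():                      # CEM needs only scores, no gradients
        for _ in range(iters):
            candidates = mean + std * torch.randn(pop, horizon + 1, act_dim)
            scores = []
            for actions in candidates:
                obs, obs_seq = o0, [o0]
                for t in range(horizon):
                    obs = model(obs, actions[t])
                    obs_seq.append(obs)
                obs_seq = torch.stack(obs_seq)
                x = trajectory_windows(obs_seq, actions, w)
                G = reward_fn(obs_seq, actions).sum()
                scores.append(G - alpha * ((dae(x) - x) ** 2).sum())
            elite_idx = torch.topk(torch.stack(scores), elites).indices
            elite = candidates[elite_idx]
            mean, std = elite.mean(dim=0), elite.std(dim=0)   # refit the sampling distribution
    return mean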
Our goal is to tackle high-dimensional problems and expressive models of dynamics. Neural networks tend to fare better than many other techniques in modeling high-dimensional distributions. However, using a neural network or any other flexible parameterized model to estimate the input distribution poses a dilemma: the regularizing network, which is supposed to keep planning from exploiting the inaccuracies of the dynamics model, will itself have weaknesses which planning will then exploit. Clearly, the DAE will also have inaccuracies, but planning will not exploit them because, unlike most other density models, the DAE develops an explicit model of the gradient of the logarithmic probability density. The effect of adding DAE regularization in the industrial process control benchmark discussed in the previous section is shown in Fig. 1c.
3.3 Related work

Several methods have been proposed for planning with learned dynamics models. Locally linear time-varying models [17, 19] and Gaussian processes [8, 15] or mixtures of Gaussians [29] are data-efficient but have problems scaling to high-dimensional environments. Recently, deep neural networks have been successfully applied to model-based RL. Nagabandi et al. [24] use deep neural networks as dynamics models in model-predictive control to achieve good performance, and then show how model-based RL can be fine-tuned with a model-free approach to achieve even better performance. Chua et al. [5] introduce PETS, a method to improve model-based performance by estimating and propagating uncertainty with an ensemble of networks and sampling techniques. They demonstrate how their approach can beat several recent model-based and model-free techniques. Clavera et al. [6] combine model-based RL and meta-learning with MB-MPO, training a policy to quickly adapt to slightly different learned dynamics models, thus enabling faster learning.

Levine and Koltun [20] and Kumar et al. [17] use a KL divergence penalty between action distributions to stay close to the training distribution. Similar bounds are also used to stabilize training of policy gradient methods [31, 32]. While such a KL penalty bounds the evolution of action distributions, the proposed method also bounds the familiarity of states, which could be important in high-dimensional state spaces. While penalizing unfamiliar states also penalizes exploration, it allows for more controlled and efficient exploration. Exploration is out of the scope of this paper but was studied in [10], where a non-zero optimum of the proposed DAE penalty was used as an intrinsic reward to alternate between familiarity and exploration.

4 Experiments on Motor Control

We show the effect of the proposed regularization for control in standard MuJoCo environments: Cartpole, Reacher, Pusher, Half-cheetah, and Ant, available in [4]. See the description of the environments in Appendix B. We use the Probabilistic Ensembles with Trajectory Sampling (PETS) model from [5] as the baseline, which achieves the best repor