Learning continuous time control policies by minimizing the Hamilton-Jacobi-Bellman residual

Michael Lutter1, Boris Belousov1, Kim Listmann2, Debora Clever1,2 and Jan Peters1,3

Specifying a task through a reward function and letting an agent autonomously discover a corresponding controller promises to simplify the programming of complex robotic behaviors by reducing the amount of manual engineering required. Previous research has demonstrated that such an approach can successfully generate robot controllers capable of dexterous manipulation and locomotion. These controllers were obtained via reinforcement learning or trajectory optimization. While reinforcement learning optimizes a possibly non-linear policy under the assumption of unknown rewards, dynamics and actuation limits, trajectory optimization plans a sequence of n actions and states using a known model, reward function, initial state and actuator limits. When applied to the physical system, the planned trajectories must be augmented with a hand-tuned tracking controller to compensate for modeling errors.

To obtain a globally optimal feedback policy that naturally obeys the actuator limits, without randomly sampling actions on the system as in reinforcement learning, we propose to incorporate the actuator limits within the cost function and to obtain the corresponding optimal feedback controller by learning the value function using the Hamilton-Jacobi-Bellman (HJB) differential equation. Assuming the structure inherent to most robotic tasks, i.e., control-affine dynamics due to the holonomicity of mechanical systems and separable costs c(x, u) = q(x) + g(u) with states x, actions u, state cost q and strictly convex action cost g, we derive the optimal policy in closed form using the HJB. Given the true optimal value function, this policy is globally optimal on the state domain, guaranteed to be stable, and does not require any replanning or hand-tuning of feedback gains. Furthermore, the closed-form policy enables shaping of the optimal policy and deriving the corresponding cost function such that the shaped policy is optimal. Therefore, we can incorporate the action limits implicitly by limiting the range of the optimal policy and deriving the corresponding cost function.

Fig. 1: (a-c) Learned value function for the pendulum with log-cosine cost and the trajectories for HJB control (a), multiple shooting (b) and LQR (c) from 300 randomly sampled starting configurations; each phase plot shows the angle (rad) against the angular velocity (rad/s). (d) Cost distributions p(c) for the sampled starting configurations.

To obtain the optimal value function using the known system model, we propose to embed a deep differential network in the HJB and to learn the network weights by minimizing the HJB residual while applying a curricular learning scheme. The curricular learning scheme adapts the discounting from short- to far-sighted to ensure learning of the optimal policy despite the multiple spurious solutions of the HJB.
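To make the construction above concrete, the following sketch illustrates the main ingredients for the torque-limited pendulum: the gradient of a learned value network yields the bounded optimal action in closed form, and the network weights are fitted by minimizing the squared HJB residual on randomly sampled states while the discount rate is annealed from short- to far-sighted. This is a minimal illustration, not the paper's implementation: the pendulum parameters, state cost q, action-cost scale beta, network architecture (a plain MLP standing in for the deep differential network) and annealing schedule are all assumptions, and the atanh-integral (barrier-shaped) action cost is one standard way to obtain a tanh-bounded closed-form policy of the kind described above.

```python
# Hedged sketch: HJB residual minimization for a torque-limited pendulum.
# One common continuous-time formulation is used: rho*V(x) = min_u [q(x) + g(u) + dV/dx^T (a(x) + B(x)u)].
# All parameters, costs and schedules below are illustrative assumptions, not the paper's settings.
import math
import torch
import torch.nn as nn

# Control-affine pendulum dynamics dx/dt = a(x) + B(x) u (illustrative parameters).
m, l, d, grav, u_max = 1.0, 1.0, 0.1, 9.81, 2.0   # mass, length, damping, gravity, torque limit

def a_fn(x):                       # drift term a(x); x = [theta, theta_dot]
    th, thd = x[:, 0:1], x[:, 1:2]
    return torch.cat([thd, (grav / l) * torch.sin(th) - d * thd], dim=1)

def B_fn(x):                       # input matrix B(x) (constant here), shape (batch, 2, 1)
    B = torch.tensor([[0.0], [1.0 / (m * l ** 2)]])
    return B.expand(x.shape[0], 2, 1)

def q_fn(x):                       # state cost q(x); simple stand-in for the paper's log-cosine cost
    th, thd = x[:, 0], x[:, 1]
    return (1.0 - torch.cos(th)) + 0.01 * thd ** 2

# Value network V(x); a plain MLP standing in for the deep differential network.
value_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))

def log_cosh(z):                   # numerically stable log(cosh(z))
    return z.abs() + torch.log1p(torch.exp(-2.0 * z.abs())) - math.log(2.0)

def hjb_residual(x, rho, beta=0.1):
    """HJB residual for the separable cost c = q(x) + g(u), where g is the barrier-shaped
    action cost g(u) = beta * integral of atanh(v/u_max) dv, whose minimizer is bounded."""
    x = x.requires_grad_(True)
    V = value_net(x)                                               # (batch, 1)
    dVdx = torch.autograd.grad(V.sum(), x, create_graph=True)[0]   # (batch, 2)
    a, B = a_fn(x), B_fn(x)
    BTdV = torch.bmm(B.transpose(1, 2), dVdx.unsqueeze(2)).squeeze(2)  # B^T dV/dx, (batch, 1)
    z = -BTdV / beta
    u_star = u_max * torch.tanh(z)                                 # closed-form bounded policy
    g_star = beta * u_max * (z * torch.tanh(z) - log_cosh(z))      # g(u*) evaluated in closed form
    f = a + torch.bmm(B, u_star.unsqueeze(2)).squeeze(2)           # dx/dt under u*
    hamiltonian = q_fn(x) + g_star.squeeze(1) + (dVdx * f).sum(dim=1)
    return rho * V.squeeze(1) - hamiltonian                        # residual of rho*V = min_u H

# Training loop with a discount curriculum (short- to far-sighted), on sampled states only.
opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)
for epoch in range(2000):
    rho = 10.0 * 0.01 ** (epoch / 1999)            # anneal discount rate from myopic to far-sighted
    x = torch.stack([torch.empty(256).uniform_(-math.pi, math.pi),
                     torch.empty(256).uniform_(-10.0, 10.0)], dim=1)
    loss = hjb_residual(x, rho).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the residual is evaluated on states sampled from the domain using the known model, the loss requires neither simulated rollouts nor interaction with the physical system, which is what permits the end-to-end claim below.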
Therefore, our approach enables the end-to-end learning of the optimal value function without simulating or running the system.

To evaluate the proposed approach, the learned optimal feedback policy was applied to the torque-limited pendulum and compared to shooting methods and the linear quadratic regulator (LQR). The learned value function and the state trajectories from 300 starting configurations are shown in Figure 1. Our approach learns the discontinuous value function with the ramps leading to the balancing point. These ramps are caused by the torque limits, which prevent a direct swing-up of the pendulum. The learned optimal policy can swing up the pendulum from all 300 starting configurations (Fig. 1a) and achieves a cost distribution similar to multiple shooting (Fig. 1d), which only obtains single optimal trajectories and no feedback controller. In contrast, LQR can only balance the pendulum for a few starting configurations due to the linearization and the torque limits.

Acknowledgement

1 TU Darmstadt. 2 ABB Corporate Research Center Germany. 3 MPI for Intelligent Systems. This project has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 640554 (SKILLS4ROBOTS). Furthe
