




Robotic Tracking Control with Kernel Trick-based Reinforcement Learning

Yazhou Hu^1, Wenxue Wang^2, Hao Liu^3 and Lianqing Liu^2

^1 The author is with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, 110016, China, and also with the University of Chinese Academy of Sciences, Beijing, 100049, China.
^2 The authors are with the State Key Laboratory of Robotics, Shenyang Institute of Automation, Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, 110016, China (wangwenxue, liulianqing).
^3 The author is with the Department of Mathematics, Georgia Institute of Technology, Atlanta, GA, 30332, USA.
Corresponding author.

Abstract— In recent years, reinforcement learning has developed dramatically and is widely used to solve control problems, e.g., playing games. However, there are still some problems when reinforcement learning is applied to robotic control tasks. Fortunately, kernel trick-based methods provide a chance to deal with those challenges. This work aims at developing a kernel trick-based learning control method to carry out robotic tracking control tasks. A reward system is presented in order to speed up the learning process. Then, a kernel trick-based reinforcement learning tracking controller is presented to perform tracking control tasks on a robotic manipulator system. To evaluate the policy and assist the reward system in accelerating the search for the optimal control policy, a critic system is introduced. Finally, compared with the benchmark, the simulation results illustrate that our algorithm has a faster convergence rate and can execute tracking control tasks effectively, and that the reward function and the critic system proposed in this work are efficient.

I. INTRODUCTION

With the development of machine learning and robotic techniques, robotic manipulators play more and more important roles in our lives and in industrial manufacturing, e.g., medical assistance [1] and artifact assembling [2]. Consequently, more and more researchers have paid attention to robotic control, especially robotic tracking control. The goal of robotic tracking control is to enable the robot to track a target trace with good tracking performance [3], [4]. However, because of the inherent nonlinearities, parametric uncertainties and high coupling in the dynamic model, it is a challenge to design tracking controllers for robotic manipulators [5]. To make the robot track the desired target as closely as possible, many advanced tracking control algorithms have been proposed [6], [7], such as sliding-mode control [8], [9], exponential nonlinear tracking control [10], continuous Jacobian transpose robust control [11] and adaptive synchronised tracking control [12].

In the last five years, reinforcement learning (RL) techniques have developed dramatically and are widely utilized in many fields. Accordingly, various RL-based algorithms have been proposed to perform robotic control tasks. For controlling a robotic manipulator with nonlinearity and unknown physical parameters, [13] proposes a neural network model reinforcement learning method. In [14], interval type-2 fuzzy logic control and actor-critic RL algorithms with first-order digital low-pass filters are used to overcome the difficulties caused by the nonlinearities and uncertainties of the system and the working environment in accurate trajectory tracking control.
Without knowing the dynamic model of the system, an adaptive optimal controller is proposed in [15] on the basis of RL techniques to learn optimal H∞ tracking control for nonlinear systems. An iterative adaptive dynamic programming technique is used to build a stable iterative Q-learning algorithm in order to achieve nonlinear neuro-optimal tracking control in [16]. [17] presents a partial RL neural network-based adaptive tracking control method to address the coupled control problem of a wheeled mobile robot.

However, there are some challenges in using RL methods to perform continuous control tasks, for example, the curse of dimensionality. In [18], the controller has to be trained offline in order to avoid the curse of dimensionality. Fortunately, kernel trick-based methods provide a way to deal with this kind of problem. Kernel trick-based methods can not only learn an optimal control policy from historical data without a model of the environment, but can also approximate the dynamic model through a function approximator. Therefore, they are promising for avoiding the so-called curse of dimensionality [19].

In this study, a kernel trick-based RL method is adopted. First, a reward function is proposed to optimize the control policy. Then, a kernel trick-based RL tracking controller is described to perform tracking control tasks on a robotic manipulator. Finally, a critic system is used to evaluate the found control policy and to assist the reward function in speeding up the learning process. The simulation results show that our method has a faster convergence rate and can achieve better tracking performance than the benchmark approach, which indicates that the reward function and the critic system can cooperate with the kernel trick-based RL tracking controller effectively.

This work is organized as follows. In Section II, the dynamic description of an n-DOF robotic manipulator and the control objective are presented. Section III briefly introduces RL, model-based RL and kernel-based RL. The reward system, the kernel trick-based RL tracking control algorithm and the critic system are described in Section IV. Section V presents the simulation results. Finally, this work is concluded in Section VI.

II. SYSTEM DESCRIPTION

A. Dynamic Model

The dynamic equation of an n-DOF rigid robotic manipulator can be described by

$$A(\theta)\ddot{\theta} + b(\theta, \dot{\theta})\dot{\theta} + g(\theta) = \tau, \qquad (1)$$

where $\theta$, $\dot{\theta}$ and $\ddot{\theta} \in \mathbb{R}^n$ are the angular position, angular velocity and angular acceleration of the robotic manipulator, respectively. $A(\theta) \in \mathbb{R}^{n \times n}$ represents the symmetric and positive definite manipulator inertia matrix, whose inverse $A^{-1}(\theta)$ exists [20], $b(\theta, \dot{\theta}) \in \mathbb{R}^{n \times n}$ indicates the centrifugal and Coriolis force matrix, $g(\theta) \in \mathbb{R}^n$ contains the gravitational force, and $\tau \in \mathbb{R}^n$ denotes the control inputs applied to the joints, e.g., torques and/or forces.

Define $x_1 = \theta$, $x_2 = \dot{\theta}$, $x = (x_1, x_2)^T$ and $y = \theta$; then the state-space expression of equation (1) can be described by

$$\dot{x} = f(x) + h(x)u, \qquad y = x_1, \qquad (2)$$

where $f(x) = \begin{pmatrix} x_2 \\ -A^{-1}(x_1)\big(b(x_1, x_2)x_2 + g(x_1)\big) \end{pmatrix}$, $h(x) = \begin{pmatrix} 0 \\ A^{-1}(x_1) \end{pmatrix}$ and $u = \tau$.

B. Control Objective

The objective of this work is to propose a kernel trick-based RL tracking controller $u$ for system (1) such that the joint positions of the manipulator, $y = x_1 = [\theta_1, \theta_2, \ldots, \theta_n]^T$, follow a target trajectory $y_d = [\theta_{d1}, \theta_{d2}, \ldots, \theta_{dn}]^T$ as closely as possible.
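As a concrete illustration of the state-space form in equation (2), the following minimal Python sketch (not part of the paper) integrates the manipulator state one Euler step forward; the callables A_fn, b_fn and g_fn stand for the inertia, centrifugal/Coriolis and gravity terms of a specific manipulator and are assumptions used only for illustration.

import numpy as np

def manipulator_step(x, u, A_fn, b_fn, g_fn, dt=1e-3):
    """One forward-Euler step of the state-space model (2): x_dot = f(x) + h(x) u.

    x  : stacked state (theta, theta_dot), shape (2n,)
    u  : joint torques tau, shape (n,)
    A_fn(theta), b_fn(theta, theta_dot), g_fn(theta) : model terms from equation (1)
    """
    n = u.shape[0]
    theta, theta_dot = x[:n], x[n:]
    A_inv = np.linalg.inv(A_fn(theta))  # A^{-1}(x1) is assumed to exist [20]
    # f(x) = [x2; -A^{-1}(x1)(b(x1, x2) x2 + g(x1))], h(x) u = [0; A^{-1}(x1) u]
    f = np.concatenate([theta_dot,
                        -A_inv @ (b_fn(theta, theta_dot) @ theta_dot + g_fn(theta))])
    hu = np.concatenate([np.zeros(n), A_inv @ u])
    return x + dt * (f + hu)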
III. PRELIMINARIES

In this section, RL, model-based RL and kernel trick-based RL are reviewed, respectively.

A. Reinforcement Learning

An RL system can always be defined as an MDP [21], which is described by a tuple $(S, A, P, R, \gamma)$, where $S$ represents the state space, $A$ denotes the action space, $P$ describes a transition dynamics model, $R$ is the reward and $\gamma \in [0, 1]$ is the discount factor for future rewards. For an RL system, if the state $s \in S$, the action $a \in A$ and the rewards $R(s, a)$ are all given, the transition model $P(s, a, s')$ is produced, which describes the transition probability from state $s$ to the next state $s'$ when taking the action $a$.

If an RL task is given, the state value function can be obtained by $V(s) = R(s, a) + \gamma V(s')$, where $R(s, a)$ denotes the current reward, $\gamma$ indicates the discount factor and $V(s')$ represents the value of the next state. Similarly, the action-state value function is expressed as $Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a')$, where $P(s, a, s')$ describes the transition probability. Such functions, i.e., the state value function and/or the action-state value function, are a kind of Bellman optimality equation. Usually, the solution is found by Dynamic Programming (DP) or Approximate Dynamic Programming (ADP) through online iteration [22]. Therefore, the optimal control policy can be found by $\pi^*(s) = \arg\max_a Q(s, a)$.

B. Model-Based Reinforcement Learning

In model-based RL methods, it is necessary to obtain a transition dynamics model of the environment, which is used to obtain the corresponding reward and the optimal policy [23]. Therefore, compared with model-free methods, the primary work of a model-based method is to learn the transition model from the samples [24]. Although model-based RL can learn effectively from the transition model with fewer samples, because of model bias it is not widely utilized in robotic control tasks [25]. Thereby, new model approximation methods need to be proposed.

C. Kernel-Based Reinforcement Learning

Kernel-based RL (KBRL) is a method which can approximate continuous MDPs in the form of non-parametric value functions directly from historical data [26]. There are three steps to perform KBRL for MDP problems: first, build a finite MDP approximation from the samples; second, solve the finite approximation; last, map the solutions back to the initial state space. Specifically, KBRL approximates the outcome of an action $a$ from a given state $s$ as the average of previous outcomes of that action, weighted by a function of the distances between the reference state $s$ and the sampled states $s_i$ [27], [28].

IV. PROPOSED CONTROL DESIGN

In this study, a kernel trick-based RL tracking controller is proposed to learn the control policy step by step. As shown in Fig. 1, the architecture of the proposed control design is composed of four parts: a robotic manipulator, a reward system, a critic system and a kernel trick-based RL tracking controller. Through the interaction of these four parts, the optimal control policy is learned gradually.

Fig. 1. Architecture of the kernel trick-based reinforcement learning tracking control algorithm. In this figure, $u$ denotes the control policy, $y$ indicates the output trajectory of the robotic manipulator, while $y_d$ is the target trajectory. There are three inputs $u$, $y$ and $y_d$ to the reward system, while the output of the reward system is the optimal action value function $Q^*(s_i, a_i)$. $Q(s_i, a_i)$ and $Q^*(s_i, a_i)$ are the inputs of the critic system, which is utilized to adjust the parameters of the kernel trick-based RL controller through equation (16) and equation (17).
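To make the interaction in Fig. 1 concrete, the following Python sketch shows one possible shape of a training episode; the objects manipulator, reward_system, critic and controller and their method names are assumptions used only for illustration, not interfaces defined in the paper.

def run_episode(manipulator, reward_system, critic, controller, y_d, n_steps):
    """One training episode following the loop of Fig. 1 (assumed interfaces).

    controller.act(y, y_d_t)         -> (u, Q)   control input and its action value
    manipulator.step(u)              -> y        joint positions after applying u
    reward_system.evaluate(u, y, yd) -> Q_star   optimal action-value estimate
    critic.update(Q, Q_star)                     tunes the controller's parameters
    """
    y = manipulator.reset()                           # initial joint positions (assumed method)
    for t in range(n_steps):
        u, Q = controller.act(y, y_d[t])              # kernel trick-based RL tracking controller
        y = manipulator.step(u)                       # robotic manipulator responds with output y
        Q_star = reward_system.evaluate(u, y, y_d[t]) # reward system output (Fig. 1)
        critic.update(Q, Q_star)                      # critic compares Q and Q* and adjusts parameters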
A. Discounted Reward System

In RL, the reward always plays an important role because it offers evaluative feedback for the RL system to learn the optimal policy [29]. Therefore, it is important to design a proper reward. In this work, we propose a discounted reward method for the RL system. According to the features of tracking control for the robotic manipulator, an immediate reward at time step $t$ can be described by

$$r_t = e^T(t) P e(t) + u^T(t) Q u(t), \qquad (3)$$

where $e(t) = y(t) - y_d(t)$ indicates the tracking errors, $y(t)$ the output trajectory of the robotic manipulator, $y_d(t)$ the desired trajectory, and $u(t)$ the control policy. $P$ and $Q$ are both positive definite matrices. In equation (3), $e^T(t) P e(t)$ describes the effect of the tracking errors on the reward, while $u^T(t) Q u(t)$ shows the influence of the control input policy on the reward.

In RL, it is necessary to maximize the expected discounted rewards, defined by some specific function of the discounted reward sequence [21]. However, in our work, we seek to minimize the expected discounted return in order to find the control policy quickly. Suppose that $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$ represents the sequence of rewards received from the environment after time step $t$. The expected discounted return can be described by the sum of the weighted immediate rewards:

$$J_t = \sum_{k=1}^{N} \gamma^{k-1} r_{t+k} \qquad (4)$$
$$= \sum_{k=1}^{N} \gamma^{k-1} \big[ e^T(t+k) P e(t+k) + u^T(t+k) Q u(t+k) \big], \qquad (5)$$

where $N$ is the maximum time step of each episode and $\gamma$ ($0 < \gamma \leq 1$) indicates the discount factor. Equation (5) is adopted as the reward function to evaluate the feedback from the robotic manipulator.

B. Kernel Trick-based RL Tracking Controller

Suppose that a continuous RL model for a robotic manipulator can be described by a continuous MDP $D = (S, A, P, r, \gamma)$, where $S$ indicates the state, $A$ denotes the action, $P$ expresses the dynamics model, the corresponding expected rewards are described by $r$, and $\gamma$ represents the discount factor. In this work, a new state set $S^a$ is defined as $S^a = \{(s_i^a, r_i^a, \hat{s}_i^a)\}$, $i = 1, 2, \ldots, n$, which is a set of sampled transitions resulting from action $a \in A$, where $\hat{s}_i^a \in S$ is the next state of $s_i^a \in S$ after taking action $a$, and $r_i^a \in r$ is the corresponding reward.

Suppose $\kappa(s, s')$ represents a kernel function, described as follows:

$$\kappa(s, s') = \phi\left(\frac{\| s - s' \|}{\sigma}\right), \qquad (6)$$

where $s$ and $s'$ are a sampled point and a center point for sampling, respectively, $\sigma \in \mathbb{R}$ represents a bandwidth, and $\phi(x)$ is a Lipschitz continuous function. The kernel function defined in equation (6) can be normalized as below:

$$K(s, s_i) = \frac{\kappa(s, s_i)}{\sum_j \kappa(s, s_j)}. \qquad (7)$$

In this study, the state $s_i^a$ is chosen as the sampling center point, and it can be obtained that $K(s, s_i^a) \in [0, 1]$. Therefore, $K(s, s_i^a)$ can be selected as the dynamics model of the constructed MDP, represented as

$$\hat{P}(s \mid s_i^a) = \begin{cases} K(\hat{s}_i^a, s_i^a), & \text{if } s = \hat{s}_i^a \text{ and } a' = a, \\ \big(1 - K(\hat{s}_i^a, s_i^a)\big) \dfrac{K(s, s_i^a)}{\sum_{s} K(s, s_i^a)}, & \text{otherwise}, \end{cases} \qquad (8)$$

where $a' \in A$ is an action related to $s \in S$, and $a$ denotes the action taking state $s_i^a$ to $\hat{s}_i^a$. Similarly, the immediate rewards of the constructed MDP are

$$\hat{r}(s, s_i^a) = \begin{cases} r_i^a, & \text{if } s = \hat{s}_i^a \text{ and } a' = a, \\ \rho\, r_i^a, & \text{otherwise}, \end{cases} \qquad (9)$$

where $a$ and $a'$ are defined in the same way as in equation (8), $\rho$ is a large constant, and $r_i^a = e^T(i) P e(i) + u^T(i) Q u(i)$.

Based on the discussion above, a new tuple is built as $\hat{D} = (S, A, \hat{P}, \hat{r}, \gamma)$, and this new tuple $\hat{D}$ can be utilized to calculate the optimal value function via dynamic programming or approximate dynamic programming.
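Equations (6) and (7) can be sketched in Python as follows. The Gaussian choice for $\phi$ and the bandwidth value are assumptions for illustration only; the paper only requires $\phi$ to be Lipschitz continuous.

import numpy as np

def normalized_kernel_weights(s, centers, sigma=0.5):
    """Normalized kernel K(s, s_i) of equation (7), built from equation (6).

    s       : query state, shape (d,)
    centers : sampled center points s_i, shape (m, d)
    sigma   : bandwidth (assumed value)
    """
    dists = np.linalg.norm(centers - s, axis=1)  # ||s - s_i||
    kappa = np.exp(-(dists / sigma) ** 2)        # kappa(s, s_i) = phi(||s - s_i|| / sigma), Gaussian phi
    return kappa / kappa.sum()                   # weights lie in [0, 1] and sum to 1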
The total reward can be defined as

$$J_i = \sum_{k=1}^{N} \gamma^{k-1} \hat{r}(s, s_{i+k}^a). \qquad (10)$$

Therefore, the value function can be represented as below for all $s_i^a$ ($s_i^a \in S$):

$$V_\pi(s_i) = E_\pi\big[ J_i \mid s_i = s_i^a \big], \qquad (11)$$

where $E_\pi$ denotes the expectation of a random variable. According to RL theory [21], it can be obtained that

$$V_\pi(s_i) = \sum_a \pi(a \mid s_i^a) \sum_{s} \hat{P}(s \mid s_i^a) \big[ \hat{r}(s, s_{i+1}^a) + \gamma V_\pi(s) \big], \qquad (12)$$

where $\pi(a \mid s_i^a)$ indicates the policy. If the agent is following policy $\pi$ at time $i$, $\pi(a \mid s_i^a)$ represents the probability that $a_i = a$ if $s_i = s_i^a$ [21]. The probability $\hat{P}(s \mid s_i^a)$ is the dynamics of the finite MDP described above. Equivalently, the action-value function can be given as

$$Q_\pi(s_i, a_i) = \sum_{s} \hat{P}(s \mid s_i^a) \big[ \hat{r}(s, s_{i+1}^a) + \gamma Q_\pi(s_{i+1}, a_{i+1}) \big]. \qquad (13)$$

Meanwhile, the optimal action-value function can be defined by

$$Q^*(s_i, a_i) = \sum_{s} \hat{P}(s \mid s_i^a) \Big[ \hat{r}(s, s_{i+1}^a) + \gamma \min_{a_{i+1}} Q^*(s_{i+1}, a_{i+1}) \Big]. \qquad (14)$$

Equation (14) is a Bellman optimality equation. According to the Bellman optimality principle [27], there is a solution for equation (14). As a consequence, the reward function proposed here is solvable, and the tracking control problem becomes the solution of an RL problem. Thus, for RL, the value function estimate or action-value function estimate is uniquely defined as the solution of the approximate Bellman equation [30], and Approximate Dynamic Programming (ADP) is the method to solve this equation. In other words, the problem of finding the optimal control policy becomes the solution of an RL problem by ADP or related methods. Therefore, there exists a solution for the model described above. As for the optimal control policy, it can be found by

$$u^* = \arg\min_{a_i} \sum_{s} \hat{P}(s \mid s_i^a) \Big[ \hat{r}(s, s_{i+1}^a) + \gamma \min_{a_{i+1}} Q^*(s_{i+1}, a_{i+1}) \Big]. \qquad (15)$$

C. Critic System Design

To find the optimal policy quickly, a critic system is designed to evaluate whether the current value function is good or bad. Following Q-learning [21], we form a critic system to evaluate the value function so that the proposed RL tracking control method can derive the optimal control policy as soon as possible. This critic system includes a cost function, which can be described as equation (16), where $Q(s_i, a_i)$ is the action value of the robotic manipulator at the last time step and $\sum_{s} \hat{P}(s \mid s_i^a)\big[\hat{r}(s, s_{i+1}^a) + \gamma \min_{a_{i+1}} Q^*(s_{i+1}, a_{i+1})\big]$ is the action value at the current time. When $L \leq \epsilon$ (here, $\epsilon$ is a small threshold), the whole system is considered to have become stable, and none of the parameters will be changed. However, if $L > \epsilon$, the value function is tuned by equation (17) until all of the parameters gradually become invariant. In equation (17), $\alpha$ is a learning rate.

V. SIMULATION

In this section, our control method is implemented on a two-link (2-DOF) robotic manipulator system as displayed in Fig. 2. In reinforcement learning control for robotics, long training times are inevitable, and the safety of the robot and the surrounding environment should also be taken into consideration; therefore, simulation experiments are widely used instead of experiments on a real robot. In this work, simulation is our first choice for testing the proposed algorithm.

According to Section II, the details of the inertia matrix $A(\theta)$, the centrifugal and Coriolis force matrix $b(\theta, \dot{\theta})$ and the gravitational force effect $g(\theta)$ can be obtained by

$$A(\theta) = \begin{bmatrix} A_{11}(\theta) & A_{12}(\theta) \\ A_{21}(\theta) & A_{22}(\theta) \end{bmatrix}, \quad b(\theta, \dot{\theta}) = \begin{bmatrix} b_{11}(\theta, \dot{\theta}) & b_{12}(\theta, \dot{\theta}) \\ b_{21}(\theta, \dot{\theta}) & b_{22}(\theta, \dot{\theta}) \end{bmatrix}, \quad g(\theta) = \begin{bmatrix} g_{11}(\theta) \\ g_{21}(\theta) \end{bmatrix}$$
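Equations (16) and (17) are not reproduced above; the following Python sketch only mirrors the behaviour described in Section IV-C, with a squared-error cost and a learning-rate correction as assumed forms, and alpha and eps as placeholders for the learning rate and the small threshold.

def critic_step(Q_prev, Q_target, alpha=0.05, eps=1e-3):
    """Critic check sketched from the description in Section IV-C (assumed forms).

    Q_prev   : action value at the last time step, Q(s_i, a_i)
    Q_target : current action value, sum_s P(s|s_i)[r(s, s_{i+1}) + gamma * min_a' Q*(s', a')]
    """
    L = (Q_target - Q_prev) ** 2                    # assumed cost: squared discrepancy between the two values
    if L <= eps:                                    # system considered stable; parameters left unchanged
        return Q_prev, L
    Q_new = Q_prev + alpha * (Q_target - Q_prev)    # assumed tuning toward the current action value
    return Q_new, L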