




End-to-end sensorimotor control problems of AUVs with deep reinforcement learning

Hui Wu1, Shiji Song1, Yachu Hsu1, Keyou You1, Cheng Wu1

Abstract: This paper studies sensorimotor control problems of Autonomous Underwater Vehicles (AUVs) using deep reinforcement learning. We design an end-to-end learning architecture that maps raw sensor input to continuous control output without referring to the dynamics of the vehicles. To avoid difficult and noisy underwater localization, we implement the learning without knowing the positions of the AUVs by proposing novel state encoding and reward shaping strategies. Two distinct underwater tasks, obstacle avoidance with a sonar sensor and pipeline following with a visual sensor, are simulated to validate the effectiveness of the proposed architecture and strategies. For the latter, we test the learned policy on realistic images of underwater pipelines to check its generalization ability.

I. INTRODUCTION

Recently, with the development of Deep Reinforcement Learning (DRL), more and more Artificial Intelligence (AI) problems with high-dimensional perception input can be solved via an end-to-end architecture that maps the perception directly to the control action without referring to the dynamics of the agent [1], [2]. In this paper we consider applying DRL to solve sensorimotor control problems of Autonomous Underwater Vehicles (AUVs), which require AUVs to execute underwater tasks relying on equipped sensors. The reason for choosing DRL is that it is hard to obtain the exact hydrodynamics of an AUV due to the complicated underwater environment. Moreover, most control commands of AUVs need to rely on equipped sensor devices because of difficult and noisy underwater localization. The model-free property and perception ability of DRL make it well qualified for such control problems.

We propose an end-to-end RL architecture for the sensorimotor control problems of AUVs by means of Proximal Policy Optimization (PPO) [3], the state of the art for continuous control in robotics. To validate the generality of the proposed architecture, we choose two distinct sensorimotor tasks with different types of sensor inputs. In addition, we assume the AUV cannot access its position, which eliminates the expense of underwater localization but makes the tasks much more difficult. Instead, we propose novel state encoding and reward shaping strategies to remedy the missing location information.

The remainder of this paper is organized as follows. Section II enumerates related work on RL applications in robotic control. Section III gives a general presentation of sensorimotor control problems of AUVs and the proposed end-to-end architecture. Section IV illustrates the deep reinforcement learning algorithm and the novel strategies for the two sensorimotor control problems of AUVs in detail. Section V describes the platforms, settings, performances and analysis of the experiments for the two tasks, while the conclusion and outlook are presented in Section VI.

(This work was supported in part by the National Science Foundation of China under Grants 41427806 and 41576101, and in part by the National Key Research and Development Program of China under Grant 2016YFC0300801. 1The authors are with the Department of Automation, Tsinghua University, Beijing, P. R. China; wuhui14.)

II. RELATED WORK

DRL has many remarkable applications in control problems with high-dimensional sensor input, such as games and visuomotor robotic control. Using an asynchronous actor-critic, Perot and Jaritz et al. trained a CNN-LSTM policy network mapping visual input to discrete control commands
(steering, brake, etc.) and realized self-driving in a realistic car racing game [4], [5]. For control problems with continuous actions, Lillicrap et al. proposed the Deep Deterministic Policy Gradient (DDPG) algorithm to train a CNN with experience replay and applied it to simulated multi-joint dynamics and car racing games [2]. As on-policy RL algorithms, Trust Region Policy Optimization (TRPO) and PPO were proposed to reduce variance by constraining the policy update, and were validated on low-dimensional robotic control [3], [6].

For the control problems of AUVs, most research has focused on model-based methods such as backstepping [7], [8], sliding mode [9], [10] and model predictive control [11], [12], which require the exact dynamics of an AUV, hard to obtain in realistic underwater applications. As a model-free method, RL has recently been applied to control problems of AUVs. Hu et al. modeled the plume tracking problem of AUVs as a partially observable Markov decision process and learned a strategy based on long short-term memory based RL [13]. Similar to our second task, Liu et al. learned a pixel-to-action policy using PPO for the pipeline following task of AUVs [14]. However, that work does not consider the difficulty of costly and limited localization in the underwater environment.

III. PROBLEM FORMULATION

A. 6-DOF motions of AUVs

In three-dimensional space, the motions of an AUV are described by six Degrees Of Freedom (DOF) kinematic variables in a moving coordinate frame whose origin is fixed on the AUV body: surge u, sway v and heave w refer to the longitudinal, sideways and vertical displacements, while roll p, pitch q and yaw r refer to the rotations around the corresponding axes. We denote the linear and angular velocities in the local coordinate frame as a vector \nu = [u, v, w, p, q, r]^T. To locate the vehicle, an earth-fixed coordinate frame is introduced to determine the relative position p = [x, y, z]^T and orientation o = [\phi, \theta, \psi]^T of the AUV. The six DOFs and the two coordinate frames are illustrated in Fig. 1.

Fig. 1: The six-DOF coordinate frames of the AUV motions.

The kinetics and hydrodynamics of the AUV are described by the following second-order model [15]:

\dot{\eta} = J(\eta)\nu, \qquad M\dot{\nu} + C(\nu)\nu + D(\nu)\nu + g(\eta) = \tau,   (1)

where \eta = [x, y, z, \phi, \theta, \psi]^T denotes the merged vector of p and o, M denotes the total inertia matrix, C(\nu) the Coriolis and centripetal force matrix, D(\nu) the hydrodynamic damping matrix, g(\eta) the restoring forces and moments, and \tau is the vector of forces and torques from the equipped propellers.
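The learned policy never uses model (1) for control, but the simulation environment still has to integrate it to generate the next state. The snippet below is a minimal sketch, assuming the simulator supplies the inertia matrix M and callables C, D, g, J of model (1); the function name, argument layout and the forward-Euler scheme are illustrative choices, not details taken from the paper.

```python
import numpy as np

def step_dynamics(eta, nu, tau, dt, M, C, D, g, J):
    """One forward-Euler step of the 6-DOF model (1).

    eta : earth-fixed pose [x, y, z, phi, theta, psi]
    nu  : body-fixed velocities [u, v, w, p, q, r]
    tau : propeller forces/torques (the RL action)
    M   : (6, 6) total inertia matrix
    C, D, g, J : callables returning C(nu), D(nu), g(eta), J(eta)
    """
    # body-frame acceleration from M nu_dot = tau - C(nu) nu - D(nu) nu - g(eta)
    nu_dot = np.linalg.solve(M, tau - C(nu) @ nu - D(nu) @ nu - g(eta))
    nu_next = nu + dt * nu_dot
    # kinematics: eta_dot = J(eta) nu
    eta_next = eta + dt * (J(eta) @ nu)
    return eta_next, nu_next
```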
B. End-to-end sensorimotor control problems of AUVs

In a sensorimotor control problem, the AUV needs to execute a task relying on sensor signals collected by equipped sensor devices such as sonars or cameras. Most conventional solutions adopt a two-step planning-control framework, which converts the sensor signals into a planning objective and then implements the objective with a model-based controller. However, multiple kinds of errors may occur in these two steps of an underwater task and fail the whole paradigm. Model bias is one of the most common errors of model-based underwater control because of the nonlinear dynamics and time-variant hydrodynamic coefficients in (1). In addition, underwater localization usually introduces large errors and high expense because of the limited propagation distance of signals in the underwater environment, which may deteriorate the control precision. Therefore, we design an end-to-end learning architecture based on PPO to learn a sensorimotor policy function that maps sensor signals to continuous control outputs directly, without referring to the dynamics of the AUV, as illustrated in Fig. 2.

We divide the state of the AUV into s_sensor and s_motion, representing the sensor signals and the motion variables (the orientation vector o and the velocity vector \nu) respectively, and embed them through different encoder networks. To establish the PPO objective of Section IV, we construct two function estimators, the policy network \pi and the value network V, which share the state encoder networks to reduce the model complexity. The errors of the policy and value networks are mixed together into the PPO objective and backpropagated to train the weights of the whole network.

Fig. 2: The end-to-end RL architecture for sensorimotor control problems of AUVs. The state is mapped to the policy and value functions via a multilayer neural network, which can be divided into a shared part, a policy net and a value net. The shared part separately encodes the motion and sensor parts of the state and concatenates the two outputs. The PPO objective is constructed from the reward given by the environment and the outputs of the networks, and is backpropagated to train the network.

To validate the proposed learning architecture, we select two distinct sensorimotor control tasks with different types of sensor signals. To avoid noisy and costly underwater localization, we assume AUVs cannot access their positions in both tasks, and we propose novel state encoding and reward shaping strategies as a replacement in the next section, which have significant impacts on the performance of the RL algorithm.

IV. METHODS AND STRATEGIES

A. Proximal Policy Optimization

RL is a solving framework for Markov Decision Processes (MDPs), in which an agent interacts with an environment: in a state s_t at discrete time step t, it executes an action a_t and transits to a subsequent state s_{t+1} while obtaining a reward signal r_{t+1} from the environment. The objective of an MDP is commonly to maximize an expected total discounted reward J = E[\sum_{t=0}^{T} \gamma^t r_{t+1}] with respect to a policy function \pi(a|s), a conditional distribution over the action space given a state, where T is the finite or infinite horizon and \gamma is a discount factor that exponentially decays the weights on future rewards.

The actions for control problems of AUVs are usually vectors of forces and torques generated by propellers. For such a continuous action space, we adopt PPO to train the policy network. PPO is a policy gradient algorithm: it defines a parameterized policy function \pi_\theta(a|s) and updates it along the gradient \nabla_\theta J in the parameter space of \theta via a Stochastic Gradient Ascent (SGA) algorithm. To address the high variance and step-size selection problems of SGA, PPO proposes a clipped surrogate objective [3]:

L^{CLIP}(\theta) = \hat{E}_t[\min(w_t(\theta)\hat{A}_t,\ \mathrm{clip}(w_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)],   (2)

where \hat{E}_t denotes the average over a batch of sampled transitions, \hat{A}_t denotes the estimator of the advantage function, and the function clip(x, a, b) limits the value of x to the interval [a, b]. The weight w_t(\theta) is the probability ratio between the new and old policy functions, given by

w_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},   (3)

which measures the change between the policies after and before one update. The first term in the minimization is a surrogate objective which is a lower bound of J if a Kullback-Leibler divergence penalty is subtracted, as proved in TRPO [6]. The second term clips the probability ratio into the interval [1-\epsilon, 1+\epsilon] to remove the possible deterioration of the objective when a large policy update happens. Taking the minimum over these two terms ensures that the worse one is optimized.

Besides L^{CLIP}, the total objective of PPO also includes a value function error term, minimized for the estimator \hat{A}_t, and an entropy bonus to encourage sufficient exploration:

L^{PPO}(\theta) = L^{CLIP}(\theta) - c_1\,\hat{E}_t[(V_\theta(s_t) - V^{targ}_t)^2] + c_2\,\hat{E}_t[H(\pi_\theta(\cdot|s_t))],   (4)

where V^{targ}_t is the supervision of the value function in state s_t, obtained via the cumulative return or Generalized Advantage Estimation (GAE) [16], and H denotes the entropy of a distribution.
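To make objective (4) concrete, the following PyTorch-style sketch evaluates the probability ratio (3), the clipped surrogate (2), the value error term and the entropy bonus for one batch of sampled transitions. It is a minimal illustration rather than the authors' implementation; the names old_log_probs and v_targ and the default coefficients are assumptions.

```python
import torch

def ppo_loss(policy_dist, values, actions, old_log_probs, advantages, v_targ,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Negative of the PPO objective (4) for one batch of transitions.

    policy_dist   : torch.distributions.Normal from the policy net (diagonal Gaussian)
    values        : V_theta(s_t) from the value net
    old_log_probs : log pi_theta_old(a_t | s_t) recorded at sampling time
    advantages    : advantage estimates A_t (e.g. from GAE)
    v_targ        : value targets V_t^targ (cumulative return or GAE-based)
    """
    log_probs = policy_dist.log_prob(actions).sum(-1)          # log pi_theta(a_t | s_t)
    w = torch.exp(log_probs - old_log_probs)                    # probability ratio (3)
    clipped = torch.clamp(w, 1.0 - clip_eps, 1.0 + clip_eps)
    l_clip = torch.min(w * advantages, clipped * advantages).mean()   # clipped surrogate (2)

    value_loss = (values - v_targ).pow(2).mean()                 # value function error term
    entropy = policy_dist.entropy().sum(-1).mean()               # exploration bonus

    # L_PPO is maximized, so the training loss is its negative
    return -(l_clip - c1 * value_loss + c2 * entropy)
```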
B. Obstacle Avoidance Task

This task aims at learning a continuous controller that steers an AUV to track a series of waypoints without crashing in a maze-like scene filled with obstacles such as rocks and pipelines. For simplicity, we only examine the motions of the AUV on the x-y plane and ignore the DOFs out of the plane, as shown in Fig. 3(a). The positions of the vehicle and of the waypoints are assumed unknown in this task. In addition, we assume the AUV receives a signal from the environment once it arrives in the area near some waypoint, where the signal comes from sensor devices (camera, chemical concentration meter, thermometer, etc.), and we give it a positive bonus as a reward. To avoid the obstacles, the AUV obtains a partial sensor signal from sonar-like devices fixed at its front and on both anterolateral sides, each of which measures the distance from the vehicle to the nearest obstacle along a certain direction, as shown in Fig. 3(b). In this task, the sensor part of the state consists of the five distance readings, s_sensor = [d_1, d_2, ..., d_5]^T.

Fig. 3: (a) The scene of the obstacle avoidance task; (b) the distribution and signals of the sonars equipped on the AUV.

1) One-hot target encoding: The involved motion variables of the AUV on the x-y plane are [x, y, \psi, u, v, r]^T. To avoid underwater localization, we replace the position variables (x, y) with a one-hot encoding of the waypoint indexes, which helps the AUV distinguish which waypoint is currently being tracked. For example, if the task gives five waypoints, the first waypoint is encoded as [1, 0, 0, 0, 0]^T, the second as [0, 1, 0, 0, 0]^T, and so on for each of the other waypoints. In spite of increasing the dimension of the state and the number of trained weights in the networks, this strategy eliminates the expense and noise of underwater localization and enables the learned policy to track multiple waypoints by changing the one-hot target encoding in the state.

2) Multi-objective reward shaping: Due to the hazardous underwater environment, any collision may cause serious and irretrievable damage to the AUV body. Therefore we design a multi-objective reward function by synthesizing waypoint tracking, obstacle avoidance, collision punishment and velocity constraint terms, given by

r = \begin{cases} r_{collision}, & \text{if colliding} \\ r_{alive} + r_{vel} + r_{arrival} + r_{impatience}, & \text{otherwise}, \end{cases}   (5)

where the terms are summarized as follows (a minimal sketch combining them is given after the list):

- r_collision: a heavy punishment if any distance reading is lower than a given threshold, which indicates a possible collision.
- r_alive: the summation of the five distance readings, r_alive = \sum_{i=1}^{5} d_i, to encourage the AUV to explore broader areas of the scene.
- r_vel: a velocity constraint term on the surge, sway and yaw velocities, given by

  r_vel = -\lambda_1 (u - u_d)^2 - \lambda_2 v^2 - \lambda_3 r^2,   (6)

  where u_d is a constant target velocity and \lambda_1, \lambda_2, \lambda_3 denote the assigned weights of the three terms.
- r_arrival: a positive bonus if the AUV arrives in the surrounding area of some waypoint, and zero otherwise.
- r_impatience: a tiny punishment at each time step to encourage the AUV to track all the waypoints as soon as possible.
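The sketch below combines the one-hot target encoding with the reward terms of (5) and (6). The collision threshold, bonus magnitudes and weights are placeholder values for illustration, not the settings used in the experiments.

```python
import numpy as np

def one_hot_target(index, n_waypoints=5):
    """One-hot encoding of the currently tracked waypoint, replacing (x, y) in the state."""
    enc = np.zeros(n_waypoints)
    enc[index] = 1.0
    return enc

def shaped_reward(distances, u, v, r, arrived,
                  d_collision=0.5, u_d=1.0, lambdas=(1.0, 1.0, 1.0),
                  r_collision=-100.0, r_arrival=10.0, r_impatience=-0.05):
    """Multi-objective reward (5)-(6) for the obstacle avoidance task.

    distances : the five sonar readings d_1..d_5 (the sensor part of the state)
    u, v, r   : surge, sway and yaw velocities
    arrived   : True when the environment signals a reached waypoint
    """
    if min(distances) < d_collision:            # a possible collision
        return r_collision

    r_alive = float(np.sum(distances))           # reward staying in open areas
    l1, l2, l3 = lambdas
    r_vel = -l1 * (u - u_d) ** 2 - l2 * v ** 2 - l3 * r ** 2   # velocity constraint (6)
    r_arr = r_arrival if arrived else 0.0        # waypoint bonus
    return r_alive + r_vel + r_arr + r_impatience
```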
C. Pipeline Following Task

The pipeline following task aims at controlling an AUV to follow an undersea pipeline and inspect its status (such as pipeline leaks or missing anode blocks) with an equipped underwater camera, as shown in Fig. 4. For the convenience of inspection, the AUV should keep the pipeline in the view of its camera while simultaneously keeping its heading direction along the pipeline. We assume the underwater camera is the only sensor device with which the AUV can locate the relative position between itself and the pipeline. This is a hard task because the AUV can only infer the position and orientation of the pipeline from visual input. We will check whether a deep visuomotor policy realizing pipeline following can be learned via RL.

Fig. 4: The scene of the pipeline following task and the view of the underwater camera.

Fig. 5: The image processing pipeline of reward extraction: Gaussian blur, edge extraction, Hough transform, histogram of lines.

1) Reward Extraction: Unlike some visual tasks that provide an extrinsic reward, such as video games [1], the pipeline following problem requires the AUV to extract the reward from the view of the camera because of the missing location information, which is the most difficult part of this task. We propose a series of image processing procedures for the reward extraction, as shown in Fig. 5. The kernel is to extract the center line of the pipeline in the view image. First, we adopt a Gaussian blur to denoise the original image and extract edges using a Canny detector. Then a classic line detection algorithm, the Hough transform, is used to extract all straight lines in the image, each represented as a coordinate (\rho_H, \theta_H) in the so-called Hough space, as shown in Fig. 6(a). To determine the coordinate of the pipeline's center line in the Hough space, we divide the angle range into 36 bins and compute a histogram of all detected straight lines. As the center line is parallel to the pipeline's margin lines, which form the majority of the detected lines, we approximate \theta_H of the center line by the mean value of the \theta_H's of the lines assigned to the bin with the highest count in the histogram, and \rho_H of the center line by the middle of the \rho_H's of the farthest and nearest lines in that bin. After locating the center line, we design the following reward function:

r = u\cos(c_H) - \frac{d_c}{d_{max}},   (7)

where u is the surge velocity of the AUV, c_H is the \theta_H of the center line, d_c denotes the distance from the center point of the image to the closest point on the center line, and d_{max} is a normalization factor equal to half of the diagonal length of the image.

Fig. 6: (a) The Hough space consisting of \rho_H and \theta_H; (b) the distribution of thrusters on the AUV Girona.
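This reward extraction can be prototyped with standard OpenCV primitives, as sketched below. The blur kernel, Canny thresholds and Hough accumulator threshold are illustrative assumptions, as is the fallback reward of zero when no line is detected; they are not the settings used in the paper.

```python
import cv2
import numpy as np

def extract_center_line(image):
    """Approximate (rho, theta) of the pipeline's center line in the camera view."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)            # denoise
    edges = cv2.Canny(blurred, 50, 150)                     # edge extraction
    lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=80)
    if lines is None:
        return None
    rhos, thetas = lines[:, 0, 0], lines[:, 0, 1]

    # 36-bin histogram over the angle range; the majority bin holds the lines
    # parallel to the pipeline's margins
    hist, bin_edges = np.histogram(thetas, bins=36, range=(0.0, np.pi))
    b = int(np.argmax(hist))
    mask = (thetas >= bin_edges[b]) & (thetas <= bin_edges[b + 1])
    theta_c = float(thetas[mask].mean())                     # mean angle in the majority bin
    rho_c = 0.5 * (rhos[mask].max() + rhos[mask].min())      # middle of nearest/farthest lines
    return rho_c, theta_c

def following_reward(image, u):
    """Reward (7): forward progress along the pipeline minus the centering error."""
    line = extract_center_line(image)
    if line is None:
        return 0.0                                           # assumed fallback when no line is found
    rho_c, theta_c = line
    h, w = image.shape[:2]
    cx, cy = w / 2.0, h / 2.0
    # distance from the image center to the line x*cos(theta) + y*sin(theta) = rho
    d_c = abs(cx * np.cos(theta_c) + cy * np.sin(theta_c) - rho_c)
    d_max = 0.5 * np.hypot(h, w)                             # half of the image diagonal
    return u * np.cos(theta_c) - d_c / d_max
```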
2) Visuomotor policy architecture: In this task, the sensor part of the state is the view image from the underwater camera. Intuitively, we design a visuomotor policy architecture built around a Convolutional Neural Network (CNN), a common encoder for image data. As shown in Fig. 7, the chosen CNN is a four-layer architecture similar to the DQN network in [1]. To encode the memory contained in multiple states of a time sequence, we adopt a CNN-LSTM (Long Short-Term Memory) architecture similar to that of Jaritz et al. [5], but add the LSTM in a different position to reduce the model complexity. The outputs of the CNN and the LSTM are concatenated and fed into a fully connected layer, which is followed by separate value and policy networks. Due to the continuous action space, the output of the policy network is a Gaussian distribution parameterized by the mean and the covariance.

Fig. 7: The CNN-LSTM visuomotor policy network: camera input, three Conv-Pool-ReLU blocks, an FC-ReLU layer, an LSTM, an FC-ReLU layer and separate FC heads.
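A minimal PyTorch sketch of such a network is given below, mirroring the layer sequence of Fig. 7: three Conv-Pool-ReLU blocks, a fully connected layer, an LSTM whose output is concatenated with the CNN features, and separate policy and value heads. The channel counts, kernel sizes, hidden dimensions and the state-independent log standard deviation of the Gaussian are assumptions, since the paper only states that the CNN resembles the DQN encoder of [1].

```python
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    """CNN-LSTM policy/value network sketch for the pipeline following task."""

    def __init__(self, action_dim, img_channels=3, feat_dim=256, lstm_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # three Conv-Pool-ReLU blocks
            nn.Conv2d(img_channels, 32, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(32, 64, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(feat_dim)               # FC layer after the CNN
        self.lstm = nn.LSTM(feat_dim, lstm_dim, batch_first=True)
        self.merge = nn.Linear(feat_dim + lstm_dim, feat_dim)   # concatenation -> FC
        self.policy_mean = nn.Linear(feat_dim, action_dim)      # mean of the Gaussian policy
        self.policy_logstd = nn.Parameter(torch.zeros(action_dim))
        self.value = nn.Linear(feat_dim, 1)

    def forward(self, images, hidden=None):
        # images: (batch, time, channels, height, width)
        b, t = images.shape[:2]
        feats = torch.relu(self.fc(self.cnn(images.flatten(0, 1)))).reshape(b, t, -1)
        mem, hidden = self.lstm(feats, hidden)           # temporal memory over the sequence
        x = torch.relu(self.merge(torch.cat([feats, mem], dim=-1)))
        mean = self.policy_mean(x)
        std = self.policy_logstd.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std), self.value(x), hidden
```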