




Learning to Grasp Arbitrary Household Objects from a Single Demonstration

Elias De Coninck, Tim Verbelen, Pieter Van Molle, Pieter Simoens and Bart Dhoedt
IDLab, Department of Information Technology at Ghent University - imec
Email: firstname.lastname@ugent.be

Abstract: Upon the advent of Industry 4.0, collaborative robotics and intelligent automation gain more and more traction for enterprises to improve their production processes. In order to adapt to this trend, new programming, learning and collaborative techniques are investigated. Program by demonstration is one of the techniques that aim to reduce the burden of manually programming collaborative robots. However, this is often limited to teaching to grasp at a certain position, rather than grasping a certain object. In this paper we propose a method that learns to grasp an arbitrary object from visual input. While other learning-based approaches for robotic grasping require collecting a large dataset, manually or automatically labeled, in a real or simulated world, our approach requires a single demonstration. We present results on grasping various objects with the Franka Panda collaborative robot after capturing a single image from a wrist-mounted RGB camera. From this image we learn a robot controller with a convolutional neural network to adapt to changes in the object's position and rotation, with less than 5 minutes of training time on an NVIDIA Titan X GPU, achieving over 90% grasp success rate.

I. INTRODUCTION

The goal of Industry 4.0 is to lower production cost and increase efficiency, while maintaining the same level of robustness and accuracy, by combining sensor data collection and smart robotics. One of the approaches to increase efficiency is to reduce the burden of programming robotic automation. Increasingly popular hardware to address this are the so-called collaborative robots or cobots [1], which can be programmed by demonstration and perform their tasks without safety cages or other safety measures, lowering the integration cost. The lowered safety requirements are accomplished by including force-torque sensors that can detect external forces to avoid hard collisions with other surfaces. Examples of such cobots available on the market are the Kuka LBR series [2], the Universal Robots UR series [3] and the Franka Panda [4]. These cobots are used in a variety of applications, e.g. production line loading and unloading, product assembly and machine tending [5].

Programming cobots can be accomplished by demonstrations, where the operator manipulates the robot's end effector to a desired position and orientation. Later, this recorded demonstration can be repeated by executing a replay of the exact actions. In industry this feature is called learning from demonstration, while in machine learning research this term describes training a generalized policy from demonstrations [6]. Therefore, in this paper the term program by demonstration is used, as it better defines the record-and-replay feature of cobots.

A drawback of these demonstration approaches is that it limits the applicability to cases where positions and orientations are fixed relative to the robot. For example, in the case of pick and place tasks, the objects and targets need to be at the same position for each repetition. Small perturbations in position or rotation can affect the entire trajectory. One way to fix this problem is by attaching a vision sensor to the robot and using machine learning to detect the best position to grasp an object [7]. However, this typically requires a very long training time, most often supervised on manually labeled training examples.
In this paper we introduce a collaborative robotic automation workflow which can be plugged into the current program by demonstration framework of cobots. During a single demonstration of grasping an arbitrary object, we obtain visual information to train a convolutional neural network, which allows the robot to overcome perturbations in position and rotation of the object to grasp with a closed-loop controller.

The remainder of this paper is structured as follows. In the next section we discuss the related work in learning from demonstration and grasping objects using machine learning techniques. In Section III we propose our approach of single demonstration learning with a convolutional neural network. Section IV describes the hardware setup, test objects and the demonstration flow used for the results shown in Section V. Finally, we discuss the results in the conclusion.

II. RELATED WORK

Grasping objects with a robotic manipulator has been a long-standing challenge in research [8], [9]. In this section we give a brief overview of data-driven grasping approaches, which learn to grasp from camera images, depth scans or a combination of both. We refer to [7] for an in-depth survey on data-driven grasping techniques.

Grasping has long been studied in multiple research fields. Recently, results have been accomplished by training machine learning models [10]-[13] with one of the available labeled datasets with printable 3D objects [14], [15], 2D images [16] or real-life benchmark objects [17], [18]. Creating large-volume datasets is both challenging and time consuming: each object needs to be labeled with the position, orientation, width and height of the grasp location.

A second approach uses reinforcement learning (RL) to train models by trial and error [19], [20], for example the work presented by Levine et al., in which 14 robots collect data over 800,000 grasp attempts [21], or Pinto et al., in which a dataset of 50,000 data points is collected over 700 hours of robot grasp attempts [22]. Instead of using an expensive real-world robot farm, Tobin et al. use physics simulations to generate trials and then learn to transfer to the real world [23]. The downside of RL is that it is sample inefficient and requires the construction of a reward function, which can result in undesired behaviour.

A third approach uses human demonstrations to bootstrap learning from good trajectory samples [24]. This is often combined with other learning algorithms, such as RL [25] or guided policy search [26], but in order to train these networks multiple demonstrations are required. Duan et al. proposed a method to perform one-shot imitation learning on previously unseen tasks from only a single demonstration [27], using supervised meta-learning on previously demonstrated tasks. In contrast to our work, one-shot or few-shot imitation learning [27]-[30] requires demonstrations of other tasks to train a network for an unseen task, while in this paper we only use meta-learning to decrease convergence time.

In this paper we extend previous work [31], in which we used a single demonstration to grasp rectangular toy blocks. We now incorporate robustness to different orientations, similar to [11]-[13], [21], and allow for any demonstrated grasp pose (in our previous work the system was limited to perpendicular grasps). We also present results on various real-world household objects instead of toy blocks. Finally, we show that we can execute our closed-loop grasping algorithm in real time on a Jetson TX2 board.
III. LEARNING FROM DEMONSTRATION

Collaborative robots are making automation easier in production lines with their program by demonstration feature. Operators without programming knowledge can record demonstration waypoints by guiding the end effector manually, or from a teach pendant, to a desired position. From this demonstration the cobot can replay the actions and repeat the same sequence over and over. During execution of the task, however, the cobot is unable to accept feedback or updates from the operator or other environment sensors. Program by demonstration can be used in pick and place tasks that have fixed actions and where the object is kept at exactly the same position and orientation.

Our aim is to make a more robust grasp controller that does not just replay the recorded positions, but rather learns to grasp the same object as shown during the demonstration, be it at a different location or orientation. To do so, we mount an RGB camera on the end effector of the robot, capturing visual input during the demonstration. Crucially, we assume that we only have a single demonstration and that the workflow closely matches the current cobot programming workflow.

In order to train a neural network controller on a single recorded demonstration, we consider the following requirements:
- The robot operates in the same workspace as during the demonstrations.
- The object to grasp is the same as during the demonstration.
- The object is always visible from a hover pose preceding the grasp, and centered in the camera frame captured in this pose.

The first two requirements make sure that a single neural network is not expected to generalize to a wide variety of different workspaces and objects, as we train a separate set of weights for each particular scene. This allows for a moderate-sized model, enabling real-time execution at run time. The last requirement makes sure that we have an anchor to train the neural network, as we will explain in the next sections. Our approach makes robotic automation more affordable to implement, specifically for small batch production runs, by allowing the objects to be less accurately placed in the workspace.

A. Generation of a Grasp Dataset

During demonstration, the operator adopts the same workflow as in a normal program by demonstration session: he or she moves the end effector and records several waypoints for the cobot. When demonstrating a grasp, two poses are of importance: the target pose p_target, where the gripper is closed, and the previous hover pose p_hover, where the arm approaches the object to grasp. In contrast to related work, instead of finding p_target we train a controller to find p_hover, visually matching the one during the demonstration, and then execute the relative motion to the demonstrated p_target.

For each grasp demonstration we generate a dataset to train a neural network controller, based on the camera frame captured at p_hover. Crucially, the object to grasp should be focused in the center of this frame, as this will contain the visual features the controller searches for at run time. Under this assumption, we generate a dataset of positive samples by cropping 128x128 images from the center, and negative samples by randomly cropping outside this center. We sample random rotations and apply random perturbations on brightness and contrast to further enhance robustness, as in [32]. Each sample is labeled with a positive or negative label, as well as the applied rotation angle. In case previous demonstrations of different objects were inputted to the system, samples of these demonstrations can also be added to the current dataset as additional negative samples to further improve robustness. An example dataset is shown in Fig. 1.

Fig. 1: From the demonstration camera frame we generate random positive (a) and negative (b) samples by cropping, rotating and adapting brightness and contrast. By including demonstrations of other objects we can generate additional negative samples.
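The paper does not ship reference code, so the following is only a minimal sketch of how such a dataset could be built with Pillow. The 128x128 crop size, the center/off-center split, the random rotations and the brightness and contrast jitter follow the text; the rotation range, jitter strengths, sample counts and the handling of crops from other demonstrations are assumptions.

```python
import math
import random
from PIL import Image, ImageEnhance

CROP = 128  # positive and negative samples are 128x128 crops (from the text)

def make_sample(frame: Image.Image, positive: bool):
    """Create one (crop, target) pair from the demonstration frame at p_hover.

    The target holds the grasp label and the applied rotation encoded as
    cos(2*theta) and sin(2*theta). The +/-45 degree rotation range is an
    assumption that keeps cos(2*theta) non-negative, matching the sigmoid
    output used by the network.
    """
    w, h = frame.size
    angle = random.uniform(-45.0, 45.0)
    rotated = frame.rotate(angle, resample=Image.BILINEAR)

    if positive:
        cx, cy = w // 2, h // 2          # the object is centered at p_hover
    else:
        while True:                      # sample anywhere outside the central region
            cx = random.randint(CROP // 2, w - CROP // 2)
            cy = random.randint(CROP // 2, h - CROP // 2)
            if abs(cx - w // 2) > CROP // 2 or abs(cy - h // 2) > CROP // 2:
                break

    crop = rotated.crop((cx - CROP // 2, cy - CROP // 2,
                         cx + CROP // 2, cy + CROP // 2))

    # random brightness and contrast perturbations for robustness
    crop = ImageEnhance.Brightness(crop).enhance(random.uniform(0.7, 1.3))
    crop = ImageEnhance.Contrast(crop).enhance(random.uniform(0.7, 1.3))

    theta = math.radians(angle)
    target = (1.0 if positive else 0.0, math.cos(2 * theta), math.sin(2 * theta))
    return crop, target

def make_dataset(frame, extra_negatives=(), n=500):
    """Balanced dataset of n positive and n negative crops (n is an assumption).
    Crops of other demonstrated objects can be appended as extra negatives."""
    samples = [make_sample(frame, True) for _ in range(n)]
    samples += [make_sample(frame, False) for _ in range(n)]
    samples += [(crop, (0.0, 1.0, 0.0)) for crop in extra_negatives]  # quality 0, angle 0 assumed
    return samples
```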
B. Network Architecture

Our neural network architecture is shown in Fig. 2. It consists of 4 convolutional layers, an average pooling layer and two fully connected layers, implemented as 1x1 convolutions to support forwarding larger images. For example, forwarding a 640x480 image (our camera's default resolution) will result in an output size of 3x23x33, representing a grasp activation map for this image. All hidden layers have rectified linear unit (ReLU) activations, while each output has a separate activation function to limit its range.

Fig. 2: Neural network architecture: a 128x128 cropped image is forwarded through 4 convolutional layers with 8, 8, 16 and 16 filters of size 5x5 and stride 2, followed by an average pooling layer and two fully connected layers implemented as 1x1 convolutions. The outputs are two sigmoid neurons and an arctan neuron estimating the grasp success and angle region.

The neural network has three output values. The first value represents the Grasp Quality Q and has a sigmoid activation to limit the range to [0, 1]. During training, the targets are binary values 0 and 1 for negative and positive samples respectively. The two other values represent the grasping angle, encoded as two vector components of a unit circle: cos 2θ with a sigmoid activation and sin 2θ with a hyperbolic tangent activation, to limit their values to [0, 1] and [-1, 1] respectively. This encoding removes the discontinuities that would occur when the angle wraps around ±π/2 and has been validated in [11]. The choice is further motivated by the fact that continuous distributions tend to be easier to learn for neural networks [33].
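To make the architecture concrete, here is a sketch of how it could be written in PyTorch. The filter counts (8, 8, 16, 16), the 5x5 kernels with stride 2, the average pooling layer, the 1x1 "fully connected" layers and the output activations follow the description above; the padding, the pooling window and the hidden width of the 1x1 layers are assumptions, chosen so that a 640x480 input indeed yields a 3x23x33 activation map.

```python
import torch
import torch.nn as nn

class GraspNet(nn.Module):
    """Fully convolutional grasp network in the spirit of Fig. 2."""

    def __init__(self, hidden: int = 32):  # hidden width of the 1x1 layers is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(8, 8, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AvgPool2d(kernel_size=8, stride=1),
            # "fully connected" layers as 1x1 convolutions, so larger images
            # produce a dense grasp activation map instead of a single vector
            nn.Conv2d(16, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, kernel_size=1),
        )

    def forward(self, x):
        q, cos2t, sin2t = self.features(x).split(1, dim=1)
        return torch.cat([torch.sigmoid(q),        # grasp quality Q in [0, 1]
                          torch.sigmoid(cos2t),    # cos(2*theta) in [0, 1]
                          torch.tanh(sin2t)], 1)   # sin(2*theta) in [-1, 1]

# a 128x128 crop gives a 3x1x1 output; a full 640x480 frame gives 3x23x33
print(GraspNet()(torch.zeros(1, 3, 480, 640)).shape)  # torch.Size([1, 3, 23, 33])
```

Because every layer is convolutional, the weights trained on 128x128 crops can be slid over the full camera frame at run time, which is what produces the activation map used by the controller.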
C. Real-time Controller

At runtime we use a Cartesian velocity controller for the manipulation of the robot arm. Once the replay arrives at p_hover, we stream full-resolution camera images (640x480) through the neural network. The result is a three-dimensional feature plane, which represents the region activation map in the first dimension (Fig. 3) and the two vector components for the rotation angle in the last two dimensions.

Fig. 3: Example of the grasp success activation map received from the network for a wire clipper, upscaled and overlayed on the input image. Darker regions show higher activation points. The magenta vector v represents the velocity direction the controller has to take, starting from the center of the image.

We then steer the end effector in the x-y plane according to the region activation map, calculating the velocity vector v from the center of the image in the direction of the highest activation. The distance from the center to the highest activation is used as the magnitude of v. Once the highest activation is in the center of the image, the vector components are used to calculate the required rotation angle with Eq. 1:

    θ = (1/2) arctan(sin 2θ / cos 2θ)    (1)

When the controller reaches a point where the velocity and the angle are below a certain threshold, the grasp is executed. The trajectory of this grasp is the relative displacement from the demonstrated p_hover to p_target. This allows the controller to perform a grasp with any demonstrated orientation, and avoids the need for calibrating the camera with respect to the robot arm.
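A single control step could look roughly like the sketch below, which derives a velocity command and the grasp angle of Eq. 1 from the (3, H, W) network output. The gain, the stopping thresholds and the use of arctan2 to keep the correct quadrant are assumptions; the paper first centers the object and then evaluates the angle, whereas this sketch computes both every step for brevity, and sending the command to the arm is left to the robot's Cartesian velocity interface.

```python
import numpy as np

def control_step(activation_map: np.ndarray, gain: float = 0.002,
                 v_eps: float = 0.01, angle_eps: float = 0.05):
    """One closed-loop step from a (3, H, W) network output.

    Channel 0 is the grasp quality map, channels 1 and 2 hold cos(2*theta)
    and sin(2*theta). Returns a velocity command in the image plane, the
    remaining rotation angle from Eq. 1, and whether the grasp can be
    executed. Gain and thresholds are assumptions.
    """
    q, cos2t, sin2t = activation_map
    h, w = q.shape

    # pixel (row, col) of the highest grasp activation
    row, col = np.unravel_index(np.argmax(q), q.shape)

    # offset of that pixel from the image center, scaled to a velocity command
    dx, dy = col - (w - 1) / 2.0, row - (h - 1) / 2.0
    vx, vy = gain * dx, gain * dy

    # Eq. 1: theta = 1/2 * arctan(sin 2*theta / cos 2*theta)
    theta = 0.5 * np.arctan2(sin2t[row, col], cos2t[row, col])

    grasp_now = np.hypot(vx, vy) < v_eps and abs(theta) < angle_eps
    return (vx, vy), theta, grasp_now
```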
IV. DEMONSTRATION SETUP

A. Hardware

Our setup, shown in Fig. 4, consists of the Franka EMIKA Panda cobot [4] with 7 degrees of freedom, with a RealSense D435 RGB-D camera mounted on the end effector. In our current system the depth information is not used, as the RealSense camera does not produce accurate results closer than 200 mm and is unable to give valid depth data on all objects. To control the robot in real time, a Jetson TX2 is used to process the camera frames and perform inference of the trained neural network. This results in a closed-loop grasping flow of 18.38 ms on average for each captured RGB frame. Training the network is performed on a PC running Ubuntu 16.04 with an Intel Core i5-2400 CPU running at 3.10 GHz and an NVIDIA Titan X graphics card. Training starts when the image is captured, and it takes less than 5 minutes to train the network.

Fig. 4: Typical workflow: (a) the operator guides the robot arm to an overhead position, framing the object in the center of the camera inside the green rectangle. (b) The framework captures the raw image (without the green rectangle) and starts training the network to frame the object in the same position and orientation. Next, the operator performs the grasp (c) and continues with the demonstration (d).

B. Test Objects

To test our approach we use common household objects, e.g. a screwdriver, duct tape and a coffee mug (Fig. 5). These objects vary in size, shape and difficulty to grasp, while still being feasible for our gripper to grasp. The objects are a subset of the ACRV Picking Benchmark (APB) [18] and the YCB Object Set [17].

Fig. 5: Test objects for the experiments, consisting of some household items.

C. Demonstration Flow

The demonstration grasp pipeline consists of three separate stages (Fig. 4): (1) the operator demonstrates the actions, (2) the framework records an image of the object and starts training a network, and (3) the demonstrated actions are executed, fine-tuning the position and orientation before grasping.

1) Human Input: To capture a demonstration, the operator guides the robotic arm and end effector through a sequence of actions, while recording each pose, camera frame and type of action (fixed, pick or place). The operator needs to focus the object in the center of the captured frame before the pick action (Fig. 4b). The height of the hover position can be chosen by the operator, but a higher hover pose gives the camera a wider field of view on the workspace.

2) Recording and Training: For each recorded pick action, a neural network is asynchronously trained with a dataset (Fig. 1) created from the previous pose's captured frame. We train the neural network by minimizing the mean square error between its outputs and the labeled grasp success, angle sine and cosine. We use the Adam optimizer [34] with learning rate 0.001 and train for 2000 epochs with batch size 128. When multiple demonstrations are available in the system, we use Reptile meta-learning [35], an extension of the MAML algorithm [36], to learn a good weight initialization, which speeds up training convergence. The weight initialization parameters are trained on previously seen objects, as described in [31]. Next, we train for half the number of iterations with a batch size of 128 for the target object. The accuracy is evaluated on a held-out balanced validation set.

Fig. 6: Training performance for the single, multi and reptile configurations. The reported accuracy is the model performance on a balanced validation set including all objects. The single configuration performs worse as it has not seen other objects during training. Averages over 5 random seeds are shown.
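The training procedure above could be wired up roughly as follows; this is a sketch, not the authors' code. The MSE loss and Adam with learning rate 0.001 come from the text, the 2000 epochs are interpreted here as 2000 batch updates in line with the "Batches" axis of Fig. 6, and the Reptile step counts and meta step size are assumptions. GraspNet refers to the architecture sketch above, and each loader is assumed to be a standard PyTorch DataLoader yielding batches of 128 generated crops with their three-valued targets.

```python
import copy
import random
import torch
import torch.nn as nn

def train_grasp_net(net, loader, iterations=2000, lr=1e-3, device=None):
    """Minimize the MSE between the outputs and the labeled grasp success,
    cos(2*theta) and sin(2*theta) targets, with Adam and learning rate 0.001."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    net.to(device).train()
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    batches = iter(loader)
    for _ in range(iterations):
        try:
            images, targets = next(batches)
        except StopIteration:              # restart the loader when it runs out
            batches = iter(loader)
            images, targets = next(batches)
        out = net(images.to(device)).flatten(1)   # (B, 3) for 128x128 crops
        loss = loss_fn(out, targets.to(device))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net

def reptile_init(task_loaders, meta_steps=100, inner_iterations=50, meta_lr=0.1):
    """Reptile meta-learning over previously demonstrated objects: repeatedly
    adapt a copy of the initialization to one task and move the initialization
    towards the adapted weights. Step counts and meta step size are assumptions."""
    meta_net = GraspNet()                         # architecture from the earlier sketch
    for _ in range(meta_steps):
        task_net = train_grasp_net(copy.deepcopy(meta_net),
                                   random.choice(task_loaders), inner_iterations)
        with torch.no_grad():
            for p_meta, p_task in zip(meta_net.parameters(), task_net.parameters()):
                p_meta += meta_lr * (p_task.detach().cpu() - p_meta)
    return meta_net
```

Starting from such an initialization, the target object would then be trained with train_grasp_net(net, loader, iterations=1000), i.e. half the number of updates used when training from scratch, as reported in the text.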