




PPR-Net: Point-wise Pose Regression Network for Instance Segmentation and 6D Pose Estimation in Bin-picking Scenarios

Zhikai Dong1, Sicheng Liu1, Tao Zhou2, Hui Cheng3, Long Zeng1, Xingyao Yu2, Houde Liu1

Abstract—Accurate object 6D pose estimation is a core task for robot bin-picking applications, especially when objects are randomly stacked with heavy occlusion. To address this problem, this paper proposes a simple but novel Point-wise Pose Regression Network (PPR-Net). For each point in the point cloud, the network regresses a 6D pose of the object instance that the point belongs to. We argue that the regressed poses of points from the same object instance should be located closely in pose space. Thus, these points can be clustered into different instances, and the 6D poses of their corresponding objects can be estimated simultaneously. In our experiments, PPR-Net outperforms the state-of-the-art approach by 15%-41% in average precision when evaluated on the benchmark Siléane dataset. In addition, it also works well in real-world robot bin-picking tasks.

I. INTRODUCTION

Vision-guided robot bin-picking has diverse industrial applications in manufacturing and logistics. Its core problem is object instance segmentation and 6D pose estimation from a point cloud. In typical bin-picking applications, a pile of objects from multiple categories is randomly stacked with heavy occlusion, and a robot is required to pick up a target object from the pile. The heavy occlusion makes instance segmentation and 6D pose estimation quite challenging. This paper proposes a method that can segment individual object instances, recognize their categories, and estimate their 6D poses simultaneously. This information can be sent directly to the robot for bin-picking tasks.

Previous works usually treat segmentation and pose estimation as a feature matching problem between point clouds or images and their templates, respectively [1], [2], [3]. However, they are sensitive to the occlusion induced by stacked objects. Therefore, more recent works [4], [5] adopted deep learning approaches and regarded this problem as a combination of object detection (or segmentation) and pose regression (or pose classification for discrete pose prediction) in an end-to-end manner. These works take RGB or RGB-D images as input in order to utilize the color and texture information of objects, which is often not available for mechanical parts in industrial settings. Sock et al. [6] proposed a network for depth-image inputs and achieved impressive results on an industrial-application-oriented bin-picking dataset. However, these image-based methods essentially treat 3D scenes as a regular grid structure, where information is encoded as pixel values and relationships in grid space, which does not reflect the genuine relationships between points in 3D space.

Fig. 1: Instance segmentation and pose estimation for point clouds using PPR-Net. (a) Result on a real-world robotic bin-picking trial. (b) Result on object "Bunny" from the Siléane dataset.

* These authors contributed equally. Corresponding author. This work was done during the internship of Zhikai Dong and Sicheng Liu at SenseTime Research. 1 International Graduate School at Shenzhen, Tsinghua University. 2 SenseTime Research. 3 Sun Yat-Sen University.
In contrast, learning on point cloud representations can fully exploit the geometric and topological structure of 3D space and may lead to better performance, as supported by recent developments in related fields such as 3D object detection [7].

In this work, we propose the Point-wise Pose Regression Network (PPR-Net), a deep network that performs instance segmentation and pose estimation from point clouds. The intuition behind PPR-Net is straightforward: if the network can predict point-wise features that are close to each other if and only if the points belong to the same object instance, it becomes easy to segment the point cloud into different object instances by clustering in the corresponding feature space. Specifically, we choose the 6D pose of each point's corresponding object as the predicted feature for PPR-Net, since it is unique for each rigid object instance.

Defining a loss function for pose regression is a non-trivial problem, since it requires proper handling of object symmetries, which are common among industrial objects. In our network, we adopt the pose distance metric proposed in [8] as our loss function. This metric takes object shape and symmetry into account and represents poses as a set of points in Euclidean space, which allows us to train our network efficiently.

When evaluated on the Siléane dataset [9], our method outperforms the previous state of the art by 15%-41% in average precision. We also applied our method to a real-world robot bin-picking pipeline in which PPR-Net predicts poses before motion planning and grasp execution. As shown in Fig. 1(a), the test scene contains multiple instances of two types of objects. This pipeline successfully estimated poses and correctly picked up objects in all 20 trials, which demonstrates the robustness and usefulness of PPR-Net in real-world applications.

In summary, the main contributions of our work are:

- We introduce a novel point-cloud-based network, the Point-wise Pose Regression Network (PPR-Net), for instance segmentation and 6D pose estimation that can handle heavy object occlusion.
- We employ a compact yet powerful pose distance metric as the pose loss, which enables us to handle symmetric objects and increases computational efficiency.
- We demonstrate significant performance improvements of our approach (up to 41%) compared with state-of-the-art approaches on the Siléane dataset, and show that it works well in a real-world bin-picking application.

II. RELATED WORK

A. Object segmentation

Object segmentation methods have been widely used in robot bin-picking tasks. During the Amazon Picking Challenge (APC), several teams [10], [11] utilized similar approaches in which segmentation was carried out first, before estimating the pose of each object instance.

In a general robot bin-picking task, scenes may contain multiple instances of multiple types of objects, so instance-level segmentation is needed before performing pose estimation. Deep learning based approaches have shown remarkable performance on segmentation tasks. Long et al. [12] proposed fully convolutional networks (FCN) for pixel-wise semantic classification. Grard et al. [13] extended this work by semantically segmenting the edges of stacked objects in depth images using an FCN. In this way, different instances can be separated according to the connectivity of the predicted masks. He et al.
[14] presented an instance segmentation network, Mask R-CNN, which combines object detection with semantic segmentation to produce an individual mask for each instance in RGB images. To facilitate robot grasping, Danielczuk et al. [15] trained an adaptation of Mask R-CNN designed to perform deep category-agnostic object instance segmentation on depth images.

In recent years, deep learning on point clouds has also gained great momentum thanks to seminal works such as PointNet [16] and PointNet++ [17]. PointNet extracts features from raw point cloud input using shared multi-layer perceptrons and symmetric functions, and achieves impressive results on classification and semantic segmentation tasks. However, it treats each point individually, essentially discarding local geometric information. Several follow-up works try to fix this problem, either by applying PointNet on local point sets and hierarchically aggregating them for feature extraction [17], [18], or by defining convolution operations that consider the relationships between neighboring point pairs [19], [20]. Based on [16], [17], Wang et al. [21] proposed SGPN, an instance segmentation architecture that generates group proposals from the similarities between point pairs in an embedded feature space.

B. Pose estimation

Typically, pose estimation is based on point cloud registration, which aims to find a spatial transformation that aligns two point sets. Rusu et al. [1] proposed a coarse registration approach, SAC-IA, which exploits hand-crafted local FPFH features for point-pair matching and uses RANSAC for pose hypothesis estimation. The coarse alignment can then be refined by ICP-based methods [22]. Drost et al. [3] proposed to use Point Pair Features (PPF) to build a hash table as a global descriptor of the object model and to retrieve poses from the scene point cloud via a voting scheme; this was extended by [23], [24] for better performance under noise and occlusion. The work in [25] uses a point cloud convolutional neural network to regress the 3D rotation of pre-segmented point sets, which demonstrates the effectiveness of point cloud deep learning for pose regression.

Although the approaches in [1], [3], [23], [24] theoretically do not need segmentation in advance, their performance deteriorates in complex scenarios because similar-looking objects, or instances of the same object, exhibit similar features. Recently, there has been an increasing trend towards integrating pose estimation with object segmentation [26], [5] or detection [27], [2], [4], [6] in order to propose pose hypotheses directly from complex environments.

Hinterstoisser et al. [2] proposed a template matching method based on the LINEMOD feature [27], in which templates comprised of densely sampled image gradients and depth-map normals slide over the input RGB image for simultaneous detection and pose estimation. However, such templates are sensitive to occlusions. To tackle this problem, the approach in [26] combines the LINEMOD feature [27] with Hough Forests [28]: it trains a Latent-Class Hough Forest using synthetic images, and during inference, patches are randomly sampled from test images to produce the final prediction.

Fig. 2: Architecture of our pipeline for instance segmentation and pose estimation. PPR-Net produces a dense prediction for each point of the input cloud, which can be used to generate instance segments and pose hypotheses.

The latest works utilize image-based convolutional neural networks for this purpose. SSD-6D [4] extends the classical image detector SSD by regressing object poses for estimated object bounding boxes. Xiang et al.
[5] present a network with a semantic segmentation branch, a translation prediction branch, and a rotation prediction branch. The translation prediction branch outputs the translation of each pixel for voting in Hough space, and the rotation prediction branch directly regresses a quaternion for each instance. Sock et al. [6] proposed a multi-task network for bin-picking scenarios in which multiple instances of an object are piled randomly. It consumes a depth image as input, jointly performs the tasks of 2D detection, depth prediction and pose estimation, and achieves state-of-the-art results on the Siléane dataset for object detection and pose estimation [9].

Our work extends the idea of Hough voting from the approaches [29], [26], [5] to point clouds using a deep learning framework, and outperforms both traditional methods [3], [2], [30] and the image-based deep learning method [6] by a large margin on the public benchmark for bin-picking scenarios [9].

III. METHOD

This paper proposes a framework for simultaneous segmentation and 6D pose estimation in general industrial bin-picking scenarios, where multiple instances of different types of objects are stacked randomly into a pile. Our framework directly takes a 3D point cloud as input and produces object segments and their corresponding poses for every visible instance in the scene. Utilizing recent developments in deep learning on point clouds, we introduce a novel yet simple architecture called the Point-wise Pose Regression Network, which consumes a point cloud and, for each point of the input, regresses the 6D pose of the object to which that point belongs. Intuitively, the output of the network will be clustered in pose space, so we can obtain instance-level segmentation and pose hypotheses at the same time.

Section III-A presents the architecture of the Point-wise Pose Regression Network. Section III-B introduces the distance metric in pose space. Sections III-C and III-D give details on segmentation and pose estimation, and on our implementation, respectively.

A. Point-wise Pose Regression Network

As shown in Fig. 2, our proposed architecture begins with a Point-wise Pose Regression Network (PPR-Net), which jointly learns the tasks of pose estimation, semantic segmentation and visibility prediction. It first feeds the raw point cloud of the cluttered scene, PC of size $N_p$, through a feed-forward network for feature extraction. We adopt PointNet++ [17], a deep learning network capable of extracting both global and local features from point sets, as the backbone of PPR-Net. The extracted features (denoted $F_e$) have size $N_p \times N_e$, and each row represents the corresponding point in an embedded feature space. The network then diverges into four branches, which consume $F_e$ with shared multi-layer perceptrons (MLPs) to obtain the pose transform matrix, the visibility prediction and the semantic classification for each point. The total loss $L$ of our network is the weighted sum of the losses of the transform regression branches $L_P$, the visibility prediction branch $L_V$ and the semantic segmentation branch $L_{SEM}$:

$$L = L_P + \alpha L_V + \beta L_{SEM} \tag{1}$$

where $\alpha$ and $\beta$ are scaling constants.

Branch for Semantic Segmentation: For cluttered scenes where instances of multiple types of objects exist, the semantic segmentation branch is needed to perform classification for each point. We pass the extracted features $F_e$ of size $N_p \times N_e$ through an MLP and produce the semantic prediction $S$ of size $N_p \times N_c$, where $N_c$ is the number of object classes. $S$ indicates the type of object to which each point belongs, and element $S_{ij}$ represents the probability that the $i$-th point belongs to an object of class $j$. The loss $L_{SEM}$ is the sum of the softmax cross-entropy losses between the prediction $S$ and the ground-truth labels. Since a-priori information such as the semantic label may be beneficial for other learning tasks, we further concatenate $F_e$ and $S$ to produce semantic-class-aware features $F_{ec}$, which serve as input for the other branches of the network.
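To make the branch layout concrete, the following is a minimal PyTorch-style sketch of the prediction head that sits on top of the PointNet++ backbone. The paper does not publish code, so the names (`PPRNetHead`, `point_mlp`), the layer widths, and the use of softmax probabilities when forming $F_{ec}$ are our assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the PPR-Net prediction head (names and layer
# sizes are assumptions; the paper specifies only the branch structure).
import torch
import torch.nn as nn

def point_mlp(in_dim, out_dim, hidden=128):
    """A shared MLP applied independently to every point (last dim)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class PPRNetHead(nn.Module):
    """Four per-point branches on top of backbone features F_e."""
    def __init__(self, n_embed: int, n_classes: int):
        super().__init__()
        self.sem = point_mlp(n_embed, n_classes)   # semantic logits S
        fused = n_embed + n_classes                # F_ec = [F_e, S]
        self.center = point_mlp(fused, 3)          # object center (x, y, z)
        self.euler = point_mlp(fused, 3)           # rotation as Euler angles
        self.vis = point_mlp(fused, 1)             # visibility in [0, 1]

    def forward(self, f_e: torch.Tensor) -> dict:
        # f_e: (B, N_p, N_e) per-point features from a PointNet++ backbone.
        s = self.sem(f_e)                          # (B, N_p, N_c)
        # Concatenate semantic probabilities to form class-aware features.
        f_ec = torch.cat([f_e, torch.softmax(s, dim=-1)], dim=-1)
        return {
            "semantics": s,                        # (B, N_p, N_c)
            "center": self.center(f_ec),           # (B, N_p, 3)
            "euler": self.euler(f_ec),             # (B, N_p, 3)
            "visibility": torch.sigmoid(self.vis(f_ec)).squeeze(-1),
        }
```

The per-point translation and rotation outputs are then combined into a rigid transform for each point, and Eq. (1) sums the three branch losses.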
Branches for Transform Regression: We feed the features $F_{ec}$ into two separate MLPs. The first regresses the center position of the object to which each point belongs. The second predicts the rotation as Euler angles. The authors of [25] argue that learning the axis-angle representation of rotation is a better choice. We conducted extensive experiments in our framework and observed no obvious difference between the different rotation representations. Since Euler angles are more intuitive for humans, we choose the Euler angle representation.

The rigid transform $T$ can be obtained by combining the translation and rotation predictions. For rigid objects, a transform $T$ corresponds to a unique pose $P$, which can be represented in Euclidean space as a finite set of points $\mathcal{R}(P)$ of at most 12 dimensions. $\mathcal{R}(P)$ enables us to compute the pose difference for any type of symmetric object efficiently by calculating the Euclidean distance between pose representatives; details of the pose metric and of $\mathcal{R}(P)$ are given in Section III-B. For the vanilla version of our network, the loss of the transform regression branches $L_P$ is the sum of the pose distances between prediction and ground truth over all points in the point cloud PC:

$$L_P = \sum_i \mathrm{dist}\big(\mathcal{R}(P_i^{pred}), \mathcal{R}(P_i^{gt})\big) \tag{2}$$

Branch for Visibility Prediction: In heavily cluttered scenarios, many instances suffer from extreme occlusion. For bin-picking tasks, such objects are not of interest, since a high occlusion rate implies that they lie at the bottom of the pile and are thus ungraspable. More importantly, these objects usually lack sufficient information for target pose estimation, which may mislead the learning process. To alleviate the impact of the confusing information introduced by occluded objects, we introduce a visibility $V$ into our network and add an extra MLP to regress it from the semantic-class-aware features $F_{ec}$. The visibility of a point reflects the occlusion degree of the corresponding instance. Let $N_i$ denote the number of points of the instance to which the $i$-th point belongs, and $N_{max}$ the number of points of the instance with the most points in the scene. The visibility $V$ can then be simply approximated by the ratio of $N_i$ to $N_{max}$:

$$V_i = \frac{N_i}{N_{max}} \tag{3}$$

For the modified network with visibility prediction, the loss of the transform regression branches $L_P$ is defined as the pose difference weighted by the ground-truth visibility, which means that instances with relatively complete surfaces play a more important role during training. The visibility loss $L_V$ is given by the distance between the expected and inferred $V$:

$$L_P = \sum_i V_i^{gt} \, \mathrm{dist}\big(\mathcal{R}(P_i^{pred}), \mathcal{R}(P_i^{gt})\big) \tag{4}$$

$$L_V = \sum_i \big\| V_i^{pred} - V_i^{gt} \big\|^2 \tag{5}$$

During inference, we use the estimated visibility to filter out points introduced by sensor noise and severe occlusion for better performance.
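The visibility target of Eq. (3) and the weighted losses of Eqs. (4)-(5) are straightforward to compute from per-point instance labels. Below is a hedged PyTorch sketch; function names are ours, and `dist` is stood in for by the plain Euclidean distance between 12-D pose representatives (Section III-B).

```python
# Sketch of the visibility target (Eq. 3) and training losses (Eqs. 4-5).
# Function names are ours; pose representatives are assumed precomputed.
import torch

def visibility_targets(instance_ids: torch.Tensor) -> torch.Tensor:
    """Eq. 3: V_i = N_i / N_max from per-point ground-truth instance ids."""
    _, inverse, counts = torch.unique(
        instance_ids, return_inverse=True, return_counts=True)
    n_i = counts[inverse].float()   # size of each point's instance
    return n_i / counts.max()       # (N_p,) values in (0, 1]

def transform_and_visibility_losses(rep_pred, rep_gt, vis_pred, vis_gt):
    """Eqs. 4-5. rep_*: (N_p, 12) pose representatives; vis_*: (N_p,).

    For symmetric objects dist() is the minimum over the set of
    representatives R(P); here each pose is assumed reduced to a single
    12-D representative for brevity.
    """
    d = torch.norm(rep_pred - rep_gt, dim=-1)   # per-point pose distance
    l_p = (vis_gt * d).sum()                    # Eq. 4
    l_v = ((vis_pred - vis_gt) ** 2).sum()      # Eq. 5
    return l_p, l_v

# Example: three instances with 3, 2 and 1 visible points.
ids = torch.tensor([0, 0, 0, 1, 1, 2])
print(visibility_targets(ids))
# tensor([1.0000, 1.0000, 1.0000, 0.6667, 0.6667, 0.3333])
```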
B. Pose Metrics and Distances

Romain Brégier et al. [8] defined a pose distance that can be evaluated efficiently using a representation of poses within a Euclidean space of at most 12 dimensions, depending on the object's symmetries. Let $\Lambda$ be the positive semi-definite square root of the covariance matrix of the object's weighted surface, obtained by Equation (6):

$$\Lambda \triangleq \left( \frac{1}{|S|} \int_S \rho(x) \, x x^T \, ds \right)^{1/2} \tag{6}$$

where $S$ is the set of points of the object at the reference pose, and $\rho$ is a positive density distribution defined on $S$. $\Lambda$ can be calculated directly from the triangular mesh file of the object. A transform $(R, t)$ can then be represented in a 12-dimensional Euclidean space:

$$\mathcal{R}(P) \triangleq \left\{ \big( \mathrm{vec}(R G \Lambda)^T, t^T \big)^T \;\middle|\; G \in \mathcal{G} \right\} \subset \mathbb{R}^{12} \tag{7}$$

where $\mathcal{G}$ denotes the set of symmetries of the object.
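To make Eqs. (6)-(7) concrete, here is an illustrative NumPy sketch under stated assumptions: this is our reading of [8] rather than the authors' code, $\Lambda$ is estimated by a Monte Carlo average over points sampled uniformly on the surface (uniform $\rho$), vec is taken row-major (any fixed vectorization induces the same distance), and the pose distance is the minimum Euclidean distance between representatives.

```python
# Illustrative sketch of Eqs. 6-7 (our reading of Bregier et al. [8];
# all names are ours). Assumes uniform surface sampling, i.e. rho = const.
import numpy as np

def lambda_matrix(surface_pts: np.ndarray) -> np.ndarray:
    """Eq. 6 as a Monte Carlo average: Lambda = (E[x x^T])^(1/2).

    surface_pts: (M, 3) points sampled uniformly on the object's surface
    at its reference pose.
    """
    cov = surface_pts.T @ surface_pts / len(surface_pts)  # (3, 3), PSD
    w, v = np.linalg.eigh(cov)                            # symmetric eigendecomp
    return v @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ v.T

def pose_representatives(R, t, sym_group, lam):
    """Eq. 7: one 12-D vector per symmetry G of the object."""
    return np.stack(
        [np.concatenate([(R @ G @ lam).ravel(), t]) for G in sym_group])

def pose_distance(reps_a, reps_b):
    """Distance between two poses: minimum over pairs of representatives,
    which handles objects with finite symmetry groups."""
    diff = reps_a[:, None, :] - reps_b[None, :, :]
    return np.linalg.norm(diff, axis=-1).min()

# Example: a part with a 2-fold symmetry about the z-axis.
G2 = [np.eye(3), np.diag([-1.0, -1.0, 1.0])]
lam = np.eye(3) * 0.05                  # stand-in for a real mesh's Lambda
a = pose_representatives(np.eye(3), np.zeros(3), G2, lam)
b = pose_representatives(np.diag([-1.0, -1.0, 1.0]), np.zeros(3), G2, lam)
print(pose_distance(a, b))              # 0.0: the two poses are equivalent
```

Because distances between these 12-D representatives are plain Euclidean, per-point regressed poses from the same instance land close together in this space, which is what makes the clustering step described in the introduction possible.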