PointAtrousNet: Point Atrous Convolution for Point Cloud Analysis

Liang Pan¹, Pengfei Wang² and Chee-Meng Chew¹

Abstract: In this paper, we propose a permutation-invariant architecture, PointAtrousNet (PAN), which focuses on exploiting multi-scale local geometric details for point cloud analysis. Inspired by atrous convolution in the image domain, we propose the Point Atrous Convolution (PAC) operation. Our PAC can effectively enlarge the receptive field of filters without introducing more parameters or increasing the amount of computation. In particular, we propose a novel Point Atrous Spatial Pyramid Pooling (PASPP) module to explicitly exploit neighboring contextual information at multiple scales. Moreover, local geometric details are captured by constructing neighborhood graphs in metric and feature spaces. Experimental results show that our PAN achieves state-of-the-art performance on various point cloud inference applications.

I. INTRODUCTION

3D semantic perception is important for many robotics applications. For example, a simple fetching task requires a robot to recognize and localize a specific object in a 3D scene. A self-driving car usually integrates a lidar sensor and navigates by understanding the observed 3D point clouds. Owing to their effectiveness in capturing multi-scale spatially-local correlations in 2D images, deep convolutional neural networks have yielded impressive results for scene understanding [1], [2]. However, conventional convolution operations are not applicable to unordered and irregular 3D points [3]. Hence, it is imperative to design deep networks on 3D points for robotics-related applications, such as 3D object classification and 3D semantic labeling.

Following the pioneering permutation-invariant network, PointNet [3], researchers apply symmetric functions, such as shared multi-layer perceptrons (mlp) and max-pooling, to design deep networks for point cloud analysis. As a follow-up work, PointNet++ [4] proposes a novel set of abstraction layers to combine multi-scale local features hierarchically. Recently, DGCNN [5] introduced a novel EdgeConv operation to capture local edge features in 3D points. However, these networks [4], [5], [6] either lack the ability to learn edge features or cannot explicitly learn multi-scale point features.

On the other hand, atrous convolution is a well-established operation for learning dense image features. Its core idea is to insert holes between non-zero filter taps (i.e., filter elements/weights), which is equivalent to upsampling the convolution filters [7]. As illustrated in Fig. 1, an atrous convolution can cover a relatively larger range of input signals with a small convolution kernel. Consequently, it can effectively enlarge the receptive fields to incorporate larger-scale contextual information without increasing the number of parameters. Many networks [8], [9], [10] adopt atrous convolution to gather multi-scale contextual information in images, which provides competitive results on image-based inference tasks.

Motivated by the success of atrous convolution, we propose PointAtrousNet (PAN), a deep permutation-invariant network that explicitly exploits multi-scale local neighborhood information for the analysis of unorganized 3D points.

¹Liang Pan and Chee-Meng Chew are with the Department of Mechanical Engineering, National University of Singapore, 21 Lower Kent Ridge Rd, Singapore.
²Pengfei Wang is with the Temasek Laboratories, National University of Singapore, 5A Engineering Drive 1, 117411, Singapore.
Following EdgeConv [5], we also extract edge features by concatenating the features of a centroid point and its neighboring points. Unlike EdgeConv, we add a sampling rate to select neighboring point features, and thus propose our Point Atrous Convolution (PAC) module. Furthermore, our PAC module can efficiently enlarge the field of view of filters in 3D points (shown in Fig. 2). In particular, we propose the Point Atrous Spatial Pyramid Pooling (PASPP) module (shown in Fig. 3) to explicitly extract multi-scale local edge features. In addition, our PAN searches for neighboring point features in metric spaces and feature spaces. Experimental results show that our PAN achieves state-of-the-art performance on various 3D point tasks, including object classification, object-part segmentation and semantic segmentation.

II. RELATED WORK

A. Point Cloud Analysis

3D points can be organized using volumetric grids for deep learning. VoxNet [11] introduces an architecture that integrates a volumetric occupancy grid representation with a supervised 3D convolutional neural network. 3D ShapeNets [12] applies a Convolutional Deep Belief Network to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid.

Point clouds are straightforward representations of unordered 3D points. A network should respect the permutation invariance of N 3D input points, which means that it needs to be invariant to the N! permutations of the input set in data feeding order [3]. Qi et al. propose the pioneering work PointNet [3], which maintains the permutation invariance property by applying symmetric functions. In PointNet++ [4], Qi et al. apply many mini-PointNets to adaptively combine multi-scale local features. SPLATNet [13] learns hierarchical and spatially-aware features by using sparse bilateral convolutional layers. SpiderCNN [14] designs the filter as a product of a simple step function and a Taylor polynomial, which capture local geodesic information and ensure expressiveness, respectively. DGCNN [5] and KCNet [6] construct neighborhood graphs to exploit local neighboring contexts. However, the lack of different-scale receptive fields limits their ability to exploit multi-scale local geometries in 3D points.

Fig. 1. Atrous convolution: (a) 1D atrous convolution; (b) 2D atrous convolution. The applied atrous convolution kernels have the same sampling rate (r = 2). For the 1D case, the selected 3 signals (shown in red) cover 5 input signals (a larger field of view), which is equivalent to inserting 2 holes (zeros) between the 3 non-zero filter taps (the filter elements corresponding to the selected 3 signals shown in red). Similarly, in the 2D case the selected 9 signals (shown in red) cover 25 input signals (a larger field of view).

B. Atrous Convolution

Atrous convolution [15], also known as dilated convolution, has been referred to as "convolution with a dilated filter" in the past [16]. Chen et al. [17] have shown that atrous convolution is beneficial for incorporating larger-scale contextual information. Yu et al. [16] employ a series of atrous convolutional layers with increasing rates to aggregate multi-scale context.
Later, Chen et al. [7] propose the atrous spatial pyramid pooling (ASPP) scheme, which lays out multiple atrous convolutional layers with different sampling rates in parallel. DeepLabv3 [9] and DeepLabv3+ [10] apply both an ASPP module and a deep encoder-decoder architecture to further improve performance by exploiting multi-scale contextual information in image domains. The effectiveness of atrous convolution has been demonstrated on many image-based tasks, such as object detection [18], [8] and segmentation [19], [9], [10]. In view of this, we follow the idea of atrous convolution to enlarge the receptive fields of filters, which benefits the learning of multi-scale features for point cloud analysis.

III. REVISIT ATROUS CONVOLUTION

Atrous convolution has been actively explored for image-based tasks, especially semantic segmentation. We illustrate the atrous convolution operation with a simple example on one-dimensional signals [7], as shown in Fig. 1(a). The output signal y[i] is computed as

    y[i] = \sum_{k=1}^{K} x[i + r \cdot k] \, w[k],    (1)

where x is the input signal, w is the one-dimensional filter, K is the number of weights in the kernel, and r is the sampling rate with which we sample the input signal. The indices i and k refer to a position in the input signal and a weight of the applied kernel, respectively.

One way to enlarge the field of view of a convolution filter is to perform the convolution over sparser input signals. Atrous convolution, however, achieves a large receptive field without losing signal density. In the same spirit, 2D atrous convolution (shown in Fig. 1(b)) samples the input image-based feature maps along both the x and y directions. Therefore, 2D atrous convolution can be applied to high-resolution feature maps to extract dense image features.
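To make Eq. (1) concrete, the following minimal NumPy sketch implements the 1D case (the function name and toy signal are ours, not from the paper; the tap index starts at 0 rather than 1, which only shifts the output):

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """1D atrous convolution, Eq. (1): y[i] = sum_k x[i + r*k] * w[k].
    With K taps and rate r, each output sees r*(K-1)+1 input samples."""
    K = len(w)
    n_out = len(x) - r * (K - 1)          # keep only valid output positions
    return np.array([sum(x[i + r * k] * w[k] for k in range(K))
                     for i in range(n_out)])

x = np.arange(10, dtype=float)            # toy 1D input signal
w = np.array([1.0, 2.0, 1.0])             # K = 3 non-zero filter taps
print(atrous_conv1d(x, w, r=1))           # field of view: 3 samples
print(atrous_conv1d(x, w, r=2))           # same 3 taps cover 5 samples, as in Fig. 1(a)
```

With r = 2 the filter touches inputs {i, i+2, i+4}, i.e., it behaves like a 5-tap filter with two zero taps, exactly as described for Fig. 1(a), while still costing only 3 multiplications per output.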
Fig. 2. Point Atrous Convolution (PAC). We encode local edge features by concatenating the centroid point feature p and its k neighboring point features. q_i denotes the i-th nearest neighboring point feature of p. We propose a sampling rate r to select every r-th nearest neighbor point feature of p, i.e., q_r, q_{2r}, ..., q_{kr}. An example with sampling rate r = 2 is presented in the middle: for k = 5, the selected neighboring points (shown as blue circles) are q_2, q_4, q_6, q_8, q_{10}. An example of our PAC on a real point cloud (a chair) is illustrated in the right figure.

IV. OUR POINTATROUSNET

Following previous works [3], [5], our PointAtrousNet (PAN) also applies the symmetric operations of feature concatenation, shared mlp and max-pooling. Different from previous works, our PAN adopts an encoder-decoder architecture (especially our segmentation network), which searches for neighboring point features in feature spaces and metric spaces. Specifically, we propose Point Atrous Convolution (PAC), which can effectively enlarge receptive fields without increasing the number of required neighboring point features. Furthermore, we propose the Point Atrous Spatial Pyramid Pooling (PASPP) module, which explicitly encodes multi-scale point features by exploiting neighboring points with different sampling rates. In addition, we do not require subsampling or upsampling operations, which benefits the learning of dense features.

A. Point Atrous Convolution

To encode local edge features from neighboring point features, previous networks often construct large neighborhood graphs (e.g., kNN = 20) to exploit sufficient contextual information [5], [14]. Another strategy is to encode multiple neighboring point features with different radii and then concatenate them together [4]. However, these operations make the networks inefficient with respect to required training time and consumed training memory. We propose the Point Atrous Convolution (PAC) operation, which can arbitrarily enlarge the field of view of filters by selecting neighboring points. In our PAC, each input point is considered as a centroid point. In particular, we apply the sampling rate parameter r to sparsely sample the neighboring point features in feature spaces or metric spaces. Therefore, we do not increase the number of parameters or the computation load, which is an important advantage, especially for large-scale point cloud analysis. Our PAC operation is defined as:

    X'_p = g(H(X_p, X_{q_r}), \ldots, H(X_p, X_{q_{rk}})),    (2)

where H(\cdot) denotes the edge kernel h_\Theta(X_p \oplus (X_p - X_{q_i})), X_p is the feature of the centroid point p, X_{q_{rk}} is the feature of point q_{rk}, which is the (r \cdot k)-th nearest neighbor of point p, r is the sampling rate, k is the total number of searched neighboring points, and g(\cdot) denotes a max-pooling function. Here h_\Theta is a shared mlp and \oplus denotes feature concatenation. If we set the sampling rate to 1, our PAC degenerates into the normal EdgeConv operation.

As illustrated in Fig. 2, we apply our PAC to select 5 neighboring point features with sampling rate r = 2. As a consequence, the selected neighboring point features are q_2, q_4, q_6, q_8, q_{10}, where q_i denotes the i-th nearest neighboring point feature of a certain centroid point p. Without sampling neighboring points (equivalent to r = 1), we would select q_1, q_2, q_3, q_4, q_5 instead. Therefore, we can increase the sampling rate r to arbitrarily enlarge the field of view of filters in PAC operations without additional computation (fixed k).
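As a minimal illustration of Eq. (2), the NumPy sketch below implements PAC under our own simplifying assumptions: neighbors are found by brute-force kNN in feature space, and the shared mlp h_\Theta is a single random linear layer with ReLU. The function name, shapes and channel sizes are ours, not the authors' implementation:

```python
import numpy as np

def point_atrous_conv(feats, k, r, W):
    """Point Atrous Convolution (Eq. (2)): for each centroid point p, pick
    its r-th, 2r-th, ..., kr-th nearest neighbors, form edge features
    X_p (+) (X_p - X_q), apply the shared mlp h, and max-pool over edges."""
    # pairwise distances in feature space (a metric-space PAC would use xyz)
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)                 # column 0 is the point itself
    idx = order[:, r * np.arange(1, k + 1)]       # every r-th nearest neighbor
    centroid = np.repeat(feats[:, None, :], k, axis=1)            # (N, k, C)
    edge = np.concatenate([centroid, centroid - feats[idx]], axis=-1)
    return np.maximum(edge @ W, 0.0).max(axis=1)  # shared mlp + max-pool g

feats = np.random.randn(256, 64)                  # N = 256 points, C = 64
W = 0.1 * np.random.randn(128, 64)                # maps 2C -> 64 output channels
out = point_atrous_conv(feats, k=5, r=2, W=W)     # uses q_2, q_4, ..., q_10
print(out.shape)                                  # (256, 64)
```

Setting r = 1 recovers plain EdgeConv over q_1, ..., q_5; any larger r widens the field of view while the per-point cost (k edges) stays fixed.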
Fig. 3. Point Atrous Spatial Pyramid Pooling (PASPP). Our PASPP consists of 4 parallel Point Atrous Convolution layers with the same number of selected neighboring points (e.g., k = 5 in the figure) and different sampling rates (e.g., r = 1, 2, 3, 4 in the figure).

B. Point Atrous Spatial Pyramid Pooling

Inspired by the atrous spatial pyramid pooling (ASPP) module [7], we propose our Point Atrous Spatial Pyramid Pooling (PASPP) module to learn edge features at multiple scales. Our PASPP explicitly captures multi-scale local contextual information in point clouds by applying multiple parallel PAC layers with different sampling rates. The edge features extracted with different sampling rates are further processed and then fused together to generate the multi-scale edge features densely associated with each point. In particular, our PASPP module searches for neighboring points by constructing neighborhood graphs in metric spaces rather than in feature spaces. In this way, we also take the structural relationships of the input 3D points into consideration. In our experiments, we usually apply 4 parallel PAC layers in our PASPP module, as shown in Fig. 3. The operation of our PASPP is then given as:

    X'_p = X'_{p1} \oplus X'_{p2} \oplus X'_{p3} \oplus X'_{p4},    (3)

    X'_{pi} = g(H(X_p, X_{q_{r_i}}), \ldots, H(X_p, X_{q_{r_i k}})),    (4)

where X'_p is the output feature of point p, and r_i is the sampling rate of the i-th branch.
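Reusing the point_atrous_conv sketch from Sec. IV-A, a short NumPy sketch of Eqs. (3)-(4) runs four parallel PAC branches with rates r_i = 1, 2, 3, 4 and concatenates the branch outputs per point. For brevity this sketch searches neighbors in feature space, whereas the paper's PASPP builds its graphs in metric space; the branch weights and shapes are our assumptions:

```python
import numpy as np

def paspp(feats, k, rates, Ws):
    """Point Atrous Spatial Pyramid Pooling (Eqs. (3)-(4)): parallel PAC
    branches with different sampling rates, fused by concatenation (+)."""
    branches = [point_atrous_conv(feats, k, r, W)   # X'_{pi}, one per rate r_i
                for r, W in zip(rates, Ws)]
    return np.concatenate(branches, axis=-1)        # X'_p, shape (N, 4 * C_out)

feats = np.random.randn(256, 64)
Ws = [0.1 * np.random.randn(128, 64) for _ in range(4)]   # one shared mlp per branch
out = paspp(feats, k=5, rates=[1, 2, 3, 4], Ws=Ws)
print(out.shape)                                          # (256, 256)
```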
C. Our PointAtrousNet Architecture

As shown in Fig. 4, our classification network and segmentation network share the same encoder architecture, which encodes global point features by exploiting neighborhood contexts at multiple scales. First, we lay out 4 successive PAC layers with increasing sampling rates. The neighborhood graphs of these PAC layers are constructed in feature spaces. Afterwards, we add a PASPP module to explicitly extract multi-scale edge features for each 3D point. Thereafter, we apply a shared mlp layer to propagate each point feature to a high-dimensional feature space, followed by a global max-pooling operation to encode the global point feature. Classification is then performed by regressing the encoded global point feature.

Our PAN for segmentation tasks follows an encoder-decoder architecture. After the encoder, we concatenate the regressed global feature with each point feature of the output from our PASPP module, and treat the result as the input to our decoder. Similar to the encoder, our decoder has 4 successive PAC layers with decreasing sampling rates. We also add skip links to directly propagate the extracted point features from our encoder to the corresponding layers in the decoder. These directly propagated point features are concatenated with the previous output point features, and the result is treated as the input to the current PAC layer in our decoder. Segmentation tasks are treated as per-point classification in our experiments. Therefore, we progressively propagate point features and then concatenate the point features from each PAC layer (with different sampling rates) of our decoder to perform the final per-point inference.

Fig. 4. Our classification and segmentation networks share the same encoder architecture, which consists of 4 PAC layers (with sampling rates r = 1, 2, 4, 8 as labeled in the figure), a PASPP module, a shared mlp, a global max-pooling and two fully-connected modules. Our classification network directly regresses the global point features generated by our encoder, whereas our segmentation network propagates high-dimensional features for each point: we concatenate the global feature with each point feature from our PASPP as the input to the decoder. Note that we take the concatenation operation out of our PASPP module in this figure for better illustration.

D. Discussion

Previous deep networks [5], [4], [14] have revealed the effectiveness of exploiting local geometric details to improve the understanding of 3D points. However, PointNet++ [4] requires an extremely large training memory and a long training time. Other networks, such as DGCNN and SpiderCNN, cannot explicitly learn multi-scale edge features. On the other hand, atrous convolution has been validated and successfully applied in many image-based applications due to its capability of exploiting multi-scale local contexts. In view of this, our PAN extends the idea of 2D atrous convolution and proposes our PAC module, which can exploit local geometric details at multiple scales in 3D points. Specifically, our PAC module can enlarge the receptive field of our convolution filter without increasing the computation load.

TABLE I
SHAPE CLASSIFICATION RESULTS ON MODELNET40 [12] BENCHMARK.

Method           Input           Up-axis Rotation   Acc.
PointNet [3]     1024 points     X                  89.2
PointNet++ [4]   1024 points     X                  90.7
DGCNN [5]        1024 points     X                  92.2
PAN (Ours)       1024 points     X                  92.2
PointNet++ [4]   5000 points + ...
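To show how the pieces of Sec. IV-C fit together, here is a hedged sketch of the shared encoder built from the two functions above: four successive PAC layers with increasing rates (we read r = 1, 2, 4, 8 off the Fig. 4 labels), a PASPP module, a shared mlp and a global max-pool. All channel widths, the value of k and the rate schedule are our assumptions, not the published configuration:

```python
import numpy as np

def pan_encoder(feats, Ws_pac, Ws_paspp, W_mlp):
    """PAN encoder (Sec. IV-C / Fig. 4): 4 successive PAC layers with
    increasing sampling rates, a PASPP module, a shared mlp, then a
    global max-pool that yields one global feature for classification."""
    x = feats
    for r, W in zip([1, 2, 4, 8], Ws_pac):        # graphs built in feature space
        x = point_atrous_conv(x, k=5, r=r, W=W)
    x = paspp(x, k=5, rates=[1, 2, 3, 4], Ws=Ws_paspp)   # (N, 256)
    x = np.maximum(x @ W_mlp, 0.0)                # shared mlp to high-dim space
    return x.max(axis=0)                          # global max-pooling over points

feats = np.random.randn(256, 64)                          # per-point input features
Ws_pac = [0.1 * np.random.randn(128, 64) for _ in range(4)]
Ws_paspp = [0.1 * np.random.randn(128, 64) for _ in range(4)]
W_mlp = 0.05 * np.random.randn(256, 1024)
print(pan_encoder(feats, Ws_pac, Ws_paspp, W_mlp).shape)  # (1024,) global feature
```

Per Fig. 4, the classification head would then regress class scores from this global vector with the two fully-connected modules, while the segmentation decoder instead concatenates it back onto the per-point PASPP features.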