Deep Supervised Hashing with Similar Hierarchy for Place Recognition

Lang Wu and Yihong Wu*

2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019

Lang Wu is with the School of Software, University of Chinese Academy of Sciences, China (* corresponding author, e-mail: yhwu).

Abstract— Place recognition, as one of the most significant requirements for long-term simultaneous localization and mapping (SLAM), has developed rapidly in recent years. Deep learning has also proved more capable than traditional methods at extracting features in complex environments. However, real-world environments pose many challenges such as viewpoint and illumination changes. Existing deep learning-based place recognition is time-consuming in both the feature extraction and the matching stage. Moreover, the features extracted from a convolutional neural network (CNN) are high-dimensional floating-point vectors. In this paper, we propose deep supervised hashing for place recognition, where we design a similar hierarchy loss function to learn a model. The model distinguishes similar images more accurately, which is well suited to place recognition. Besides, the model learns high-quality hash codes by maximizing the likelihood of triplet labels. Experiments on several benchmark place recognition datasets show that our approach is robust to viewpoint, illumination and season changes with high accuracy. Furthermore, the trained model can extract features and match in real time on a CPU with low memory consumption.

I. INTRODUCTION

SLAM has become more and more popular because of its applications in robotics and augmented reality [1]. The task of visual place recognition is to recognize a place in the environment through vision sensors. Place recognition in loop closure detection for SLAM can be considered an image retrieval problem that matches the current scene to a previously visited location [2]. In the past few years, many algorithms have addressed this problem by comparing images as numerical vectors in a bag-of-words space built on local features [3]-[5]. Recent progress in computer vision and machine learning has shown that features generated by convolutional neural networks outperform hand-crafted features in visual recognition, classification and detection tasks [6], [7]. State-of-the-art performance in place recognition can be achieved by utilizing features from CNNs [8].

Hashing has attracted much attention because of the rapid growth of image and video data on the web. It is one of the most popular techniques for image or video retrieval due to its low computational cost and high storage efficiency. Briefly speaking, hashing encodes high-dimensional data into a set of binary codes that preserve the similarity of images. Compared with the Euclidean distance between two high-dimensional image features, the Hamming distance between binary codes speeds up the matching process. Thus, hashing is one of the most popular and powerful techniques for approximate nearest neighbor (ANN) search. However, existing hashing methods focus on classifying images for the retrieval problem; there has been no attempt on other tasks such as place recognition.

In this paper, we first analyze the difference between image retrieval and place recognition. Then, a similar hierarchy deep hashing is proposed to solve the place recognition task. The contributions of this work are summarized as follows. (1) We use deep supervised hashing to solve the place recognition problem under drastic viewpoint, illumination and environment changes.
(2) A novel method called similar hierarchy deep hashing is proposed to make deep hashing more suitable for place recognition. (3) We design a novel triplet loss function. (4) Experiments obtain competitive performance on benchmark datasets; meanwhile, the network can extract features and match in real time on a CPU with low memory consumption.

II. RELATED WORK

We mainly utilize hashing technology to fulfill place recognition. Thus, related work is introduced from two aspects: (1) place recognition and (2) hashing.

A. Place Recognition

Place recognition based on appearance has received great attention in the robotics community. In fact, images corresponding to the same place may present entirely different appearances. Many studies concentrate on visual bag-of-words built from hand-crafted local features [4], [9]. An example is FAB-MAP [4], which is well suited to gentle changes in viewpoint, illumination and places. Instead of calculating the most similar location from a given single image, SeqSLAM [10], [11] selects the best candidate place match from an image sequence. This approach achieves remarkable results under condition changes and even season changes. With the success of CNNs in recent years, some works introduce CNN-based place recognition and confirm that features extracted from CNNs achieve superior performance [12]-[16].

B. Hashing

Hashing methods for image retrieval can be grouped into two categories: data-independent methods and data-dependent methods. Data-independent methods such as locality sensitive hashing (LSH) [17] need longer hash codes to attain high accuracy. Data-dependent methods use training data to learn hash functions and can be categorized into supervised and unsupervised methods. Compared with unsupervised methods, supervised methods achieve better performance with fewer hash bits. Recently, hashing methods based on deep learning have shown superior performance over traditional hashing methods. Convolutional neural network hashing (CNNH) [18] is the first deep hashing method supervised by triplet labels. Network in network hashing (NINH) [19] presents a ranking loss that improves the performance of CNNH. Some methods such as DSRH [20] and SSDH [21] utilize semantic information to learn the hash function. In some supervised hashing methods, the labels are pairwise [22] or triplet [19], [23] labels that indicate semantic similarities between images. Both DPSH and DTSH learn the hash function by maximizing a posterior and achieve superior performance.

III. APPROACH

In traditional image retrieval tasks, the distance between two images from different classes, $d_{\text{inter-class}}$, should be large, while the distance between two images from the same class, $d_{\text{intra-class}}$, should be small in Hamming space. The distances should satisfy $d_{\text{inter-class}} \geq d_{\text{intra-class}} + m$, where $m$ is the margin. Place recognition is different from image retrieval and cannot be solved by traditional image retrieval methods, for the following reasons: (1) images in place recognition are continuous, without clear boundaries between different classes; (2) the number of image classes in place recognition is not known; (3) traditional image retrieval only considers whether two images are similar or not, but ignores the degree of similarity among similar images.
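As a concrete illustration of matching binary codes in Hamming space, the following minimal sketch (plain NumPy; the toy codes, the margin value and the helper names are illustrative, not from the paper) computes Hamming distances between 0/1 codes and checks the retrieval margin condition discussed above.

```python
import numpy as np

def hamming_distance(b1: np.ndarray, b2: np.ndarray) -> int:
    """Hamming distance between two binary codes given as 0/1 vectors.

    For 0/1 vectors this equals the squared Euclidean distance
    ||b1 - b2||^2 used later in the paper.
    """
    return int(np.sum(b1 != b2))

def satisfies_margin(d_intra: int, d_inter: int, margin: int) -> bool:
    """Check the retrieval constraint d_inter >= d_intra + margin."""
    return d_inter >= d_intra + margin

# toy example with L = 8-bit codes (values are illustrative only)
query       = np.array([1, 0, 1, 1, 0, 0, 1, 0])
same_place  = np.array([1, 0, 1, 0, 0, 0, 1, 0])   # 1 bit differs
other_place = np.array([0, 1, 0, 1, 1, 1, 0, 1])   # 7 bits differ

d_intra = hamming_distance(query, same_place)
d_inter = hamming_distance(query, other_place)
print(d_intra, d_inter, satisfies_margin(d_intra, d_inter, margin=3))
```

Because the codes are binary, this comparison reduces to counting differing bits, which is far cheaper than computing Euclidean distances between high-dimensional floating-point features.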
We propose a similar hierarchy method, which can distinguish similar images. Fig. 1(a) shows continuous images from a video in a place recognition task. The image $x_t$ is similar to both $x_{t+1}$ and $x_{t+2}$, which means the images $x_t$, $x_{t+1}$ and $x_{t+2}$ are in the same class, while $x_t$ is more similar to $x_{t+1}$. On the other hand, $x_t$ is dissimilar to $x_{t+3}$ and $x_{t+2}$ is similar to $x_{t+3}$, which means $x_t$ is in a different class from $x_{t+2}$ and $x_{t+3}$. This is a contradiction. The proposed similar hierarchy method can resolve this contradiction.

Assume a robot has visited some places $X = \{x_t\}_{t=1}^{N}$, where $N$ denotes the total number of images. Let $B = \{b_t\}_{t=1}^{N}$, where $b_t \in \{0,1\}^{L}$ denotes the corresponding hash code, $t$ denotes the timestamp, and $L$ is the length of the hash code. In Fig. 1(a), the robot moves continuously, so the image $x_t$ corresponding to the location at time $t$ is similar to $x_{t-2}$, $x_{t-1}$, $x_{t+1}$ and $x_{t+2}$ but far from $x_{t-d}$ and $x_{t+d}$, etc., where $d$ is the minimum time interval between two dissimilar images. For example, if two images are considered similar within 2 frames, $d$ is 3. Furthermore, $x_t$ is closer to $x_{t+1}$ than to $x_{t+2}$, but both of them are far from $x_{t+d}$. Obviously, $b_t$ should be more similar to $b_{t+1}$ than to $b_{t+2}$, and both of them should be dissimilar to $b_{t+d}$.

Two conditions need to be satisfied for place recognition using deep supervised hashing: (1) in order to further distinguish similar images, the distance between two hash codes needs to increase as the similarity degree decreases; this property is called similar hierarchy; (2) in order to use hash codes effectively, the distance between any two hash codes of dissimilar images needs to be the same.

Assume $x_t$ is similar to $x_{t+1}$ and $x_{t+2}$ but dissimilar to $x_{t+3}$, as shown in Fig. 1(a). The hash codes are shown in Fig. 1(b), where blue denotes 1 and yellow denotes 0, and the number of differing colors in corresponding positions denotes the Hamming distance. We can see that the Hamming distance between two adjacent locations is $2\beta$, where $\beta$ is a hyper-parameter. When $\mathrm{dist}_{t,t+1} = 2\beta$ and $\mathrm{dist}_{t,t+2} = 4\beta$, condition (1) is satisfied, where $\mathrm{dist}_{i,j}$ denotes the distance between $x_i$ and $x_j$. Meanwhile, condition (2) is satisfied when $\mathrm{dist}_{t,t+3}$, $\mathrm{dist}_{t,t+4}$ and $\mathrm{dist}_{t,t+5}$ all equal $6\beta$. Given the location at time $t+2$, $x_{t+1}$ and $x_{t+3}$ are retrieved first, and then $x_t$ and $x_{t+4}$ are retrieved, according to their hash codes. Through the above analysis, the similar hierarchy can well solve the unclear boundary problem in place recognition.

Fig. 1. Images in place recognition and their hash codes. (a) An example of images in place recognition from a video sequence. (b) Hash codes corresponding to the continuous images.

We use a triplet network to handle the unknown number of image classes. There are $N$ image samples $X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{d \times N}$, $M$ triplet labels $T = \{(q_i, p_i, n_i)\}_{i=1}^{M}$ with $q_i, p_i, n_i \in X$, and $M$ similarity degrees $\{\sigma_{q_i p_i}\}_{i=1}^{M}$, where $q_i$ is the $i$-th query image, $p_i$ is the positive image sample, $n_i$ is the negative image sample, and $\sigma_{q_i p_i}$ is the similarity degree between $q_i$ and $p_i$; $q_i$ is more similar to the positive image $p_i$ than to the negative image $n_i$. The smaller the similarity degree, the more similar the two images are. The aim is to learn hash codes $B \in \{0,1\}^{L \times N}$, where the length of the binary codes is $L$. The $i$-th column $b_i \in \{0,1\}^{L}$ of $B$ denotes the binary code for the $i$-th image. In practice, for a short video over a period of time, the distance (similarity degree) can be defined by the time interval between two locations.
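The following is a minimal sketch of how such triplets and similarity degrees could be sampled from a continuous image sequence, assuming the similarity degree of a similar pair is simply its time interval as described above; the function name, the sampling strategy and the parameter values are illustrative, not the authors' pipeline.

```python
import random

def build_triplets(num_frames: int, d: int, num_triplets: int, seed: int = 0):
    """Sample (query, positive, negative, similarity) tuples from a video.

    Frames closer than d steps are treated as similar; the similarity
    degree of a similar pair is its time interval (smaller = more similar).
    Negatives are at least d steps away from the query.
    """
    rng = random.Random(seed)
    triplets = []
    while len(triplets) < num_triplets:
        q = rng.randrange(num_frames)
        offset = rng.choice([o for o in range(-(d - 1), d) if o != 0])
        p = q + offset
        if not 0 <= p < num_frames:
            continue
        n = rng.randrange(num_frames)
        if abs(n - q) < d:          # negative must be dissimilar to the query
            continue
        sigma_qp = abs(p - q)       # similarity degree = time interval
        triplets.append((q, p, n, sigma_qp))
    return triplets

# example: 1000-frame sequence, frames within 2 steps are similar (d = 3)
print(build_triplets(num_frames=1000, d=3, num_triplets=5))
```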
For two binary codes $b_i$ and $b_j$, the Hamming distance is formulated as $\mathrm{dist}_{b_i, b_j} = \| b_i - b_j \|_2^2$, where $\| \cdot \|_2$ denotes the $\ell_2$ norm. Our goal is to learn hash codes $B$ such that $\mathrm{dist}_{q_i, p_i}$ preserves the information of $\sigma_{q_i p_i}$. Thus, $\mathrm{dist}_{q_i, n_i}$ and $\mathrm{dist}_{p_i, n_i}$ should be larger than $\mathrm{dist}_{q_i, p_i}$.

Most existing deep hashing methods learn image features and hash codes from images simultaneously [20], [21]. In this paper, we utilize triplet labels to learn both in an end-to-end manner. As shown in Fig. 2, the model consists of three components: (1) an image feature learning component, (2) a hash code learning component, and (3) a loss function component.

Fig. 2. Overview of the end-to-end manner of learning hashing.

Image Feature Learning: This component is designed to produce effective image features through CNNs. Most previous hashing methods adopt the AlexNet or CNN-F architecture for this component. MobileNet, based on depthwise separable convolutions (depthwise and pointwise filters), was proposed in [24]. Compared with AlexNet and CNN-F, MobileNet is more accurate with fewer model parameters and less computation. We use MobileNet as the feature extractor for place recognition and obtain real-time performance on a CPU. For more details about MobileNet, please refer to [24].

Hash Code Learning: This component is designed to learn hash codes. The activation of the MobileNet output layer used for classification is replaced by a sigmoid function. In particular, the number of nodes in the last fully connected layer equals the length of the target hash codes.

Loss Function: According to the previous analysis, the distance between the hash codes of images $i$ and $j$ can be expressed as:

$$\mathrm{dist}_{i,j} = \begin{cases} 2\beta\,\sigma_{i,j}, & \text{if } i \text{ and } j \text{ are similar}, \\ 2\beta\,\sigma_{\max}, & \text{if } i \text{ and } j \text{ are dissimilar}, \end{cases} \qquad (1)$$

where $\sigma_{i,j}$ denotes the similarity degree of two similar images, $\sigma_{\max}$, larger than any $\sigma_{i,j}$, denotes the threshold to distinguish similar and dissimilar images, and $\beta$ is a constant that controls the Hamming distance between hash codes of similar images. As mentioned above, $x_t$ is similar to $x_{t+1}$ and $x_{t+2}$ but dissimilar to $x_{t+3}$. In this paper, we choose $\sigma_{\max} = 3$, which is explained in Section IV. Different from the traditional triplet loss function, the proposed method can adjust the Hamming distance between two similar images through the similarity degree $\sigma$, which is more appropriate for a continuous process.

Inspired by DPSH [22] and DTSH [23], we design our loss function based on likelihood. Given the triplet labels $T$ and the distance ground truth determined by the similarity degrees $\Sigma = \{\sigma_{q_i p_i}\}_{i=1}^{M}$, the Maximum a Posteriori (MAP) estimation of the hash codes can be represented as:

$$p(B \mid T, \Sigma) = \frac{p(T, \Sigma \mid B)\, p(B)}{p(T, \Sigma)} \propto \prod_{(t_i, \sigma_i)} p(t_i, \sigma_i \mid B)\, p(B), \qquad (2)$$

where $p(T, \Sigma \mid B)$ denotes the likelihood and $p(B)$ denotes the prior. The conditional probability $p(t_i, \sigma_i \mid B)$ is defined as follows:

$$p(t_i, \sigma_i \mid B) = p(d_{q_i n_i} \mid B)\, p(d_{p_i n_i} \mid B)\, p_l(d_{q_i p_i} \mid B)\, p_u(d_{q_i p_i} \mid B), \qquad (3)$$

with $p(d_{q_i n_i} \mid B)$, $p(d_{p_i n_i} \mid B)$, $p_l(d_{q_i p_i} \mid B)$ and $p_u(d_{q_i p_i} \mid B)$ defined as:

$$\begin{aligned} p(d_{q_i n_i} \mid B) &= \varphi\big(d_{q_i n_i} - d_{q_i p_i} - 2\beta(\sigma_{\max} - \sigma_{q_i p_i})\big), \\ p(d_{p_i n_i} \mid B) &= \varphi\big(d_{p_i n_i} - d_{q_i p_i} - 2\beta(\sigma_{\max} - \sigma_{q_i p_i})\big), \\ p_l(d_{q_i p_i} \mid B) &= \varphi\big(d_{q_i p_i} - 2\beta\sigma_{q_i p_i} + 1\big), \\ p_u(d_{q_i p_i} \mid B) &= \varphi\big(2\beta\sigma_{q_i p_i} + 1 - d_{q_i p_i}\big), \end{aligned} \qquad (4)$$

where $d_{q,p}$ is the distance between the binary codes of $q$ and $p$ in Hamming space, $\varphi(x) = 1 / (1 + \exp(-x))$ is the sigmoid function, and the last two formulations mean that we set minimum and maximum distances for two similar samples.
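To make (3)-(4) concrete, the following is a minimal PyTorch sketch of the four likelihood terms for a batch of triplets, assuming the network outputs relaxed codes in [0, 1] and using the squared Euclidean distance as the relaxed Hamming distance; the function name, tensor shapes and default values are illustrative choices, not the authors' code.

```python
import torch
import torch.nn.functional as F

def triplet_log_likelihood(bq, bp, bn, sigma_qp, sigma_max=3.0, beta=1.0):
    """Sum of the four log-probability terms of (3)-(4) for one batch.

    bq, bp, bn: relaxed hash codes in [0, 1], shape (batch, L).
    sigma_qp:   similarity degree of each (query, positive) pair, shape (batch,).
    """
    d_qp = ((bq - bp) ** 2).sum(dim=1)          # relaxed Hamming distances
    d_qn = ((bq - bn) ** 2).sum(dim=1)
    d_pn = ((bp - bn) ** 2).sum(dim=1)

    gap = 2.0 * beta * (sigma_max - sigma_qp)   # required margin to the negative
    target = 2.0 * beta * sigma_qp              # target distance to the positive

    log_p = (F.logsigmoid(d_qn - d_qp - gap)        # query further from negative
             + F.logsigmoid(d_pn - d_qp - gap)      # positive further from negative
             + F.logsigmoid(d_qp - target + 1.0)    # lower bound on d_qp
             + F.logsigmoid(target + 1.0 - d_qp))   # upper bound on d_qp
    return log_p  # maximize this, i.e. minimize -log_p.mean()
```

Minimizing the negative of these log-probabilities over a batch, together with the quantization term introduced below, corresponds to the loss in (9); an autograd framework then supplies gradients of the form in (10).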
From (4) we can see that the larger $d_{q_i n_i} - d_{q_i p_i}$ is, the larger $p(d_{q_i n_i} \mid B)$ will be, and $p(d_{p_i n_i} \mid B)$ behaves in the same way. Moreover, if $d_{q_i p_i}$ satisfies $2\beta\sigma_{q_i p_i} - 1 \leq d_{q_i p_i} \leq 2\beta\sigma_{q_i p_i} + 1$, both $p_l(d_{q_i p_i} \mid B)$ and $p_u(d_{q_i p_i} \mid B)$ will be large, and vice versa. Therefore, (4) is reasonable according to the above analysis. Taking the negative log-likelihood of (2), we propose the following loss function to learn the hash codes:

$$J = -\log p(T, \Sigma \mid B) = -\sum_{(t_i, \sigma_i)} \log p(t_i, \sigma_i \mid B). \qquad (5)$$

Combining (3) and (4), (5) can be rewritten as:

$$J = -\sum_{i=1}^{M} \sum_{p(d \mid B) \in t_i^{f}} \log p(d \mid B), \qquad (6)$$

$$t_i^{f} = \big\{ p(d_{q_i n_i} \mid B),\ p(d_{p_i n_i} \mid B),\ p_l(d_{q_i p_i} \mid B),\ p_u(d_{q_i p_i} \mid B) \big\}. \qquad (7)$$

As mentioned before, the sigmoid activation function is utilized in the last fully connected layer. Thus, the model output $b_i$ for image $x_i$ is in the range [0, 1]. Inspired by [21], a constraint that maximizes the sum of squared errors between $b_i$ and 0.5, i.e. $\| b_i - 0.5\,e \|_2^2$, pushes the codes toward either 0 or 1, where $e$ is the $L$-dimensional vector with all elements equal to 1 and $\| \cdot \|_2$ is the $\ell_2$ norm. Furthermore, combining (6) with this constraint, the loss function to quantize the hash codes can be written as:

$$L = -\frac{1}{M} \sum_{i=1}^{M} \Big[ \log p(d_{q_i n_i} \mid B) + \log p(d_{p_i n_i} \mid B) + \log p_l(d_{q_i p_i} \mid B) + \log p_u(d_{q_i p_i} \mid B) \Big] - \frac{\lambda}{N} \sum_{i=1}^{N} \big\| b_i - 0.5\,e \big\|_2^2, \qquad (9)$$

where $b_i$ is the hash code for image $x_i$, and $\lambda$ is a hyper-parameter to balance the negative log-likelihood and the quantization error. A large $\lambda$ makes the model enter the saturation region of the sigmoid function too early, which leads to a long training process, while a small $\lambda$ makes the output of the model difficult to binarize. For iterative optimization, the derivative of the loss function $L$ with respect to the query code $b_{q_i}$ (the derivatives with respect to $b_{p_i}$ and $b_{n_i}$ are analogous) is:

$$\begin{aligned} \frac{\partial L}{\partial b_{q_i}} = \frac{1}{M} \Big[ & 2\varphi\big(2\beta(\sigma_{\max} - \sigma_{q_i p_i}) + d_{q_i p_i} - d_{q_i n_i}\big)\,(b_{n_i} - b_{p_i}) \\ & + 2\varphi\big(2\beta(\sigma_{\max} - \sigma_{q_i p_i}) + d_{q_i p_i} - d_{p_i n_i}\big)\,(b_{q_i} - b_{p_i}) \\ & + 2\varphi\big(2\beta\sigma_{q_i p_i} - 1 - d_{q_i p_i}\big)\,(b_{p_i} - b_{q_i}) \\ & + 2\varphi\big(d_{q_i p_i} - 2\beta\sigma_{q_i p_i} - 1\big)\,(b_{q_i} - b_{p_i}) \Big] - \frac{2\lambda}{N}\,(b_{q_i} - 0.5\,e). \end{aligned} \qquad (10)$$

The proposed loss function not only considers the relationship between $d_{q,p}$ and $d_{q,n}$, but also constrains $d_{q,p}$ to control the similar hierarchy of similar images. Experiments in the next section verify that the similar hierarchy brings better performance for place recognition.

IV. EXPERIMENTS

We conduct extensive experiments on two widely used benchmark datasets: the Nordland dataset and the Gardens Point dataset. The Nordland dataset consists of video footage of a 729-kilometer-long train ride. The 10-hour-long video was recorded from the perspective of the train driver four times, once in every season, so there are four subsequences in this dataset corresponding to the four seasons. In our experiments, image frames are extracted at 1 fps, and images taken in tunnels or while the train was stopped are excluded. 25% of all extracted images are used for the training set and the rest for testing. The Gardens Point dataset was recorded on the Gardens Point Campus of QUT. It consists of three subsequences: one at night and two during the day. One of the day traverses was recorded on the left side of the walkway, while the other day traverse and the night traverse were recorded on the right side of the walkway. Thus, the dataset exhibits both illumination and viewpoint changes. We use two subsequences as training data and the other for testing. Recent work [8] uses 90% of the total images as training data and the rest for testing. The proposed method in this paper can dig more information from
