IROS 2019 International Conference Proceedings 0365

Abstract: Place recognition, one of the most significant requirements for long-term simultaneous localization and mapping (SLAM), has developed rapidly in recent years. Deep learning has also proved more capable than traditional methods at extracting features in some complex environments. However, real-world environments pose many challenges, such as viewpoint changes and illumination changes. In existing deep-learning-based place recognition, both the feature extraction phase and the matching process are time consuming. Moreover, features extracted from a convolutional neural network (CNN) are high-dimensional floating-point vectors. In this paper, we propose deep supervised hashing for place recognition, in which we design a similar hierarchy loss function to learn a model. The model can distinguish similar images more accurately, which makes it well suited to place recognition. In addition, the model can learn high-quality hash codes by maximizing the likelihood of triplet labels. Experiments on several benchmark datasets for place recognition show that our approach is robust to viewpoint, illumination, and season changes with high accuracy. Furthermore, the trained model can extract features and match in real time on a CPU with less memory consumption.

I. INTRODUCTION

SLAM has become more and more popular because of its applications in robotics and augmented reality [1]. The task of visual place recognition is to recognize a place in the environment through vision sensors. Place recognition in loop closure detection for SLAM can be considered an image retrieval problem, which matches the current scene to a previously visited location [2]. In the past few years, many algorithms have addressed this problem by comparing images as numerical vectors in a bag-of-words space built on local features [3]-[5]. Recent progress in computer vision and machine learning has shown that features generated by convolutional neural networks outperform hand-crafted features in visual recognition,
classification, and detection tasks [6], [7]. State-of-the-art performance in place recognition can be achieved by using features from CNNs [8].

Hashing has attracted much attention because of the rapid growth of image and video data on the web. It is one of the most popular techniques for image or video retrieval due to its low computational cost and high storage efficiency. Briefly speaking, hashing is used to encode high-dimensional data into a set of binary codes that preserves the similarity of images. Compared with the Euclidean distance between two high-dimensional image features, the Hamming distance between binary codes can speed up the matching process. Thus, it is one of the most popular and powerful techniques for approximate nearest neighbor (ANN) search. However, existing hashing methods all focus on classifying images for the retrieval problem. There has been no attempt on other tasks such as place recognition. In this paper, we first analyze the difference between image retrieval and place recognition. Then a similar hierarchy deep hashing is proposed to solve the place recognition task.

The contributions of this work are summarized as follows:
1) We use deep supervised hashing to solve the place recognition problem under drastic viewpoint, illumination, and environment changes.
2) A novel method called similar hierarchy deep hashing is proposed to make deep hashing more suitable for place recognition.
3) We design a novel triplet loss function.
4) Experiments obtain competitive performance on benchmark datasets; meanwhile, the network can extract features and match in real time on a CPU with less memory consumption.

(Lang Wu is with School of Software, University of Chinese Academy of Sciences, China. Corresponding author e-mail: yhwu)

II. RELATED WORK

We mainly use hashing technology to perform place recognition. Thus, the related work is introduced from two aspects: 1) place recognition and 2) hashing.

A. Place Recognition

Place recognition based on appearance has received great attention in the robotics community. In
fact, images corresponding to the same place may present entirely different appearances. Many studies concentrate on visual bag-of-words built from hand-crafted local features [4], [9]. An example is FAB-MAP [4], which is well suited to gentle changes in viewpoint, illumination, and places. Instead of calculating the most similar location from a given single image, SeqSLAM [10], [11] selects the best matching place candidate from an image sequence. This approach achieves remarkable results under condition changes and even season changes. With the success of CNNs in recent years, some works introduce CNN-based place recognition and confirm that features extracted from CNNs achieve superior performance [12]-[16].

B. Hashing

Hashing methods for image retrieval can be grouped into two categories: data-independent methods and data-dependent methods. Data-independent methods, such as locality-sensitive hashing (LSH) [17], need longer hash codes to attain high accuracy. Data-dependent methods use training data to learn hash functions and can be categorized into supervised and unsupervised methods. Compared with unsupervised methods, supervised methods can achieve better performance with fewer hash codes.

(Deep Supervised Hashing with Similar Hierarchy for Place Recognition. Lang Wu and Yihong Wu. 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, November 4-8, 2019. 978-1-7281-4003-2/19/\$31.00 ©2019 IEEE.)

Recently, hashing methods based on deep learning have shown superior performance over traditional hashing methods. Convolutional neural network hashing (CNNH) [18] is the first deep hashing method. Supervised by triplet labels, network-in-network hashing (NINH) [19] introduces a ranking loss that improves on the performance of CNNH. Some methods, such as DSRH [20] and SSDH [21], use semantic information to learn hash functions. In some supervised hashing methods, the labels take the form of pairs [22] or triplets [19], [23] that indicate semantic similarities between images. Both DPSH and DTSH learn hash functions
by maximum a posteriori estimation and achieve superior performance.

III. APPROACH

In traditional image retrieval tasks, the distance $d_{\text{inter-class}}$ between two images from different classes should be long, while the distance $d_{\text{intra-class}}$ between two images from the same class should be short in Hamming space. The distances should satisfy $d_{\text{inter-class}} - d_{\text{intra-class}} > m$, where $m$ is the margin.

Place recognition is different from image retrieval and cannot be solved by traditional image retrieval methods, for the following reasons:
1) Images in place recognition are continuous, without clear boundaries between classes.
2) The number of image classes in place recognition is not clear.
3) Traditional image retrieval only considers whether two images are similar or not and ignores the degree of similarity among similar images.

We propose a similar hierarchy method that can distinguish between similar images. Fig. 1(a) shows continuous images from a video in a place recognition task. The image $x_t$ is similar to both $x_{t+1}$ and $x_{t+2}$, which means that $x_t$, $x_{t+1}$, and $x_{t+2}$ are in the same class, while $x_t$ is more similar to $x_{t+1}$. On the other hand, $x_t$ is dissimilar to $x_{t+3}$ while $x_{t+2}$ is similar to $x_{t+3}$, so $x_{t+2}$ shares a class with both $x_t$ and $x_{t+3}$, yet $x_t$ and $x_{t+3}$ are in different classes. This is a contradiction. The proposed similar hierarchy method can resolve this contradiction.

Assume a robot has visited some places $X = \{x_t\}_{t=1}^{N}$, where $N$ denotes the total number of images. Let $B = \{b_t\}_{t=1}^{N}$, where $b_t \in \{0,1\}^{L}$ denotes the corresponding hash code, $t$ denotes the timestamp, and $L$ is the length of the hash code. In Fig. 1(a), the robot moves continuously, so the image $x_t$ corresponding to the location at time $t$ is similar to $x_{t-2}$, $x_{t-1}$, $x_{t+1}$, and $x_{t+2}$ but far away from $x_{t-d}$, $x_{t+d}$, etc., where $d$ is the minimum time interval between two dissimilar images. For example, if two images are considered similar within 2 frames, $d$ is 3. Furthermore, $x_t$ is closer to $x_{t+1}$ than to $x_{t+2}$, but both of them are far from $x_{t+d}$. Accordingly, $b_t$ is more similar to $b_{t+1}$ than to $b_{t+2}$, and both of them are dissimilar to $b_{t+d}$. Two
conditions need to be satisfied in place recognition using deep supervised hashing:
1) To further distinguish similar images, the distance between two hash codes needs to increase as the images become less similar. This property is called similar hierarchy.
2) To use hash codes effectively, the distance between any two hash codes of dissimilar images needs to be the same.

Let us assume that $x_t$ is similar to $x_{t+1}$ and $x_{t+2}$ but dissimilar to $x_{t+3}$, as shown in Fig. 1(a). Hash codes are shown in Fig. 1(b), where blue denotes 1 and yellow denotes 0, and the number of positions whose colors differ gives the Hamming distance. We can see that the Hamming distance between two adjacent locations is $2\alpha$, where $\alpha$ is a hyper-parameter. When $dist_{t,t+1} = 2\alpha$ and $dist_{t,t+2} = 4\alpha$, condition 1) is satisfied, where $dist_{i,j}$ denotes the distance between $x_i$ and $x_j$. Meanwhile, condition 2) is satisfied when $dist_{t,t+3}$, $dist_{t,t+4}$, and $dist_{t,t+5}$ all equal $6\alpha$. Given the location at time $t+2$, $x_{t+1}$ and $x_{t+3}$ are retrieved first, and then $x_t$ and $x_{t+4}$ are retrieved later according to their hash codes. Through the above analysis, the similar hierarchy can solve the unclear-boundary problem in place recognition.

Fig. 1. Images in place recognition and their hash codes. (a) An example of images in place recognition from a video sequence ($x_t$, $x_{t+1}$, $x_{t+2}$, $x_{t+3}$, ..., $x_{t+d}$). (b) Hash codes corresponding to continuous images ($b_t$, $b_{t+1}$, ..., $b_{t+5}$).

We use a triplet network to handle the unclear number of image classes. There are $N$ image samples $X = \{x_i\}_{i=1}^{N} \in \mathbb{R}^{d \times N}$, $M$ triplet labels $T = \{(q_i, p_i, n_i) \mid q_i, p_i, n_i \in X\}_{i=1}^{M}$, and $M$ similarity degrees $\{\delta_{q_i p_i}\}_{i=1}^{M}$, where $q_i$ is the $i$th query image, $p_i$ is the positive image sample, $n_i$ is the negative image sample, and $\delta_{q_i p_i}$ is the similarity degree between $q_i$ and $p_i$. The query $q_i$ is more similar to the positive image $p_i$ than to the negative image $n_i$. The smaller the similarity degree, the more similar the two images are. The aim is to learn hash codes $B \in \{0,1\}^{L \times N}$, where the length of
binary codes is $L$. The $i$th column $b_i \in \{0,1\}^{L}$ of $B$ denotes the binary code for the $i$th image. In practice, for a short video over a period of time, the distance can be defined by the time interval between two locations. For two binary codes $b_i$ and $b_j$, the Hamming distance is formulated as $dist_{b_i b_j} = \|b_i - b_j\|_2^2$, where $\|\cdot\|_2$ is the $l_2$ norm. Our goal is to learn hash codes $B$ such that $dist_{q_i p_i}$ preserves the information in the similarity degree $\delta_{q_i p_i}$. Thus, $dist_{q_i n_i}$ and $dist_{p_i n_i}$ should be larger than $dist_{q_i p_i}$.

Most existing deep hashing methods learn image features and hash codes from images simultaneously [20], [21]. In this paper, we use triplet labels to learn both in an end-to-end manner. As shown in Fig. 2, the model consists of three components: 1) an image feature learning component, 2) a hash code learning component, and 3) a loss function component.

Fig. 2. Overview of the end-to-end manner to learn hashing. Panels: Image Feature Learning (three CNNs with weights $W$ for the query $q$, positive $p$, and negative $n$ images), Hash Learning (hash codes $b_q$, $b_p$, $b_n$), and Loss Function.

Image Feature Learning: This component is designed to produce effective image features through CNNs. Most previous hashing methods adopt the AlexNet or CNN-F architecture in this component. MobileNet, based on depthwise separable convolutions that combine depthwise and pointwise filters, was proposed in [24]. Compared with AlexNet and CNN-F, MobileNet is more accurate with fewer model parameters and less computation. We use MobileNet as the feature extractor in place recognition and achieve real-time performance on a CPU. For more details about MobileNet, please refer to [24].

Hash Code Learning: This component is designed to learn hash codes. The activation in the MobileNet output layer for classification is replaced by a sigmoid function. In particular, the number of nodes in the last fully connected layer equals the length of the target hash codes.

Loss Function: According to the previous analysis, the distance between the hash codes of images $i$ and $j$ can be expressed as

$$dist_{ij} = \begin{cases} 2\alpha\,\delta_{ij}, & \text{if } i \text{ and } j \text{ are similar} \\ 2\alpha\,\delta_{\max}, & \text{if } i \text{ and } j \text{ are dissimilar} \end{cases} \tag{1}$$

where $\delta_{ij}$ denotes the similarity degree of two similar images, $\delta_{\max}$ (larger than $\delta_{ij}$) denotes the threshold that distinguishes similar from dissimilar images, and $\alpha$ is a constant that controls the Hamming distance between two hash codes corresponding to similar images. As mentioned above, $x_t$ is similar to $x_{t+1}$ and $x_{t+2}$ but dissimilar to $x_{t+3}$. In this paper, we choose $\delta_{\max} = 3$, which is explained in Section IV. Unlike the traditional triplet loss function, the proposed method can adjust the Hamming distance between two similar images according to their similarity, which is more appropriate for a continuous process.

Inspired by DPSH [22] and DTSH [23], we design our loss function based on likelihood. Given the triplet labels $T$ and the ground-truth distances, the maximum a posteriori (MAP) estimate of the hash codes can be represented as

$$p(B \mid T) = \frac{p(T \mid B)\,p(B)}{p(T)} \propto p(T \mid B)\,p(B) = \prod_{t_i \in T} p(t_i \mid B)\,p(B) \tag{2}$$

where $p(T \mid B)$ denotes the likelihood and $p(B)$ denotes the prior. The conditional probability $p(t_i \mid B)$ is defined as follows:

$$p(t_i \mid B) = p(d_{q_i n_i} \mid B)\,p(d_{p_i n_i} \mid B)\,p(d^{l-}_{q_i p_i} \mid B)\,p(d^{l+}_{q_i p_i} \mid B) \tag{3}$$

with $p(d_{q_i n_i} \mid B)$, $p(d_{p_i n_i} \mid B)$, $p(d^{l-}_{q_i p_i} \mid B)$, and $p(d^{l+}_{q_i p_i} \mid B)$ defined as

$$\begin{aligned}
p(d_{q_i n_i} \mid B) &= \sigma\big(d_{q_i n_i} - d_{q_i p_i} - 2\alpha(\delta_{\max} - \delta_{q_i p_i})\big) \\
p(d_{p_i n_i} \mid B) &= \sigma\big(d_{p_i n_i} - d_{q_i p_i} - 2\alpha(\delta_{\max} - \delta_{q_i p_i})\big) \\
p(d^{l-}_{q_i p_i} \mid B) &= \sigma\big(d_{q_i p_i} - (2\delta_{q_i p_i} - 1)\alpha\big) \\
p(d^{l+}_{q_i p_i} \mid B) &= \sigma\big((2\delta_{q_i p_i} + 1)\alpha - d_{q_i p_i}\big)
\end{aligned} \tag{4}$$

Here $d_{qp} = dist_{qp}$ is the distance between the binary codes of $q$ and $p$ in Hamming space, $\sigma(x)$ is the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$, and the last two formulations mean that we set minimum and maximum distances for two similar samples. From (4), we can see that the larger $dist_{q_i n_i} - dist_{q_i p_i}$ is, the larger $p(d_{q_i n_i} \mid B)$ will be, and likewise for $p(d_{p_i n_i} \mid B)$. Moreover, if $dist_{q_i p_i}$ satisfies $(2\delta_{q_i p_i} - 1)\alpha < d_{q_i p_i} < (2\delta_{q_i p_i} + 1)\alpha$, then $p(d^{l-}_{q_i p_i} \mid B)$ and $p(d^{l+}_{q_i p_i} \mid B)$ will be large, and vice versa. Therefore, (4) is reasonable according to the analysis.

By (2) and taking the negative log-likelihood, we propose the following loss function to learn hash codes:

$$J = -\log p(T \mid B) = -\sum_{t_i \in T} \log p(t_i \mid B) \tag{5}$$

Combining with (3) and (4), (5) can be rewritten as

$$J = -\sum_{i=1}^{M} \sum_{ds_j \in tf_i} \log ds_j(B) \tag{6}$$

$$tf_i = \big\{ p(d_{q_i n_i} \mid B),\; p(d_{p_i n_i} \mid B),\; p(d^{l-}_{q_i p_i} \mid B),\; p(d^{l+}_{q_i p_i} \mid B) \big\} \tag{7}$$

As mentioned before,
the sigmoid activation function is used in the last fully connected layer. Thus, the model output $b_i$ for image $x_i$ lies in the range from 0 to 1. Inspired by [21], a constraint that maximizes the sum of squared errors between $b_i$ and 0.5 pushes the codes toward either 0 or 1, that is, $\|b_i - 0.5e\|_2^2$, where $e$ is the $L$-dimensional vector with all elements equal to 1 and $\|\cdot\|_2$ is the $l_2$ norm. Furthermore, by (6) and this constraint $\|b_i - 0.5e\|_2^2$, the loss function to quantize the hash codes can be written as

$$\mathcal{L} = -\frac{1}{M}\sum_{i=1}^{M}\Big[\log p(d_{q_i n_i} \mid B) + \log p(d_{p_i n_i} \mid B) + \log p(d^{l-}_{q_i p_i} \mid B) + \log p(d^{l+}_{q_i p_i} \mid B)\Big] - \lambda\,\frac{1}{L}\sum_{i=1}^{N}\|b_i - 0.5e\|_2^2 \tag{9}$$

where $b_{q_i}$ is the hash code for image $q_i$ and $\lambda$ is a hyper-parameter that balances the negative log-likelihood and the quantization error. A large $\lambda$ makes the model enter the saturation region of the sigmoid function too early, which leads to a long training process. However, a small $\lambda$ makes it difficult for the model output to become binary. For iterative optimization, the derivative of the loss function $\mathcal{L}$ with respect to the hash codes is a sum over all triplets $t_i \in T$ of the contributions of the four likelihood terms in (4), each proportional to a difference of hash codes such as $2(b_{q_i} - b_{p_i})$ (10).

This proposed loss function not only considers the relationship between $d_{qp}$ and $d_{qn}$ but also constrains $d_{qp}$ to control the similar hierarchy for similar images. Experiments in the next section verify that the similar hierarchy yields better performance for place recognition.

IV. EXPERIMENTS

We conduct extensive experiments on two widely used benchmark datasets: the Nordland dataset and the Gardens Point dataset.

The Nordland dataset consists of video footage of a 729-kilometer train ride. A 10-hour video has been recorded from the perspective of the train driver four times, once in every season. So there are four subsequences in this dataset, corresponding to the four seasons. In our experiments, image frames are extracted at 1 fps, and images taken in tunnels or when the train was stopped are excluded. 25% of all extracted images are used for training and
the others for testing.

The Gardens Point dataset is recorded on the Gardens Point Campus of QUT. It consists of three subsequences, one at night and two during the day. One of the day traverses is recorded on the left side of the walkway, while the other day traverse and the night traverse are recorded on the right side of the walkway. Thus, the dataset exhibits both illumination and viewpoint changes. We use two subsequences as training data and the other for testing.

Recent work [8] uses 90% of the total images as training data and the rest for testing. The proposed method can extract more information from a continuous image sequence, so only 25% of the images are used as training data in our experiments and the rest for testing. Images are extracted from video at 1 fps. We found that images are similar within two frames, so $\delta_{\max}$ is set to 3 in the experiments. Place recognition is performed by single-image nearest neighbor search based on the Hamming distance between hash codes. Performance is evaluated using the precision-recall relationship, via precision-recall curves and F1 scores. To be fair, we use the same conditions as in [13]. A match is a candidate positive if it passes a ratio test (the ratio of the distance of the best match over that of the second-best match found in the nearest neighbor search) and negative otherwise. A match is a true positive when it is within 1 frame of the ground truth. The precision-recall curve is created by varying the threshold of the ratio test. We compare the performance of our method with those in [10], [11], and [13]. We evaluate our method with TensorFlow 1.10 on an i5-5200U 2.50 GHz CPU and a GeForce GTX 1080 Ti GPU.

A. Performance on Nordland Dataset

Fig. 3. Performance of the proposed method on the Nordland dataset. (a) Images from left to right are examples from the spring, summer, fall, and winter subsequences of the Nordland dataset, respectively. (b) P-R curve of the proposed method on the Nordland dataset.

In this section, we evaluate the robustness of our method for place recognition on the Nordland dataset. All images are resized to 224 × 224 and normalized to [0, 1]
before feeding i

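The similar hierarchy and the ratio-test matching described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes the target distance of Eq. (1) is $2\alpha\delta$ for similar images and is capped at $2\alpha\delta_{\max}$ for dissimilar ones, with $\alpha = 1$ and $\delta_{\max} = 3$. The function names, the sliding-window toy codes, and the ratio threshold of 0.5 are hypothetical choices made only for this example.

```python
def target_distance(delta, alpha=1, delta_max=3):
    """Target Hamming distance for a frame gap `delta` (assumed form of Eq. (1)).

    Similar images (delta < delta_max) get 2 * alpha * delta;
    dissimilar images all get the same distance 2 * alpha * delta_max.
    """
    return 2 * alpha * min(delta, delta_max)


def hamming(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def sliding_code(k, length=8, width=3):
    """Hypothetical toy code for frame k: a window of `width` ones starting at k.

    Shifting the window by one position changes exactly two bits, so these
    codes satisfy the similar hierarchy with alpha = 1 and delta_max = 3.
    """
    return [1 if k <= i < k + width else 0 for i in range(length)]


def match(query, database, max_ratio=0.5):
    """Single-image nearest neighbor search with a ratio test.

    Returns (index of best match, ratio best/second-best, accepted flag).
    Accepting when the ratio is at most `max_ratio` is an assumption here.
    """
    ranked = sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
    best, second = ranked[0], ranked[1]
    d_best = hamming(query, database[best])
    d_second = hamming(query, database[second])
    # Two exact matches are ambiguous, so treat that case as a failed test.
    ratio = d_best / d_second if d_second > 0 else 1.0
    return best, ratio, ratio <= max_ratio


if __name__ == "__main__":
    codes = [sliding_code(k) for k in range(6)]
    # Distances from frame 0 follow the hierarchy: [0, 2, 4, 6, 6, 6]
    print([hamming(codes[0], c) for c in codes])
    print([target_distance(gap) for gap in range(6)])
    # Query: the code of frame 2 with its last bit flipped (simulated noise).
    noisy_query = codes[2][:-1] + [1]
    print(match(noisy_query, codes))
```

With these toy codes, the distance grows by 2 per frame for gaps below $\delta_{\max}$ and stays at 6 afterwards, which matches the two conditions stated in Section III. Sweeping `max_ratio` is what would trace out a precision-recall curve in an evaluation of this kind.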