Talk to the Vehicle: Language Conditioned Autonomous Navigation of Self Driving Cars

Sriram N. N.¹, Tirth Maniar¹, Jayaganesh Kalyanasundaram¹, Vineet Gandhi¹, Brojeshwar Bhowmick², K Madhava Krishna¹

¹ International Institute of Information Technology, Hyderabad, India. ² TCS Research and Innovation Labs, Kolkata, India. *Equal contribution from the first two authors.

Abstract: We propose a novel pipeline that blends encodings from natural language and 3D semantic maps obtained from visual imagery to generate local trajectories that are executed by a low-level controller. The pipeline precludes the need for a prior registered map through a local waypoint generator neural network. The waypoint generator network (WGN) maps semantics and natural language encodings (NLE) to local waypoints. A local planner then generates a trajectory from the ego location of the vehicle (an outdoor car in this case) to these locally generated waypoints, while a low-level controller executes these plans faithfully. The efficacy of the pipeline is verified in the CARLA simulator environment as well as on local semantic maps built from the real-world KITTI dataset. In both these environments (simulated and real-world) we show the ability of the WGN to generate waypoints accurately by mapping NLEs of varying sequence lengths and levels of complexity. We compare with baseline approaches and show significant performance gains over them. Finally, we show real implementations on our electric car, verifying that the pipeline lends itself to practical and tangible realizations in uncontrolled outdoor settings. In-loop execution of the proposed pipeline, which involves repetitive invocations of the network, is critical for any such language-based navigation framework. This effort successfully accomplishes this, thereby bypassing the need for prior metric maps or strategies for metric-level localization during traversal.

I. INTRODUCTION

Consider a complicated road scenario in which a driverless car is lost as it struggles to localize itself in the map provided. If it were a chauffeur-driven car instead, one could give simple instructions like "take right from the traffic signal followed by second left" or "take first right at the crossroad" and manage the situation. If we could communicate with the car using natural language instructions, many such situations could be resolved with ease, and it would be a giant step towards the seamless integration of autonomous vehicles alongside humans.

It would also make autonomous navigation more efficient. For instance, traditional approaches to navigation have always relied on offline metric maps [23] generated from runs a priori. The map-based approach is consistent across a variety of applications, such as Collaborative Indoor Service Robots or CoBots [4], MOBOTs used in museums [18], and all the more so in outdoor autonomous driving scenarios, where companies have been spending extensively on getting detailed and up-to-date maps [7]. However, it can be argued that human navigation uses maps minimally and yet is able to reach locations without much ado.

Fig. 1. Language based Waypoint Generation: The above instruction is given as input to the NLE network. (a) and (b) depict the output of WGN with a trajectory towards the predicted waypoint. (a) shows a straight road without turns, where the network understands the scene and gives a straight waypoint. On the contrary, (b) shows an intersection where the network predicted a right waypoint for the same encoding from the NLE network.
For instance, given the instruction "keep following until you reach the end of the road and then take a right", the vehicle can navigate without the need for precise localization on the metric map at every instant. Hence, a natural language based instruction can augment the capability of the vehicle to work seamlessly even when localization is erroneous (due to GPS lag or errors) or precise metric maps are not available.

A fair amount of this stems from our ability to couple language with immediate semantic understanding to reach destinations accurately. For example, humans can seamlessly interpret destination commands of the form "Take the second left from here and a right at the third traffic signal and your destination would be on the right" and execute them by integrating language encodings with the local semantic structure. In this paper we propose a framework that captures this intuition of navigation driven/actuated by local semantics and language. While the semantics helps in decomposing the perceptual input into meaningful local representations, language helps in determining the intermediate and eventual goals over these semantic representations.

The framework consists of three modules. The first module is a natural language encoder (NLE), which takes as input the natural language instructions and translates them into high-level machine-readable codes/encodings. The second module is a Waypoint Generator Network (WGN), which combines the local semantic structure with the language encodings to predict the local waypoints (an example is illustrated in Fig. 1). The local semantic structure is defined by the occupancy map (obtained using depth maps) and the semantic segmentation maps (obtained using the RGB image) of the current perspective of the car. The third module is a local planner that generates an obstacle-avoiding trajectory to reach the locally generated waypoints and executes it by employing a low-level controller. The entire framework is summarized in Fig. 2 and sketched in code below.

The proposed framework is modular, scalable and works in real time. We believe that our approach works at the right granularity, in contrast to previous works which either avoid the navigation part altogether [26] or directly couple the local structure information with steering control [20], which would require exponentially more training data to reduce noise or to tackle corner cases [22]. The efficacy of the pipeline is verified through quantitative and qualitative results using the CARLA simulator environment as well as on local semantic maps built from the real-world KITTI dataset.
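As a reading aid, the dataflow between the three modules can be summarized with the following interface sketch. Every name and signature here is hypothetical, chosen only to mirror the description above; it is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

# Illustrative interfaces for the three modules of the pipeline.
# All names and signatures are hypothetical stand-ins.

@dataclass
class Waypoint:
    x: float  # metres, in the ego frame
    y: float

def nle_encode(instruction: str) -> List[str]:
    """NLE: natural language -> sequence of behavior encodings,
    e.g. 'take the second right' -> ['NR', 'R'] (Sec. III-A)."""
    raise NotImplementedError  # seq2seq LSTM with attention

def wgn_predict(semantic_map: np.ndarray, encoding: str) -> Waypoint:
    """WGN: 9-channel bird's-eye semantic occupancy map plus the
    currently active encoding -> one local waypoint (Sec. III-B)."""
    raise NotImplementedError

def plan_and_execute(semantic_map: np.ndarray, goal: Waypoint) -> None:
    """Local planner: obstacle-avoiding trajectory to the waypoint,
    tracked by a low-level controller."""
    raise NotImplementedError
```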
Formally, our work makes the following contributions:

• We propose a novel Language Conditioned Self Driving (LCSD) dataset, which consists of 140 ground truth trajectories from the CARLA simulator with the corresponding natural language instructions, for end-to-end testing of the language-based navigation task.

• A novel modular framework that provides accurate waypoints by integrating language encodings with a semantic map. As we show in the experimental section, by repeatedly invoking the WGN at the right granularity the agent (car) reaches its destination without the need for a prior map or accurate state estimation modules that depend on such maps. This stands in contrast with prior art wherein the language command is processed only once and a global trajectory is planned to the eventual destination [12]; the subsequent execution of such a trajectory is contingent on access to accurate maps and state-of-the-art localization modules.

• The efficacy of the pipeline is verified in realistic synthetic datasets such as CARLA [9] and the real-world KITTI dataset [10]. We also show successful experimental runs on our Self Driving Electric Car.

• We compare with two versions of baseline architectures for the WGN. The first version precludes semantics and the second includes semantics but integrates the NLE into a post-processing module. We show that the proposed framework outperforms both baselines by a significant margin. Most importantly, we show ablation studies that portray the robustness of the WGN to control noise that translates to significant perturbations in viewing angle, which becomes dominantly misaligned with the road direction.

II. RELATED WORK

Previous works [16], [5], [26], [25], [17] have looked at the problem of taking language instructions and converting them into a sequence of machine-readable high-level codes/encodings, where each code defines a different set of actions. These approaches pose the problem as analogous to machine translation. Earlier approaches [16], [5] employ statistical machine translation, while the recent ones frame it as an end-to-end encoder-decoder neural network architecture [17], [26], [25]. These approaches are limited to known metric maps [23] or topological maps [14] and also assume that robots can realize valid navigation based on the instructions. In our work, we employ a similar sequence-to-sequence machine translation strategy; however, we also couple it with the actual navigation. Additionally, our work does not rely on a known environment.

Another class of language-based navigation methods [20], [21] directly couples the steering control with the sensory input. The work by [20] builds upon the work by [26] by learning deep networks to imitate different navigation behaviors like entering the office, following the corridor, etc. A separate convolutional neural network is trained for navigating each behavior, which significantly increases the complexity of the problem. FollowNet [21] uses attention over language, conditioned by the depth and visual inputs, to focus on the relevant part of the instruction at a given time. We take a modular approach and predict local waypoints instead of directly coupling the steering control with the sensory input, which may require an exponentially large amount of training data for precise measurements [22].

Several datasets [1], [8], [6] for language based navigation have been proposed in the recent past. The work by [1] presents a Room-to-Room (R2R) dataset for visually-grounded natural language navigation in real buildings, using the Matterport3D Simulator. They pose navigation as a selection problem among several pre-defined discrete choices at each state, which dilutes the actual navigation component. Another recent work [6] proposes a dataset for language based navigation in Google Street Maps; however, their experiments focus only on visual grounding, i.e. identifying the point in the image that is referred to by the description. The work by Chen et al. [8] operates in the setting of tourist-guide communication, where the focus is on locating the tourist and giving him the right instructions to move towards the target.
Our work augments theirs, as it focuses on actually navigating the vehicle given the instructions.

Another line of work [11], [12] has looked into identifying goals and constraints (admissible and inadmissible states in a known environment model) from natural language instructions. The work by Howard et al. [11] uses a Distributed Correspondence Graph (DCG) to infer the most likely set of planning constraints. More recent work by Hu et al. [12] uses LSTM networks to classify individual phrases into goal, constraint, and uninformative phrases. A global plan is then created by grounding the goal in a known environment. Local cost maps are then derived considering the goal and the constraints, which are used to compute local collision-avoiding navigation of the robot. The dependency on accurate sentence/phrase parsing is a major weakness of these approaches.

Interestingly, most of the prior work has tackled the language based navigation problem only in indoor environments, which are either synthetically generated and/or known a priori. We solve the problem on a self driving car without prior knowledge of the map/environment. Furthermore, we present results on the CARLA simulator (synthetic environment), the real-world KITTI dataset, and an actual implementation on our electric car to thoroughly illustrate the efficacy of our approach.

Fig. 2. The overall pipeline of the proposed approach. The figure illustrates a predicted local waypoint (blue dot) by the WGN given sensory inputs conditioned by the language encoding. The language encoding currently points to L (in red), i.e. take the next available left. The WGN predicts a straight waypoint. The planning and execution module then predicts the trajectory to the local waypoint and executes it (the planned trajectory is shown with green dots).

III. METHOD

The overall pipeline of our approach is illustrated in Fig. 2. The NLE module translates the natural language instructions into a sequence of machine-readable encodings as shown in Table I. For instance, the instruction "Take the first left at the crossroad followed by the third right" is translated into L, NR, NR, R, i.e. take a left, skip a right (not right), skip a right (not right) and then take a right. The WGN module predicts the next local waypoint given the input occupancy map and the currently pointed output of the language encoding. For instance, the pointer stays at L until a left turn is executed, and the NLE pointer moves to NR once the turn is complete. Fig. 2 shows the WGN predicting a straight waypoint (blue dot) when the input from the NLE is L, as there is no left available. The planning and execution module takes as input the occupancy map and the generated waypoint to predict and execute an obstacle-avoiding trajectory towards the waypoint. The trajectory is planned using the RRT* algorithm [13]. When a small arc-length of the trajectory has been executed, a new waypoint is predicted and the trajectory is replanned; this in-loop behavior is sketched in code below. We now provide a detailed description of the NLE and WGN modules.
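A minimal sketch of the in-loop execution just described, reusing the hypothetical interfaces from Section I; `perceive` and `maneuver_completed` are likewise illustrative stubs, not the authors' code.

```python
import numpy as np

# Builds on the earlier interface sketch: nle_encode, wgn_predict and
# plan_and_execute are assumed to be defined there.

def perceive() -> np.ndarray:
    """Hypothetical sensing stub: returns the current 9-channel
    bird's-eye semantic occupancy map (Sec. III-B)."""
    raise NotImplementedError

def maneuver_completed(encoding: str) -> bool:
    """Hypothetical check that the maneuver demanded by `encoding`
    (e.g. the left turn for 'L') has been executed."""
    raise NotImplementedError

def navigate(instruction: str) -> None:
    """Closed-loop execution: the NLE runs once, while the WGN and the
    local planner are invoked repeatedly until every behavior encoding
    has been consumed."""
    encodings = nle_encode(instruction)   # e.g. ['L', 'NR', 'NR', 'R']
    pointer = 0                           # index of the active encoding
    while pointer < len(encodings):
        semantic_map = perceive()
        goal = wgn_predict(semantic_map, encodings[pointer])
        # RRT* trajectory to the goal; only a short arc-length is driven
        # before a fresh waypoint is predicted and the plan is recomputed.
        plan_and_execute(semantic_map, goal)
        if maneuver_completed(encodings[pointer]):
            pointer += 1                  # e.g. left done -> point to NR
```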
A. Natural Language Encoder

The NLE module predicts a sequence of encodings given the natural language instructions. The sequence of encodings steers the high-level behavior of the autonomous vehicle. We pose this as a machine translation problem and train an LSTM network with an attention mechanism [24], [15], [2] to solve it. We employ an encoder-decoder architecture similar to [24], using a two-layer LSTM with local attention. The input embeddings are of size 500 and the number of hidden units in each layer is set to 1024. The target vocabulary size is 8, corresponding to the eight behaviors considered in our work (illustrated in Table I). We use the framework by Rico et al. [19] for training the NLE module.

We train our network using a dataset of 20,000 natural language instructions and their corresponding encodings (15,000 sequences were used for training, 2,500 for validation and 2,500 for testing). The natural language instructions vary in length from 2 to 50 words, and the corresponding ground truth encoding sequences vary in length from 1 to 15. To account for the high variability in the way people describe routes, our dataset contains a variety of phrasings for the same sequence of behavioral encodings. For example, the instructions "skip the first right and take the next one", "you should take the second right, not the first one" and "keep moving straight on the crossroads and then take a right" all correspond to the sequence NR, R. Given the current output vocabulary size, our NLE module achieves near-perfect test accuracy, which suggests studying the NLE module with a richer output vocabulary. Such studies, focused on the natural language processing component, have been explored in previous work [25], which indicates that machine-to-machine translation frameworks scale conveniently to larger vocabularies in predicting indoor behavioral graphs. Some example predictions of our NLE module are illustrated in Table II.

TABLE I: EIGHT HIGH-LEVEL BEHAVIORS INCLUDED IN OUR WORK.

Encoding | Meaning
L | Take the first left
R | Take the first right
NL | Not left, i.e. go straight when a left is available
NR | Not right, i.e. go straight when a right is available
TL | Take the first left at the traffic signal
TR | Take the first right at the traffic signal
TNL | Not left at the traffic signal
TNR | Not right at the traffic signal

TABLE II: SOME EXAMPLES OF NATURAL LANGUAGE INSTRUCTIONS WITH THE PREDICTED ENCODINGS.

Instruction | Encoding
Follow the road until the junction and take right followed by a left in the next crossroad and keep going straight till you see the traffic signal and take a left at the signal. | R L TL
Please take the second left at the crossroad and then take right at the next intersection followed by a left. | NL L R L
Skip the first right and take the next one; post that, take a right at the traffic signal. | NR R TR
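For concreteness, a minimal PyTorch skeleton with the hyperparameters stated above might look as follows. The paper's local attention is omitted for brevity and the source vocabulary size is an assumed value, so this is an illustrative sketch rather than the authors' exact network.

```python
import torch
import torch.nn as nn

SRC_VOCAB = 10_000   # assumed source vocabulary size (not given in the paper)
TGT_VOCAB = 8        # L, R, NL, NR, TL, TR, TNL, TNR (Table I)
EMB, HID, LAYERS = 500, 1024, 2   # hyperparameters stated in the text

class NLE(nn.Module):
    """Two-layer LSTM encoder-decoder mapping a word sequence to a
    sequence of behavior encodings (attention omitted)."""
    def __init__(self) -> None:
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, LAYERS, batch_first=True)
        self.decoder = nn.LSTM(EMB, HID, LAYERS, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        _, state = self.encoder(self.src_emb(src))       # encode instruction
        dec, _ = self.decoder(self.tgt_emb(tgt), state)  # teacher forcing
        return self.out(dec)                             # behavior logits

model = NLE()
logits = model(torch.randint(0, SRC_VOCAB, (1, 12)),  # a 12-word instruction
               torch.randint(0, TGT_VOCAB, (1, 3)))   # 3 target encodings
assert logits.shape == (1, 3, TGT_VOCAB)
```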
B. Waypoint Generation Network

The Waypoint Generation Network is the second module in our framework; it predicts a probable goal point that the autonomous vehicle can use to traverse towards its intended direction. We propose two variants of the Waypoint Generation Network. The first is based on an Encoder-Decoder (E-D) architecture without the language encoding as input, where the network learns to forecast all plausible goal locations that the vehicle can take; we use this method as a baseline (Baseline 2). We also train the same network without semantics and use it as Baseline 1. The second variant is our proposed CNN+Dense (CNN-D) architecture, where the prediction is conditioned on the language encoding and only a single waypoint is predicted.

1) Inputs: Both variants of the Waypoint Generation Network take a semantically labeled occupancy map as input. The network takes in this semantic map O of 9 channels, where each O_i represents a binary activation map of a particular semantic label (road, buildings, etc.). First, a 3D semantically labelled point cloud is generated and projected onto a 2D grid in bird's-eye view to obtain a stack of mutually exclusive binary masks. We make use of 9 different semantic classes that are common in an autonomous driving setting, including unlabelled regions: Buildings, Road, Lanes, Vegetation, Vehicles, Traffic Lights, Poles and Pedestrians. Therefore, the input to the baseline WGN (E-D style) is given by I_ED = O. The second variant of Wa
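The bird's-eye projection just described could be sketched as follows. Grid size, resolution, and frame conventions are assumed values for illustration, not taken from the paper.

```python
import numpy as np

# Sketch of the 9-channel bird's-eye-view input: a semantically labelled
# point cloud projected onto a 2D grid, one binary mask per class.

CLASSES = ["unlabelled", "building", "road", "lane", "vegetation",
           "vehicle", "traffic_light", "pole", "pedestrian"]
GRID, RES = 256, 0.2  # 256x256 cells at 0.2 m/cell (assumed values)

def bev_semantic_masks(points: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """points: (N, 3) xyz in the ego frame (x forward, y left);
    labels: (N,) indices into CLASSES.
    Returns a (9, GRID, GRID) stack of binary masks."""
    masks = np.zeros((len(CLASSES), GRID, GRID), dtype=np.uint8)
    # Shift so the ego vehicle sits at the bottom centre of the grid.
    col = (points[:, 1] / RES + GRID / 2).astype(int)
    row = (GRID - 1 - points[:, 0] / RES).astype(int)
    keep = (row >= 0) & (row < GRID) & (col >= 0) & (col < GRID)
    masks[labels[keep], row[keep], col[keep]] = 1
    return masks

# Example: 1000 random labelled points -> 9-channel occupancy input
pts = np.random.uniform(0, 40, (1000, 3))
lbl = np.random.randint(0, 9, 1000)
assert bev_semantic_masks(pts, lbl).shape == (9, GRID, GRID)
```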
