Multiple Futures Prediction

Yichuan Tang    Ruslan Salakhutdinov

Abstract

Temporal prediction is critical for making intelligent and robust decisions in complex dynamic environments. Motion prediction needs to model the inherently uncertain future, which often contains multiple potential outcomes due to multi-agent interactions and the latent goals of others. Towards these goals, we introduce a probabilistic framework that efficiently learns latent variables to jointly model the multi-step future motions of agents in a scene. Our framework is data-driven and learns semantically meaningful latent variables to represent the multimodal future, without requiring explicit labels. Using a dynamic attention-based state encoder, we learn to encode the past as well as the future interactions among agents, efficiently scaling to any number of agents. Finally, our model can be used for planning via computing a conditional probability density over the trajectories of other agents given a hypothetical rollout of the self agent. We demonstrate our algorithms by predicting vehicle trajectories on both simulated and real data, achieving state-of-the-art results on several vehicle trajectory datasets.

1 Introduction

The ability to make good predictions lies at the heart of robust and safe decision making. It is especially critical to be able to predict the future motions of all relevant agents in complex and dynamic environments. For example, in the autonomous driving domain, motion prediction is central both to the ability to make high level decisions, such as when to perform maneuvers, as well as to low level path planning optimizations [34, 28].
Motion prediction is a challenging problem due to the various needs of a good predictive model. The varying objectives, goals, and behavioral characteristics of different agents can lead to multiple possible futures or modes. Agents' states do not evolve independently from one another; rather, they interact with each other. As an illustration, we provide some examples in Fig. 1. In Fig. 1(a), there are a few different possible futures for the blue vehicle approaching an intersection: it can either turn left, go straight, or turn right, forming different modes in trajectory space. In Fig. 1(b), interactions between the two vehicles during a merge scenario show that their trajectories influence each other, depending on who yields to whom. Besides multimodal interactions, prediction needs to scale efficiently with an arbitrary number of agents in a scene and take into account auxiliary and contextual information, such as map and road information. Additionally, the ability to measure uncertainty by computing the probability over likely future trajectories of all agents in closed form (as opposed to Monte Carlo sampling) is of practical importance.

Figure 1: Examples illustrating the need for multimodal interactive predictions. (a) Multiple possible future trajectories: there are a few possible modes for the blue vehicle. (b) Scenario A: green yields to blue. (c) Scenario B: blue yields to green. (b and c) Time-lapsed visualization of how interactions between agents influence each other's trajectories.

Despite a large body of work in temporal motion prediction [24, 7, 13, 26, 16, 2, 30, 8, 39], existing state-of-the-art methods often only capture a subset of the aforementioned features. For example, algorithms are either deterministic, not multimodal, or do not fully capture both past and future interactions. Multimodal techniques often require the explicit labeling of modes prior to training. Models which perform joint prediction often assume the number of agents present to be fixed [36, 31].

We tackle these challenges by proposing a unifying framework that captures all of the desirable features mentioned earlier. Our framework, which we call the Multiple Futures Predictor (MFP), is
a sequential probabilistic latent variable generative model that learns directly from multi-agent trajectory data. Training maximizes a variational lower bound on the log-likelihood of the data. MFP learns to model multimodal interactive futures jointly for all agents, while using a novel factorization technique to remain scalable to an arbitrary number of agents. After training, MFP can compute both unconditional and conditional trajectory probabilities in closed form, not requiring any Monte Carlo sampling.

MFP builds on the Seq2Seq [32] encoder-decoder framework by introducing latent variables and using a set of parallel RNNs (with shared weights) to represent the set of agents in a scene. Each RNN takes on the point-of-view of its agent and aggregates historical information for sequential temporal prediction for that agent. Discrete latent variables, one per RNN, automatically learn semantically meaningful modes to capture multimodality without explicit labeling. MFP can further be efficiently and jointly trained end-to-end for all agents in the scene. To summarize, we make the following contributions with the proposed MFP. First, semantically meaningful latent variables are automatically learned from trajectory data without labels; this addresses the multimodality problem. Second, interactive and parallel step-wise rollouts are performed for all agents in the scene; this addresses the modeling of interactions between actors during future prediction, see Sec. 3.1. We further propose a dynamic attentional encoding which captures both the relationships between agents and the scene context, see Sec. 3.1. Finally, MFP is capable of performing hypothetical inference: evaluating the conditional probability of agents' trajectories conditioned on fixing one or more agents' trajectories, see Sec. 3.2.

2 Related Work
The problem of predicting future motion for dynamic agents has been well studied in the literature. The bulk of classical methods focus on using physics-based dynamic or kinematic models [38, 21, 25]. These approaches include Kalman filters and maneuver-based methods, which compute the future motion of agents by propagating their current state forward in time. While these methods perform well for short time horizons, longer horizons suffer due to the lack of interaction and context modeling.

The success of machine learning and deep learning ushered in a variety of data-driven recurrent neural network (RNN) based methods [24, 7, 13, 26, 16, 2]. These models often combine RNN variants, such as LSTMs or GRUs, with encoder-decoder architectures such as conditional variational autoencoders (CVAEs). These methods eschew physics-based dynamic models in favor of learning generic sequential predictors (e.g. RNNs) directly from data. Converting raw input data to input features can also be learned, often by encoding rasterized inputs with convolutional networks [7, 13].

Methods that can learn multiple future modes have been proposed in [16, 24, 13]. However, [16] explicitly labels six maneuvers/modes and learns to separately classify these modes. [24, 13] do not require mode labeling, but they also do not train in an end-to-end fashion by maximizing the data log-likelihood of the model. Most of the methods in the literature encode the past interactions of agents in a scene; however, prediction is often an independent rollout of a decoder RNN, independent of other future predicted trajectories [16, 29]. Encoding of spatial relationships is often done by placing other agents in a fixed and spatially discretized grid [16, 24].

Figure 2: Graphical model and computation graph of the MFP. (a) Graphical model of the MFP. Solid nodes denote observed variables; cross-agent interaction edges are shaded for clarity. (b) Architecture of the proposed MFP. The circular "world" contains the world state and positions of all agents; x_t denotes both the state and contextual information from timesteps 1 to t. Diamond nodes are deterministic while the circular z^n are discrete latent random variables. See text for details. Best viewed in color.

In contrast, MFP proposes a unifying framework which exhibits the aforementioned features. To summarize, we present a feature comparison of MFP with some of the recent methods in the supplementary materials.

3 Multiple Futures Prediction
We tackle motion prediction by formulating a probabilistic framework of a continuous-space but discrete-time system with a finite (but variable) number of N interacting agents. We represent the joint state of all N agents at time t as $X_t = \{x_t^1, x_t^2, \dots, x_t^N\} \in \mathbb{R}^{N \times d}$, where d is the dimensionality of each state¹ and $x_t^n \in \mathbb{R}^d$ is the state of the n-th agent at time t. With a slight abuse of notation, we use the superscripted $X^n = \{x_{t-\tau}^n, x_{t-\tau+1}^n, \dots, x_t^n\}$ to denote the past states of the n-th agent and $X \doteq X_{t-\tau:t}^{1:N}$ to denote the joint agent states from time $t-\tau$ to $t$, where $\tau$ is the number of past history steps. The future state of all agents at a future timestep $t'$ is denoted by $Y_{t'} = \{y_{t'}^1, y_{t'}^2, \dots, y_{t'}^N\}$, and the future trajectory of agent n, from time t to time T, is denoted by $Y^n = \{y_{t+1}^n, y_{t+2}^n, \dots, y_T^n\}$. $Y \doteq Y_{t:t+T}^{1:N}$ denotes the joint state of all agents for the future timesteps. Contextual scene information, e.g. a rasterized image $\mathbb{R}^{h \times w \times 3}$ of the map, could be useful by providing important cues. We use $\mathcal{I}_t$ to represent any contextual information at time t.

¹We assume states are fully observable and are agents' (x, y) coordinates on the ground plane (d = 2).
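To make the notation concrete, the following is a minimal sketch (our own, not from the paper; the array layout, NumPy, and the illustrative sizes are assumptions) of how the past states X, the future states Y, and the context I_t might be stored for N agents with d = 2:

```python
import numpy as np

N, d = 3, 2           # three agents, (x, y) ground-plane coordinates (d = 2)
tau, T = 10, 25       # past history steps and future horizon (illustrative values)

# X: joint past states X_{t-tau:t}^{1:N}, one (x, y) row per agent per past timestep.
X = np.zeros((N, tau + 1, d))

# Y: joint future states the model must predict, Y^{1:N} over T future timesteps.
Y = np.zeros((N, T, d))

# I_t: rasterized map context, e.g. an h x w x 3 image around the scene.
h, w = 128, 128
I_t = np.zeros((h, w, 3))

# x_t^n, the current state of agent n, is simply the last entry of its past track.
x_t_0 = X[0, -1]      # current state of agent n = 0
```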
The goal of motion prediction is then to accurately model $p(Y|X, \mathcal{I}_t)$. As in most sequential modeling tasks, it is both inefficient and intractable to model $p(Y|X, \mathcal{I}_t)$ jointly. RNNs are typically employed to sequentially model the distribution in a cascade form. However, there are two major challenges specific to our multi-agent prediction framework. (1) Multimodality: optimizing vanilla RNNs via backpropagation through time will lead to mode-averaging, since the map from X to Y is not a function but rather a one-to-many mapping. In other words, multimodality means that for a given X there could be multiple distinctive modes that each carry significant probability over different sequences of Y. (2) Variable-Agents: the number of agents N is variable and unknown, and therefore we cannot simply vectorize $X_t$ as the input to a standard RNN at time t.

For multimodality, we introduce a set of stochastic latent variables $z^n \sim \text{Multinoulli}(K)$, one per agent, where $z^n$ can take on K discrete values. The intuition here is that $z^n$ would learn to represent intentions (left/right/straight) and/or behavior modes (aggressive/conservative). Learning maximizes the marginalized distribution, where z is free to learn any latent behavior so long as it helps to improve the data log-likelihood. Each z is conditioned on X at the current time (before future prediction) and will influence the distribution over future states Y. A key feature of the MFP is that $z^n$ is only sampled once at time t and must stay consistent for the next T time steps. Compared to sampling $z^n$ at every timestep, this leads to better tractability and more realistic intention/goal modeling, as we will discuss in more detail later.
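As a minimal illustration of this "sample once, hold fixed" behaviour (our own sketch; `p_z_given_x` stands in for the per-agent mode distribution an encoder would produce), each agent draws a single mode index before the rollout begins and reuses it at every future step:

```python
import numpy as np

K, N, T = 10, 3, 25                     # modes, agents, future horizon (illustrative)
rng = np.random.default_rng(0)

# Hypothetical per-agent mode distributions p(z^n | X, I), one row per agent.
p_z_given_x = rng.dirichlet(np.ones(K), size=N)

# Sample z^n exactly once per agent, at time t ...
z = np.array([rng.choice(K, p=p_z_given_x[n]) for n in range(N)])

# ... and keep it fixed for the entire T-step rollout.
for step in range(T):
    for n in range(N):
        mode = z[n]   # the same intention/behaviour mode at every future timestep
        # the decoder step for agent n would condition on `mode` here
```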
We now arrive at the following distribution:

$$\log p(Y|X, \mathcal{I}) = \log \sum_Z p(Y, Z|X, \mathcal{I}) = \log \sum_Z p(Y|Z, X, \mathcal{I})\, p(Z|X, \mathcal{I}), \qquad (1)$$

where Z denotes the joint latent variables of all agents. Naively optimizing Eq. 1 is prohibitively expensive and not scalable as the number of agents and timesteps may become large. In addition, the maximum number of possible modes is exponential: $O(K^N)$. We first make the model more tractable by factorizing across time, followed by a factorization across agents. The joint future distribution then assumes the form of a product of conditional distributions:

$$p(Y|Z, X, \mathcal{I}) = \prod_{t'=t+1}^{T} p(Y_{t'} \mid Y_{t:t'-1}, Z, X, \mathcal{I}), \qquad (2)$$

$$p(Y_{t'} \mid Y_{t:t'-1}, Z, X, \mathcal{I}) = \prod_{n=1}^{N} p(y_{t'}^{n} \mid Y_{t:t'-1}, z^n, X, \mathcal{I}). \qquad (3)$$

The second factorization is sensible as the factorial component conditions on the joint states of all agents in the immediately preceding timestep, where the typical temporal delta is very short (e.g. 100 ms). Also note that the future distribution of the n-th agent is explicitly dependent on its own mode $z^n$ but implicitly dependent on the latent modes of other agents, by re-encoding the other agents' predicted states $y^m$ (please see the discussion later and also Sec. 3.1). Explicitly conditioning only on an agent's own latent modes is both more scalable computationally and more realistic: agents in the real world can only infer other agents' latent goals/intentions by observing their states. Finally, our overall objective from Eq. 1 can be written as:

$$\log \sum_Z p(Y|Z, X, \mathcal{I})\, p(Z|X, \mathcal{I}) = \log \sum_Z \prod_{t'=t+1}^{T} \prod_{n=1}^{N} p(y_{t'}^{n} \mid Y_{t:t'-1}, z^n, X, \mathcal{I})\, p(z^n|X, \mathcal{I}) \qquad (4)$$

$$= \log \sum_Z \prod_{n=1}^{N} \Big[ p(z^n|X, \mathcal{I}) \prod_{t'=t+1}^{T} p(y_{t'}^{n} \mid Y_{t:t'-1}, z^n, X, \mathcal{I}) \Big] \qquad (5)$$
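Because the ground-truth future Y is observed during training, the sum over the joint Z in Eq. 5 factorizes into an independent K-way sum per agent, so the marginal log-likelihood can be evaluated exactly without any sampling. The sketch below is our own illustration of that computation (it assumes the per-mode, per-step log densities and the mode priors have already been evaluated into arrays):

```python
import numpy as np
from scipy.special import logsumexp

N, K, T = 3, 10, 25                                       # agents, modes, future steps
rng = np.random.default_rng(0)

# Assumed precomputed quantities:
log_p_z = np.log(rng.dirichlet(np.ones(K), size=N))       # log p(z^n = k | X, I), shape (N, K)
log_p_y = rng.normal(size=(N, K, T))                      # per-step log p(y^n | ..., z^n = k), shape (N, K, T)

# Eq. 5: per agent, sum the step log-densities over time, add log p(z^n = k),
# marginalize the K modes in log space, then sum the agents' log-marginals.
per_agent = logsumexp(log_p_z + log_p_y.sum(axis=2), axis=1)   # shape (N,)
log_likelihood = per_agent.sum()
print(log_likelihood)
```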
The graphical model of the MFP is illustrated in Fig. 2a. While we show only three agents for simplicity, MFP can easily scale to any number of agents. Nonlinear interactions among agents make $p(y_{t'}^{n} \mid Y_{t:t'-1}, X, \mathcal{I})$ complicated to model. The class of recurrent neural networks are powerful and flexible models that can efficiently capture and represent long-term dependencies in sequential data. At a high level, RNNs introduce deterministic hidden units $h_t$ at every timestep t, which act as features or embeddings that summarize all of the observations up until time t. At time step t, an RNN takes as its input the observation $x_t$ and the previous hidden representation $h_{t-1}$, and computes the update $h_t = f_{rnn}(x_t, h_{t-1})$. The prediction $\hat{y}_t$ is computed from the decoding layer of the RNN: $\hat{y}_t = f_{dec}(h_t)$. $f_{rnn}$ and $f_{dec}$ are recursively applied at every timestep of the sequence.
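The recurrence $h_t = f_{rnn}(x_t, h_{t-1})$ followed by $\hat{y}_t = f_{dec}(h_t)$ maps directly onto a standard recurrent cell plus a linear decoding layer. Below is a minimal PyTorch sketch (our own, not the paper's implementation; sizes are illustrative):

```python
import torch
import torch.nn as nn

d, hidden = 2, 64                                       # state dim and hidden size (illustrative)
f_rnn = nn.GRUCell(input_size=d, hidden_size=hidden)    # h_t = f_rnn(x_t, h_{t-1})
f_dec = nn.Linear(hidden, d)                            # y_hat_t = f_dec(h_t)

x_seq = torch.zeros(10, 1, d)                           # a length-10 observation sequence (batch of 1)
h = torch.zeros(1, hidden)                              # initial hidden state

predictions = []
for x_t in x_seq:                                       # recursively apply f_rnn and f_dec
    h = f_rnn(x_t, h)
    predictions.append(f_dec(h))
```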
Fig. 2b shows the computation graph of the MFP. A point-of-view (PoV) transformation $\Phi_n(X_t)$ is first used to transform the past states into each agent's own reference frame, by translation and rotation such that the +x-axis aligns with the agent's heading. We then instantiate an encoding and a decoding RNN² per agent. Each encoding RNN is responsible for encoding the past observations $x_{t-\tau:t}$ into a feature vector. Scene context is transformed via a convolutional neural network into its own feature. The features are combined via a dynamic attention encoder, detailed in Sec. 3.1, to provide inputs both to the latent variables as well as to the ensuing decoding RNNs. During predictive rollouts, the decoding RNN predicts its own agent's state at every timestep. The predictions are aggregated and subsequently transformed via $\Phi_n(\cdot)$, providing inputs to every agent/RNN for the next timestep.
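As an illustration of the point-of-view transformation $\Phi_n$, the sketch below re-expresses a set of (x, y) states in agent n's reference frame. The exact form is our own assumption; the paper only states that the transformation is a translation plus a rotation aligning the +x-axis with the agent's heading:

```python
import numpy as np

def pov_transform(states_xy, ego_position, ego_heading):
    """Translate and rotate (x, y) states into an agent's own reference frame
    so that the +x-axis points along the agent's heading."""
    c, s = np.cos(-ego_heading), np.sin(-ego_heading)
    rotation = np.array([[c, -s],
                         [s,  c]])
    return (states_xy - ego_position) @ rotation.T

# Example: an agent at (5, 2) heading 90 degrees; a point 3 m ahead of it
# maps to (3, 0) in the agent's own frame.
print(pov_transform(np.array([[5.0, 5.0]]), np.array([5.0, 2.0]), np.pi / 2))
```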
Latent variables Z provide extra inputs to the decoding RNNs to enable multimodality. Finally, the output $y_t^n$ consists of a 5-dimensional vector governing a bivariate Normal distribution: $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and the correlation coefficient $\rho$.
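A common way to turn such an unconstrained 5-dimensional head output into valid bivariate-Normal parameters is to exponentiate the standard deviations and squash the correlation with a tanh; the exact activations here are our own assumption, since the paper does not spell them out. The sketch evaluates the resulting log-density of an observed position:

```python
import numpy as np

def bivariate_normal_logpdf(raw5, target_xy):
    """raw5: unconstrained 5-dim decoder output -> (mu_x, mu_y, sigma_x, sigma_y, rho)."""
    mu = raw5[:2]
    sigma = np.exp(raw5[2:4])                # positive standard deviations
    rho = np.tanh(raw5[4])                   # correlation coefficient in (-1, 1)
    dx, dy = (target_xy - mu) / sigma
    one_minus_rho2 = 1.0 - rho ** 2
    quad = (dx ** 2 - 2.0 * rho * dx * dy + dy ** 2) / one_minus_rho2
    log_norm = np.log(2.0 * np.pi * sigma[0] * sigma[1] * np.sqrt(one_minus_rho2))
    return -0.5 * quad - log_norm

print(bivariate_normal_logpdf(np.zeros(5), np.array([0.1, -0.2])))
```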
While we instantiate two RNNs per agent, these RNNs share the same parameters across agents, which means we can efficiently perform joint predictions by combining inputs in a minibatch, allowing us to scale to an arbitrary number of agents. Making Z discrete and having only one set of latent variables influence subsequent predictions is also a deliberate choice. We would like Z to model modes generated by high level intentions, such as left/right lane changes or conservative/aggressive modes of agent behavior. These latent behavior modes also tend to stay consistent over the time horizons typical of motion prediction (e.g. 5 seconds).

²We use GRUs [10]. LSTMs and GRUs perform similarly, but GRUs were slightly faster computationally.

Learning

Given a set of training trajectory data $\mathcal{D} = \{(X^{(i)}, Y^{(i)})\}_{i=1,2,\dots,|\mathcal{D}|}$, we optimize using maximum likelihood estimation (MLE) to estimate the parameters $\theta^* = \arg\max_\theta \mathcal{L}(\theta, \mathcal{D})$ that achieve the maximum marginal data log-likelihood:
$$\mathcal{L}(\theta, \mathcal{D}) = \log p(Y|X; \theta) = \log \sum_Z p(Y, Z|X; \theta) = \log \sum_Z p(Z|Y, X; \theta)\, \frac{p(Y, Z|X; \theta)}{p(Z|Y, X; \theta)} \qquad (6)$$

Optimizing Eq. 6 directly is non-trivial as the posterior distribution is not only hard to compute, but also varies with $\theta$. We can however decompose the log-likelihood into the sum of the evidence lower bound (ELBO) and the KL-divergence between the true posterior and an approximating posterior $q(Z)$ [27]:

$$\log p(Y|X; \theta) = \sum_Z q(Z|Y, X) \log \frac{p(Y, Z|X; \theta)}{q(Z|Y, X)} + D_{KL}(q \,\|\, p) \geq \sum_Z q(Z|Y, X) \log p(Y, Z|X; \theta) + \mathcal{H}(q), \qquad (7)$$

where Jensen's inequality is used to arrive at the lower bound, $\mathcal{H}$ is the entropy function and $D_{KL}(q\|p)$ is the KL-divergence between the true and approximating posteriors. We learn by maximizing the variational lower bound on the data log-likelihood, by first using the true posterior at the current $\theta'$ as the approximating posterior: $q(Z|Y, X) \doteq p(Z|Y, X; \theta')$. We can then fix the approximate posterior and optimize the model parameters for the following function:

$$Q(\theta, \theta') = \sum_Z p(Z|Y, X; \theta') \log p(Y, Z|X; \theta) = \sum_Z p(Z|Y, X; \theta') \big[ \log p(Y|Z, X; \theta_{rnn}) + \log p(Z|X; \theta_Z) \big], \qquad (8)$$

where $\theta = \{\theta_{rnn}, \theta_Z\}$ denotes the parameters of the RNNs and the parameters of the network layers for predicting Z, respectively. As our latent variables Z are discrete and have small cardinality (e.g. 10), we can compute the posterior exactly for a given $\theta'$. The RNN parameter gradients are computed from $\partial Q(\theta, \theta')/\partial \theta_{rnn}$, and the gradient for $\theta_Z$ is $\partial\, KL\big(p(Z|Y, X; \theta')\,\|\, p(Z|X; \theta_Z)\big)/\partial \theta_Z$.

Our learning algorithm is a form of the EM algorithm [14], where for the M-step we optimize the RNN parameters using stochastic gradient descent. By integrating out the latent variable Z, MFP learns directly from trajectory data, without requiring any annotations or weak supervision for latent modes.
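Because Z is discrete with small cardinality, the E-step posterior $p(z^n = k | Y, X; \theta')$ is just a K-way normalization per agent, and the M-step objective $Q(\theta, \theta')$ is a posterior-weighted complete-data log-likelihood. The per-agent sketch below is our own illustration; `log_p_z` and `log_p_y_given_z` stand in for the quantities a forward pass of the networks would produce:

```python
import numpy as np
from scipy.special import logsumexp

K = 10                                          # number of discrete modes
rng = np.random.default_rng(0)

# Quantities a forward pass would produce for one agent (assumed here):
log_p_z = np.log(rng.dirichlet(np.ones(K)))     # log p(z = k | X, I; theta')
log_p_y_given_z = rng.normal(size=K)            # log p(Y | z = k, X, I; theta')

# E-step: exact posterior over the K modes, p(z = k | Y, X; theta').
log_joint = log_p_z + log_p_y_given_z
posterior = np.exp(log_joint - logsumexp(log_joint))

# M-step objective Q(theta, theta') from Eq. 8: posterior-weighted log-likelihood.
# Gradients w.r.t. the RNN parameters flow through log_p_y_given_z,
# and the prior parameters are updated toward the (fixed) posterior.
Q = np.sum(posterior * (log_p_y_given_z + log_p_z))
print(posterior.round(3), Q)
```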
We provide a detailed training algorithm pseudocode in the supplementary materials.

Classmates-forcing

Teacher forcing is a standard technique (albeit biased) to accelerate RNN and sequence-to-sequence training by using