Fully Parameterized Quantile Function for Distributional Reinforcement Learning

Derek Yang (UC San Diego), Li Zhao (Microsoft Research), Zichuan Lin (Tsinghua University), Tao Qin (Microsoft Research), Jiang Bian (Microsoft Research), Tie-Yan Liu (Microsoft Research)

* Contributed during internship at Microsoft Research.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Distributional Reinforcement Learning (RL) differs from traditional RL in that, rather than the expectation of total returns, it estimates distributions of returns, and it has achieved state-of-the-art performance on Atari games. The key challenge in practical distributional RL algorithms lies in how to parameterize the estimated distributions so as to better approximate the true continuous distribution. Existing distributional RL algorithms parameterize either the probability side or the return-value side of the distribution function, leaving the other side uniformly fixed, as in C51 and QR-DQN, or randomly sampled, as in IQN. In this paper, we propose a fully parameterized quantile function that parameterizes both the quantile-fraction axis (i.e., the x-axis) and the value axis (i.e., the y-axis) for distributional RL. Our algorithm contains a fraction proposal network that generates a discrete set of quantile fractions and a quantile value network that gives the corresponding quantile values. The two networks are jointly trained to find the best approximation of the true distribution. Experiments on 55 Atari games show that our algorithm significantly outperforms existing distributional RL algorithms and creates a new record on the Atari Learning Environment for non-distributed agents.

1 Introduction

Distributional reinforcement learning [Jaquette et al., 1973; Sobel, 1982; White, 1988; Morimura et al., 2010; Bellemare et al., 2017] differs from value-based reinforcement learning in that, instead of focusing only on the expectation of the return, it also takes the intrinsic randomness of returns into consideration [Bellemare et al., 2017; Dabney et al., 2018b,a; Rowland et al., 2018]. The randomness comes from both the environment itself and the agent's policy. Distributional RL algorithms characterize the total return as a random variable and estimate the distribution of that random variable, while traditional Q-learning algorithms estimate only its mean (i.e., the traditional value function).

The main challenge of distributional RL algorithms is how to parameterize and approximate the distribution. In Categorical DQN (C51) [Bellemare et al., 2017], the possible returns are limited to a discrete set of fixed values, and the probability of each value is learned through interacting with environments. C51 outperforms all previous variants of DQN on a set of 57 Atari 2600 games in the Arcade Learning Environment (ALE) [Bellemare et al., 2013]. Another approach to distributional reinforcement learning is to estimate the quantile values instead. Dabney et al. [2018b] proposed QR-DQN, which computes the return quantiles at fixed, uniform quantile fractions using quantile regression and minimizes the quantile Huber loss [Huber, 1964] between the Bellman-updated distribution and the current return distribution. Unlike C51, QR-DQN imposes no restriction or bound on the return values and achieves significant improvements over C51.
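For reference, the following is a minimal PyTorch sketch of the quantile Huber loss mentioned above; the explicit form used in this paper is given in Section 3.3, and the function name, tensor shapes, and aggregation convention here are illustrative assumptions rather than the authors' implementation.

```python
import torch

def quantile_huber_loss(delta: torch.Tensor, tau: torch.Tensor, kappa: float = 1.0):
    """Elementwise quantile Huber loss rho^kappa_tau(delta).

    delta: TD errors between target and current quantile estimates.
    tau:   quantile fractions the estimates are regressed toward (broadcastable).
    Aggregation over quantile pairs is left to the caller.
    """
    abs_delta = delta.abs()
    # Huber part L_kappa(delta): quadratic near zero, linear in the tails.
    huber = torch.where(abs_delta <= kappa,
                        0.5 * delta ** 2,
                        kappa * (abs_delta - 0.5 * kappa))
    # Asymmetric weight |tau - 1{delta < 0}| turns the regression into quantile regression.
    weight = (tau - (delta < 0).float()).abs()
    return weight * huber / kappa
```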
However, both C51 and QR-DQN approximate the distribution function or quantile function at fixed locations, either in value or in probability. Dabney et al. [2018a] propose learning the quantile values for sampled quantile fractions rather than fixed ones, with an implicit quantile value network (IQN) that maps from quantile fractions to quantile values. With sufficient network capacity and an infinite number of quantiles, IQN is able to approximate the full quantile function. However, it is impossible to have infinitely many quantiles in practice. With a limited number of quantile fractions, the efficiency and effectiveness of the samples must be reconsidered. The sampling method in IQN mainly helps train the implicit quantile value network rather than approximate the full quantile function, and thus there is no guarantee that sampled probabilities provide a better quantile function approximation than fixed probabilities.

In this work, we extend the methods of Dabney et al. [2018b] and Dabney et al. [2018a] and propose to fully parameterize the quantile function. By full parameterization, we mean that unlike QR-DQN and IQN, where quantile fractions are fixed or sampled and only the corresponding quantile values are parameterized, both the quantile fractions and the corresponding quantile values in our algorithm are parameterized. In addition to a quantile value network similar to IQN that maps quantile fractions to corresponding quantile values, we propose a fraction proposal network that generates quantile fractions for each state-action pair. The fraction proposal network is trained so that, as the true distribution is approximated, the 1-Wasserstein distance between the approximated distribution and the true distribution is minimized. Given the fractions generated by the fraction proposal network, we learn the quantile value network by quantile regression. With self-adjusting fractions, we can approximate the true distribution better than with fixed or sampled fractions.

We begin with related work and background on distributional RL in Section 2. We describe our algorithm in Section 3 and provide experimental results on the ALE environment [Bellemare et al., 2013] in Section 4. Finally, we discuss future extensions of our work and conclude in Section 5.

2 Background and Related Work

We consider the standard reinforcement learning setting where agent-environment interactions are modeled as a Markov Decision Process $(\mathcal{X}, \mathcal{A}, R, P, \gamma)$ [Puterman, 1994], where $\mathcal{X}$ and $\mathcal{A}$ denote the state and action spaces, $P$ denotes the transition probability given state and action, $R$ denotes the state- and action-dependent reward function, and $\gamma \in (0,1)$ denotes the reward discount factor.

For a policy $\pi$, define the discounted return sum as a random variable $Z^{\pi}(x,a) = \sum_{t=0}^{\infty} \gamma^{t} R(x_t, a_t)$, where $x_0 = x$, $a_0 = a$, $x_t \sim P(\cdot \mid x_{t-1}, a_{t-1})$ and $a_t \sim \pi(\cdot \mid x_t)$. The objective in reinforcement learning can be summarized as finding the optimal $\pi^{*}$ that maximizes the expectation of $Z^{\pi}$, i.e., the action-value function $Q^{\pi}(x,a) = \mathbb{E}\left[Z^{\pi}(x,a)\right]$. The most common approach is to find the unique fixed point of the Bellman optimality operator $\mathcal{T}$ [Bellman, 1957]:
$$Q^{*}(x,a) = \mathcal{T} Q^{*}(x,a) := \mathbb{E}\left[R(x,a)\right] + \gamma \, \mathbb{E}_{P}\!\left[\max_{a'} Q^{*}(x', a')\right].$$
To update $Q$, which is approximated by a neural network in most deep reinforcement learning studies, Q-learning [Watkins, 1989] iteratively trains the network by minimizing the squared temporal difference (TD) error
$$\delta_t^{2} = \left( r_t + \gamma \max_{a' \in \mathcal{A}} Q(x_{t+1}, a') - Q(x_t, a_t) \right)^{2}$$
along the trajectory observed while the agent interacts with the environment following an $\epsilon$-greedy policy.
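As a concrete illustration of this update (not part of the original paper), the sketch below computes the squared TD error for a single transition with a tabular Q-function in NumPy; the toy state and action counts are assumptions.

```python
import numpy as np

def squared_td_error(q_table, x_t, a_t, r_t, x_next, gamma=0.99):
    """Squared one-step TD error: (r_t + gamma * max_a' Q(x', a') - Q(x_t, a_t))^2."""
    td_target = r_t + gamma * np.max(q_table[x_next])  # Bellman target
    td_error = td_target - q_table[x_t, a_t]
    return td_error ** 2

# Toy example: 5 states, 2 actions, a single observed transition.
q = np.zeros((5, 2))
loss = squared_td_error(q, x_t=0, a_t=1, r_t=1.0, x_next=3)
```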
DQN [Mnih et al., 2015] uses a convolutional neural network to represent $Q$ and achieves human-level play on the Atari-57 benchmark.

2.1 Distributional RL

Instead of a scalar $Q^{\pi}(x,a)$, distributional RL looks into the intrinsic randomness of $Z^{\pi}$ by studying its distribution. The distributional Bellman operator for policy evaluation is
$$Z^{\pi}(x,a) \stackrel{D}{=} R(x,a) + \gamma Z^{\pi}(X', A'),$$
where $X' \sim P(\cdot \mid x, a)$ and $A' \sim \pi(\cdot \mid X')$, and $A \stackrel{D}{=} B$ denotes that the random variables $A$ and $B$ follow the same distribution.

Both theory and algorithms have been established for distributional RL. In theory, the distributional Bellman operator for policy evaluation is proved to be a contraction in the $p$-Wasserstein distance [Bellemare et al., 2017]. Bellemare et al. [2017] show that C51 outperforms value-based RL; in addition, Hessel et al. [2018] combined C51 with enhancements such as prioritized experience replay [Schaul et al., 2016], n-step updates [Sutton, 1988], and the dueling architecture [Wang et al., 2016], leading to the Rainbow agent, the current state of the art on Atari-57 among non-distributed agents, while the distributed algorithm proposed by Kapturowski et al. [2018] achieves state-of-the-art performance over all agents. From an algorithmic perspective, it is impossible to represent the full space of probability distributions with a finite collection of parameters. Therefore, the parameterization of the quantile function is usually the most crucial part of a general distributional RL algorithm. In C51, the true distribution is projected onto a categorical distribution [Bellemare et al., 2017] with fixed values for parameterization. QR-DQN fixes the probabilities instead of the values and parameterizes the quantile values [Dabney et al., 2018b], while IQN randomly samples the probabilities [Dabney et al., 2018a]. We introduce QR-DQN and IQN in Section 2.2 and extend from their work to ours.

2.2 Quantile Regression for Distributional RL

In contrast to C51, which estimates probabilities for $N$ fixed locations in return space, QR-DQN [Dabney et al., 2018b] estimates the quantile values for $N$ fixed, uniform probabilities. In QR-DQN, the distribution of the random return is approximated by a uniform mixture of $N$ Diracs,
$$Z_{\theta}(x,a) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(x,a)},$$
with each $\theta_i$ assigned a quantile value trained with quantile regression.

Based on QR-DQN, Dabney et al. [2018a] propose using probabilities sampled from a base distribution, e.g., $\tau \sim U([0,1])$, rather than fixed probabilities. They further learn the quantile function that maps from embeddings of sampled probabilities to the corresponding quantile values, called the implicit quantile value network (IQN). At the time of this writing, IQN achieves the state-of-the-art human-normalized mean and median on the Atari-57 benchmark among all agents that do not combine distributed RL, prioritized replay [Schaul et al., 2016] and n-step updates. Dabney et al. [2018a] claim that with enough network capacity, IQN is able to approximate the full quantile function with an infinite number of quantile fractions. However, in practice one needs a finite number of quantile fractions to estimate action values for decision making, e.g., 32 randomly sampled quantile fractions as in Dabney et al. [2018a]. With limited fractions, a natural question arises: how can those fractions be best utilized to find the closest approximation of the true distribution?
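To make the contrast concrete, the sketch below (not from the paper) shows how an action value is recovered from $N$ quantile estimates: QR-DQN evaluates a learned quantile function at fixed, uniform fraction midpoints, while IQN averages it at fractions sampled from $U(0,1)$. The closed-form `quantile_value` function is a stand-in for a learned quantile value network.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 32

# Stand-in for a learned quantile function F_Z^{-1}(tau) of a single (x, a) pair.
def quantile_value(tau):
    return 2.0 * tau - 1.0 + 0.1 * np.sin(6 * np.pi * tau)  # non-decreasing on [0, 1]

# QR-DQN: fixed, uniform fraction midpoints tau_hat_i = (2i + 1) / (2N).
tau_hat_fixed = (np.arange(N) + 0.5) / N
q_qrdqn = quantile_value(tau_hat_fixed).mean()  # equal 1/N weights

# IQN: fractions sampled from U(0, 1) at evaluation time.
tau_sampled = rng.uniform(0.0, 1.0, size=N)
q_iqn = quantile_value(tau_sampled).mean()      # Monte Carlo estimate of E[Z]

print(q_qrdqn, q_iqn)
```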
3 Our Algorithm

We propose the Fully parameterized Quantile Function (FQF) for distributional RL. Our algorithm consists of two networks: the fraction proposal network, which generates a set of quantile fractions for each state-action pair, and the quantile value network, which maps probabilities to quantile values. We first describe the fully parameterized quantile function in Section 3.1, with variables on both the probability axis and the value axis. Then we show how to train the fraction proposal network in Section 3.2, and how to train the quantile value network with quantile regression in Section 3.3. Finally, we present our algorithm and describe the implementation details in Section 3.4.

3.1 Fully Parameterized Quantile Function

In FQF, we estimate $N$ adjustable quantile values for $N$ adjustable quantile fractions to approximate the quantile function. The distribution of the return is approximated by a weighted mixture of $N$ Diracs given by
$$Z_{\theta,\tau}(x,a) := \sum_{i=0}^{N-1} (\tau_{i+1} - \tau_i)\, \delta_{\theta_i(x,a)}, \tag{1}$$
where $\delta_z$ denotes a Dirac at $z \in \mathbb{R}$, and $\tau_1, \ldots, \tau_{N-1}$ represent the $N-1$ adjustable fractions satisfying $\tau_{i-1} < \tau_i$, with $\tau_0 = 0$ and $\tau_N = 1$ to simplify notation. Denote by $F_Z^{-1}$ the quantile function [Müller, 1997], the inverse of the cumulative distribution function $F_Z(z) = \Pr(Z \le z)$. By definition,
$$F_Z^{-1}(p) := \inf \{ z \in \mathbb{R} : p \le F_Z(z) \},$$
where $p$ is what we refer to as a quantile fraction.

Based on the distribution in Eq. (1), denote by $\Pi^{\theta,\tau}$ the projection operator that projects a quantile function onto a staircase function supported by $\theta$ and $\tau$; the projected quantile function is given by
$$F_Z^{-1,\tau}(\omega) = \Pi^{\theta,\tau} F_Z^{-1}(\omega) = \theta_0 + \sum_{i=0}^{N-1} (\theta_{i+1} - \theta_i)\, H_{\tau_{i+1}}(\omega),$$
where $H$ is the Heaviside step function and $H_\tau(\omega)$ is short for $H(\omega - \tau)$. Figure 1 gives an example of such a projection. For each state-action pair $(x,a)$, we first generate the set of fractions $\tau$ using the fraction proposal network, and then obtain the quantile values $\theta$ corresponding to $\tau$ using the quantile value network.

To measure the distortion between the approximated quantile function and the true quantile function, we use the 1-Wasserstein metric,
$$W_1(Z, \theta, \tau) = \sum_{i=0}^{N-1} \int_{\tau_i}^{\tau_{i+1}} \left| F_Z^{-1}(\omega) - \theta_i \right| d\omega. \tag{2}$$
Unlike the KL divergence used in C51, which considers only the probabilities of the outcomes, the $p$-Wasserstein metric takes both the probability and the distance between outcomes into consideration. Figure 1 illustrates how different approximations affect the $W_1$ error. Note, however, that in practice Eq. (2) cannot be computed without bias.

Figure 1: Two approximations of the same quantile function using different sets of $\tau$ with $N = 6$; the area of the shaded region equals the 1-Wasserstein error. (a) Finely adjusted $\tau$ with minimized $W_1$ error. (b) Randomly chosen $\tau$ with larger $W_1$ error.

3.2 Training the Fraction Proposal Network

To achieve minimal 1-Wasserstein error, we start by fixing $\tau$ and finding the optimal corresponding quantile values $\theta$. In QR-DQN, Dabney et al. [2018b] give an explicit form of $\theta$ that achieves this goal. We extend it to our setting:

Lemma 1 ([Dabney et al., 2018b]). For any $\tau_1, \ldots, \tau_{N-1} \in [0,1]$ satisfying $\tau_{i-1} < \tau_i$ for all $i$, with $\tau_0 = 0$ and $\tau_N = 1$, and a cumulative distribution function $F_Z$ with inverse $F_Z^{-1}$, the set of $\theta$ minimizing Eq. (2) is given by
$$\theta_i = F_Z^{-1}\!\left(\frac{\tau_i + \tau_{i+1}}{2}\right). \tag{3}$$

We can now substitute $\theta_i$ in Eq. (2) with Eq. (3) and find the optimality condition on $\tau$ that minimizes $W_1(Z, \tau)$. For simplicity, we denote $\hat{\tau}_i = \frac{\tau_i + \tau_{i+1}}{2}$.

Proposition 1. For any continuous quantile function $F_Z^{-1}$ that is non-decreasing, define the 1-Wasserstein loss of $F_Z^{-1}$ and $F_Z^{-1,\tau}$ by
$$W_1(Z, \tau) = \sum_{i=0}^{N-1} \int_{\tau_i}^{\tau_{i+1}} \left| F_Z^{-1}(\omega) - F_Z^{-1}(\hat{\tau}_i) \right| d\omega. \tag{4}$$
Then $\partial W_1 / \partial \tau_i$ is given by
$$\frac{\partial W_1}{\partial \tau_i} = 2 F_Z^{-1}(\tau_i) - F_Z^{-1}(\hat{\tau}_i) - F_Z^{-1}(\hat{\tau}_{i-1}), \quad i = 1, \ldots, N-1. \tag{5}$$
Furthermore, for any $\tau_{i-1}, \tau_{i+1} \in [0,1]$ with $\tau_{i-1} < \tau_{i+1}$, there exists $\tau_i \in (\tau_{i-1}, \tau_{i+1})$ such that $\partial W_1 / \partial \tau_i = 0$.

The proof of Proposition 1 is given in the appendix. While computing $W_1$ without bias is usually impractical, Eq. (5) provides a way to minimize $W_1$ without computing it. Let $w_1$ be the parameters of the fraction proposal network $P$. For an arbitrary quantile function $F_Z^{-1}$, we can minimize $W_1$ by iteratively applying gradient descent to $w_1$ according to Eq. (5), and convergence is guaranteed. As the true quantile function $F_Z^{-1}$ is unknown in practice, we use the quantile value network $F_{Z,w_2}^{-1}$ with parameters $w_2$, evaluated at the current state and action, in place of the true quantile function.
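As a sanity check on Eq. (5) (not part of the original paper), the NumPy sketch below compares the analytic gradient with a central finite difference of the integral in Eq. (4), using an arbitrary smooth, non-decreasing function as a stand-in for $F_Z^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary smooth, non-decreasing stand-in for the true quantile function F_Z^{-1}.
def F_inv(w):
    return 2.0 * w - 1.0 + 0.5 * np.sin(np.pi * w)

def w1_error(tau, grid=10001):
    """1-Wasserstein error of Eq.(4), integrated numerically on each [tau_i, tau_{i+1}]."""
    tau_hat = (tau[:-1] + tau[1:]) / 2
    total = 0.0
    for i in range(len(tau) - 1):
        w = np.linspace(tau[i], tau[i + 1], grid)
        total += np.trapz(np.abs(F_inv(w) - F_inv(tau_hat[i])), w)
    return total

N = 6
tau = np.concatenate(([0.0], np.sort(rng.uniform(size=N - 1)), [1.0]))
tau_hat = (tau[:-1] + tau[1:]) / 2

i = 3  # any interior fraction index in 1..N-1
grad_eq5 = 2 * F_inv(tau[i]) - F_inv(tau_hat[i]) - F_inv(tau_hat[i - 1])  # Eq.(5)

eps = 1e-4  # central finite difference of Eq.(4) with respect to tau_i
tau_plus, tau_minus = tau.copy(), tau.copy()
tau_plus[i] += eps
tau_minus[i] -= eps
grad_fd = (w1_error(tau_plus) - w1_error(tau_minus)) / (2 * eps)

print(grad_eq5, grad_fd)  # the two estimates should agree to several decimal places
```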
The expected return, i.e., the action value under FQF, is then given by
$$Q(x,a) = \sum_{i=0}^{N-1} (\tau_{i+1} - \tau_i)\, F_{Z,w_2}^{-1}(\hat{\tau}_i),$$
where $\tau_0 = 0$ and $\tau_N = 1$.

3.3 Training the Quantile Value Network

With the properly chosen probabilities, we combine quantile regression and the distributional Bellman update on the optimized probabilities to train the quantile function. Let $Z$ be the random variable denoting the action value at $(x_t, a_t)$ and $Z'$ the action-value random variable at $(x_{t+1}, a_{t+1})$. The weighted temporal difference (TD) error for two probabilities $\hat{\tau}_i$ and $\hat{\tau}_j$ is defined by
$$\delta_{ij}^{t} = r_t + \gamma F_{Z',w_2}^{-1}(\hat{\tau}_i) - F_{Z,w_2}^{-1}(\hat{\tau}_j). \tag{6}$$
Quantile regression is used in QR-DQN and IQN to stochastically adjust the quantile estimates so as to minimize the Wasserstein distance to a target distribution. We follow QR-DQN and IQN, where quantile value networks are trained by minimizing the Huber quantile regression loss [Huber, 1964] with threshold $\kappa$,
$$\rho_{\tau}^{\kappa}(\delta_{ij}) = \left| \tau - \mathbb{I}\{\delta_{ij} < 0\} \right| \frac{L_{\kappa}(\delta_{ij})}{\kappa}, \quad \text{with} \quad
L_{\kappa}(\delta_{ij}) = \begin{cases} \frac{1}{2} \delta_{ij}^{2}, & \text{if } |\delta_{ij}| \le \kappa, \\ \kappa \left( |\delta_{ij}| - \frac{1}{2} \kappa \right), & \text{otherwise.} \end{cases}$$
The loss of the quantile value network is then given by
$$L(x_t, a_t, r_t, x_{t+1}) = \frac{1}{N} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \rho_{\hat{\tau}_j}^{\kappa}\!\left(\delta_{ij}^{t}\right). \tag{7}$$
Note that $F_Z^{-1}$ and its Bellman target share the same proposed quantile fractions $\hat{\tau}$ to reduce computation.

We perform a joint gradient update for $w_1$ and $w_2$, as illustrated in Algorithm 1.

Algorithm 1: FQF update
  Parameters: $N$, $\kappa$
  Input: $x, a, r, x'$, $\gamma \in [0,1)$
  // Compute proposed fractions for $(x, a)$
  $\tau \leftarrow P_{w_1}(x, a)$
  // Compute proposed fractions for each $(x', a')$
  for $a' \in \mathcal{A}$ do
      $\tau^{a'} \leftarrow P_{w_1}(x', a')$
  end for
  // Compute the greedy action
  $Q(x', a') \leftarrow \sum_{i=0}^{N-1} (\tau_{i+1}^{a'} - \tau_i^{a'})\, F_{Z',w_2}^{-1}(\hat{\tau}_i^{a'})$
  $a^{*} \leftarrow \arg\max_{a'} Q(x', a')$
  // Compute the quantile regression loss $L$
  for $0 \le i \le N-1$ do
      for $0 \le j \le N-1$ do
          $\delta_{ij} \leftarrow r + \gamma F_{Z',w_2}^{-1}(\hat{\tau}_i) - F_{Z,w_2}^{-1}(\hat{\tau}_j)$
      end for
  end for
  $L = \frac{1}{N} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \rho_{\hat{\tau}_j}^{\kappa}(\delta_{ij})$
  // Compute $\partial W_1 / \partial \tau_i$ for $i \in \{1, \ldots, N-1\}$
  $\frac{\partial W_1}{\partial \tau_i} = 2 F_{Z,w_2}^{-1}(\tau_i) - F_{Z,w_2}^{-1}(\hat{\tau}_i) - F_{Z,w_2}^{-1}(\hat{\tau}_{i-1})$
  Update $w_1$ with $\partial W_1 / \partial \tau_i$; update $w_2$ with $\nabla L$
  Output: $Q$

3.4 Implementation Details

Our fraction proposal network is represented by a single fully-connected MLP layer. It takes the state embedding of the original IQN as input and generates fraction proposals. Recall that Proposition 1 requires $\tau_{i-1} < \tau_i$ with $\tau_0 = 0$ and $\tau_N = 1$. While it is feasible to fix $\tau_0 = 0$ and $\tau_N = 1$ and sort the output of $w_1$, the sort operation would make the network hard to train. A more reasonable and practical way is to let the network produce sorted output automatically through a cumulative softmax. Let $q \in \mathbb{R}^N$ denote the output of a softmax layer; then $q_i \in (0,1)$ for $i \in \{0, \ldots, N-1\}$ and $\sum_{i=0}^{N-1} q_i = 1$. Letting $\tau_i = \sum_{j=0}^{i-1} q_j$ for $i \in \{0, \ldots, N\}$, we straightforwardly have $\tau_{i-1} < \tau_i$, so the proposed fractions are sorted by construction.
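The following is a minimal PyTorch sketch of such a fraction proposal head, consistent with the description above but not the authors' code; the class name, argument names, and shapes are illustrative.

```python
import torch
import torch.nn as nn

class FractionProposalHead(nn.Module):
    """One linear layer on top of a state embedding, followed by a cumulative
    softmax so that tau_0 = 0 < tau_1 < ... < tau_N = 1 without an explicit sort."""

    def __init__(self, embedding_dim: int, n_fractions: int):
        super().__init__()
        self.fc = nn.Linear(embedding_dim, n_fractions)

    def forward(self, state_embedding: torch.Tensor):
        q = torch.softmax(self.fc(state_embedding), dim=-1)   # q_i > 0, sums to 1
        tau = torch.cumsum(q, dim=-1)                         # tau_1, ..., tau_N (tau_N = 1)
        tau = torch.cat([torch.zeros_like(tau[..., :1]), tau], dim=-1)  # prepend tau_0 = 0
        tau_hat = (tau[..., :-1] + tau[..., 1:]) / 2          # fraction midpoints, as in Eq.(3)
        return tau, tau_hat

# Illustrative shapes: a batch of 4 embeddings of size 64, N = 32 fractions.
head = FractionProposalHead(embedding_dim=64, n_fractions=32)
tau, tau_hat = head(torch.randn(4, 64))   # tau: (4, 33), tau_hat: (4, 32)
```

Because the softmax outputs are strictly positive, the cumulative sums are strictly increasing, which enforces $\tau_{i-1} < \tau_i$ by construction.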
4 Experiments

|         | Mean  | Median | >Human | >DQN |
|---------|-------|--------|--------|------|
| DQN     | 221%  | 79%    | 24     | 0    |
| PRIOR.  | 580%  | 124%   | 39     | 48   |
| C51     | 701%  | 178%   | 40     | 50   |
| RAINBOW | 1213% | 227%   | 42     | 52   |
| QR-DQN  | 902%  | 193%   | 41     | 54   |
| IQN     | 1112% | 218%   | 39     | 54   |
| FQF     | 1426% | 272%   | 44     | 54   |

Table 1: Mean and median scores across 55 Atari 2600 games, measured as percentages of the human baseline. Scores are averages over 3 seeds.

Table 1 compares the mean and median human-normalized scores across 55 Atari games with up to 30 random no-op starts; the full score table is provided in the Appendix. It shows that FQF outperforms all existing distributional RL algorithms, including Rainbow [Hessel et al., 2018], which combines C51 with prioritized replay and n-step updates.
