已阅读5页,还剩4页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Backpropagation-Friendly Eigendecomposition Wei Wang1, Zheng Dang2, Yinlin Hu1, Pascal Fua1, and Mathieu Salzmann1 1CVLab, EPFL, CH-1015 Lausanne, Switzerland first.lastepfl.ch 2Xian Jiaotong University, China dangzheng713 Abstract Eigendecomposition (ED) is widely used in deep networks. However, the backprop- agation of its results tends to be numerically unstable, whether using ED directly or approximating it with the Power Iteration method, particularly when dealing with large matrices. While this can be mitigated by partitioning the data in small and arbitrary groups, doing so has no theoretical basis and makes its impossible to exploit the power of ED to the full. In this paper, we introduce a numerically stable and differentiable approach to leveraging eigenvectors in deep networks. It can handle large matrices without requiring to split them. We demonstrate the better robustness of our approach over standard ED and PI for ZCA whitening, an alternative to batch normalization, and for PCA denoising, which we introduce as a new normalization strategy for deep networks, aiming to further denoise the networks features. 1Introduction In recent years, Eigendecomposition (ED) has been incorporated in deep learning algorithms to perform PCA whitening 1, ZCA whitening 2, second-order pooling 3, generative modeling 4,5, keypoint detection and matching 68, pose estimation 9, camera self-calibration 10, and graph matching 11. This requires backpropagating the loss through the ED, which can be done but is often unstable. This is because, as shown in 3, the partial derivatives of an ED-based loss depend on a matrix e K with elements e Kij= ? 1/(ij), i 6= j 0,i = j ,(1) whereidenotes theitheigenvalue of the matrix being processed. Thus, when two eigenvalues are very close, the partial derivatives become very large, causing arithmetic overfl ow. The Power Iteration (PI) method 12 is one way around this problem. In its standard form, PI relies on an iterative procedure to approximate the dominant eigenvector of a matrix starting from an initial estimate of this vector. This has been successfully used for graph matching 11 and spectral normalization in generative models 4, which only require the largest eigenvalue. In theory, PI can be used in conjunction with a defl ation procedure 13 to fi nd all eigenvalues and eigenvectors. This involves computing the dominant eigenvector, removing the projection of the input matrix on this vector, and iterating. Unfortunately, two eigenvalues being close to each other or one being close to zero can trigger large round-off errors that eventually accumulate and result in inaccurate eigenvector estimates. The results are also sensitive to the number of iterations and to how the vector is initialized at the start of each defl ation step. Finally, the convergence speed decreases signifi cantly when the ratio between the dominant eigenvalue and others becomes close to one. In short, both SVD 3 and PI 11 are unsuitable for use in a deep network that requires the computation of gradients to perform back-propagation. This is particularly true when dealing with large matrices for which the chances of two eigenvalues being almost equal is larger than for small 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. ones. This why another popular way to get around these diffi culties is to use smaller matrices, for example by splitting the feature channels into smaller groups before computing covariance matrices, as in 2 for ZCA whitening 14,15. This, however, imposes arbitrary constraints on learning. They are not theoretically justifi ed and may degrade performance. In this paper, we therefore introduce a numerically stable and differentiable approach to performing ED within deep networks in such as way that their training is robust. To this end, we leverage the fact that the forward pass of ED is stable and yields accurate eigenvector estimates. As the aforementioned problems come from the backward pass, once the forward pass is complete we rely on PI for backprogation purposes, leveraging the ED results for initialization. We will show that our hybrid training procedure consistently outperforms both ED and PI in terms of stability for ZCA whitening 2: An alternative to Batch Normalization 16, which involves linearly transforming the feature vectors so that their covariance matrices becomes the identity matrix and that they can be considered as decoupled. PCA denoising 17: A transformation of the data that reduces the dimensionality of the input features by projecting them on a subspace spanned by a subset of the eigenvectors of their covariance matrix, and projects them back to the original space. Here, we introduce PCA denoising as a new normalization strategy for deep networks, aiming to remove the irrelevant signal from their feature maps. Exploiting the full power of both these techniques requires performing ED on relatively large matrices and back-propagating the results, which our technique allows whereas competing ones tend to fail. The code is available at 2Numerically Stable Differentiable Eigendecomposition Given an input matrixM, there are two standard ways to exploit the ED ofMwithin a deep network: 1. Perform ED using SVD or QR decomposition and use analytical gradients for backpropagation. 2. Given randomly-chosen initial guesses for the eigenvectors, run a PI defl ation procedure during the forward pass and compute the corresponding gradients for backpropagation purposes. Unfortunately, as discussed above, the fi rst option is prone to gradient explosion when two or more eigenvalues are close to each other while the accuracy of the second strongly depends on the initial vectors and on the number of iterations in each step of the defl ation procedure. This can be problematic because the PI convergence rate depends geometrically on the ratio|2/1|of the two largest eigenvalues. Therefore, when this ratio is close to 1, the eigenvector estimate may be inaccurate when performing a limited number of iterations. This can lead to training divergence and eventual failure, as we will see in the results section. Our solution is to rely on the following hybrid strategy: 1.Use SVD during the forward pass because, by relying on a divide-and-conquer strategy, it tends to be numerically more stable than QR decomposition 12. 2.Compute the gradients for backpropagation from the PI derivations, but using the SVD-computed vectors for initialization purposes. In the remainder of this section, we show that the resulting PI gradients not only converge to the analytical ED ones, but are bounded from above by a factor depending on the number of PI iterations, thus preventing their explosion in practice. In the results section, we will empirically confi rm this by showing that training an ED-based deep network using our approach consistently converges, whereas the standard SVD and PI algorithms often diverge. 2.1Power Iteration Gradients LetM be a covariance matrix, and therefore be positive semi-defi nite and symmetric. We now focus on the leading eigenvector ofM . Since the defl ation procedure simply iteratively removes the projection ofMon its leading eigenvector, the following derivations remain valid at any step of this procedure. To compute the leading eigenvector v of M, PI relies on the iterative update v(k)=Mv (k1)/kMv(k1)k ,(2) 2 where kk denotes the 2norm. The PI gradients can then be computed as 18. L M= K1 X k=0 ?Iv(k+1)v(k+1)? ? ?Mv(k)? L v(k+1) v(k), L v(k) =M ?Iv(k+1)v(k+1)? ? ?Mv(k)? L v(k+1) . (3) Typically, to initialize PI,v(0)is taken as a random vector such thatkv(0)k=1. Here, however, we rely on SVD to compute the true eigenvectorv. Becausevis an accurate estimate of the eigenvector, feeding it as initial value in PI will yieldv=v(0)v(1)v(2)v(k)v(K). Exploiting this in Eq. 3 and introducing the explicit form of L v(k), k = 1,2, ,K, into L M lets us write L M = ? I vv? kMvk + M ?I vv? kMvk2 + + MK1 ?I vv? kMvkK ! L v(K) v.(4) The details of this derivation are provided in the supplementary material. In our experiments, Eq. 4 is the form we adopt to compute the ED gradients. 2.2Relationship between PI and Analytical ED Gradients We now show that whenK goes to infi nity, the PI gradients of Eq. 4 are the same as the analytical ED ones. Power Iteration Gradients Revisited.To reformulate the PI gradients, we rely on the fact that Mk= VkV= k 1v1v 1 + k 2v2v 2 + + k nvnv n ,(5) and thatkMvk = kvk = , wherev = v1is the dominant eigenvector and = 1is the dominant eigenvalue. Introducing Eq. 5 into Eq. 4, lets us re-write the gradient as L M = ?P n i=2viv i ? 1 + ?P n i=2iviv i ? 2 1 + + ?P n i=2 K1 i viv i ? K 1 ! L v(K) 1 v 1 = n X i=2 ? 1/1+1/1(i/1)1+ +1/1(i/1)K1 ? viv i ! L/v(K) 1 v 1 . (6) Eq. 6 defi nes a geometric progression. Given that 1 (i/1)k 1, when k , because|i/1|1, we have 1 1 + 1 1 ? i 1 ?1 + + 1 1 ? i 1 ?k1 = 1 1(1 ( i 1) k) 1 i 1 1 1 1 i 1 ,when k .(7) Introducing Eq. 7 into Eq. 6 yields L M = n X i=2 1 1 1 i 1 ! viv i ! L v(k) 1 v 1 = n X i=2 viv i 1 i ! L v(k) 1 v 1 .(8) Analytical ED GradientsAs shown in 3, the analytic form of the ED gradients can be written as L M = v1 (? e K ? v 1 L v1 ? + ? L ? diag ) v 1 ,(9) where e K given by Eq. 1. By making use of the same properties as before, this can be re-written as L M = n X i=2 1 1 i viv i L v1 v 1 + ? ? ? L i viv i .(10) The the last term of Eq. 10 can be ignored becauseiis not involved in any computation during the forward pass. Instead, we compute the eigenvalue as the Rayleigh quotientv Mv/vv, which only depends on the eigenvector and thus only need the gradients w.r.t. to them. A detailed derivation is provided in the supplementary material. This shows that the partial derivatives ofv1computed using PI have the same form as the analytical ED ones whenk . Similar derivations can be done forvi,i = 2,3,. . This justifi es our use of PI to approximate the analytical ED gradients during backpropagation. We now turn to showing that the resulting gradient estimates are upper-bounded and can therefore not explode. 3 0.01 0.01 0.01 0.01 0.05 0.05 0.05 0.05 0.2 0.2 0.2 0.2 5 10 15 20 25 (a)(b) Figure 1: (a) shows how the value of(k/1)kchanges w.r.t. the eigenvalue ratiok/1and iteration number k. (b) shows the contour of curved surface in (a). i/10.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.85 0.9 0.95 0.99 0.995 0.999 k = dln(0.05)/ln(i/1)e2234569141929592995982995 Table 1: The minimum value of k we need to guarantee (i/1)k i L v1 v 1 n X i=2 ? 1 1 + 1 1 q1+ 1 1 qK1 ? viv i ! L v1 v 1 ,(11) where q=i/1. Therefore, ? ? ? ? L M ? ? ? ? ? ? ? ? ? n X i=2 1 1 + 1 1 ? i 1 ?1 + 1 1 ? i 1 ?2 + 1 1 ? i 1 ?K1! viv i ? ? ? ? ? ? ? ? ? L v1 ? ? ? ? ? ?v 1 ? ? ? ? ? ? ? n X i=2 ? 1 1 + 1 1 ? viv i ? ? ? ? ? ? ? ? ? L v1 ? ? ? ? ? ?v 1 ? ? n X i=2 ? ? ? ? K 1 viv i ? ? ? ? ? ? ? ? L v1 ? ? ? ? ? ?v 1 ? ?nK 1 ? ? ? ? L v1 ? ? ? ? . (12) This yields an upper bound of ? ?L M ? ?. However, if1= 0, this upper bound also becomes. To avoid this, knowing thatM is symmetric positive semi-defi nite, we modify it asM = M+?I, where Iis the identity matrix. This guarantees that the eigenvalues ofM + ?Iare greater than or equal to?. In practice, we set ? = 104. Thus, we can write ? ? ? ? L (M + ?I) ? ? ? ? nK ? ? ? ? ? L v1 ? ? ? ? ,(13) wherenis the dimension of the matrix andKis the power iteration number, which means that choosing a specifi c value ofKamounts to choosing an upper bound for the gradient magnitudes. We now provide an empirical approach to doing so. 2.4Choosing an Appropriate Power Iteration Number K Recall from Eqs. 7 and 8 that, for the PI gradients to provide a good estimate of the analytical ones, we need(i/1)kto go to zero. Fig. 1 shows how the value(i/1)kevolves for different power iteration numberkand ratioi/1. This suggests the need to select an appropriatekfor eachi/1. To this end, we assume that(i/1)k0.05is a good approximation to(i/1)k=0. Then we have (i/1)k0.05 k ln(i/1)ln(0.05) k ln(0.05)/ln(i/1).(14) That is, the minimum value of k to satisfy (i/1)k0.05 is k = dln(0.05)/ln(i/1)e. Table 1 shows the minimum number of iterations required to guarantee that our assumption holds for different values ofi/1. Note that when two eigenvalues are very close to each other, e.g., i/1= 0.999, we need about 3000 iterations to achieve a good approximation. However, this is rare in practice. In CIFAR-10, using ResNet18, we observed that the averagei/i1lies in0.84,0.86 4 interval for the mini-batches. Given the values shown in Table 1, we therefore set the power iteration number to be K = 19, which yields a good approximation in all these cases. 2.5Practical Considerations In practice, we found that other numerical imprecisions due to the computing platform itself, such as the use of single vs double precision, could further affect training with eigendecomposition. Here, we discuss our practical approach to dealing with these issues. The fi rst issue we observed was that, in the forward pass, the eigenvalues computed using SVD may be inaccurate, with SVD sometimes crashing when the matrix is ill-conditioned. Recall that, to increase stability, we add a small value?to the diagonal of the input matrixM. As a consequence, all the eigenvalues should be greater than or equal to? . However, this is not the case when we use fl oat instead of double precision. To solve this problem, we employ truncated SVD. That is, we discard the eigenvectors whose eigenvalue i ? and the subsequent ones. The second issue is related to the precision of the eigenvalues computed using e i= vfMv/vv. Because of the round-off error from f M = f M fMviv i , e imay be inaccurate and sometimes negative. To avoid using incorrect eigenvalues, we also need to truncate it. These two practical issues correspond to the breaking conditionsi ?, e ii i 0.1 defi ned in Alg.1 that provides the pseudo-code for our ZCA whitening application. Furthermore, we add another constraint,i (10.0001), implying that we also truncate the eigendecomposition if the remaining energy infM is less than 0.0001. Note that this last condition can be modifi ed by using a different threshold. To illustrate this, we therefore also perform experiments with a higher threshold, thus leading to our PCA denoising layer, which aims to discard the noise in the network features. As shown in our experiments, our PCA denoising layer achieves competitive performance compared with the ZCA one. Algorithm 1: Forward Pass of ZCA whitening in Practice. Data: =Avg(X), e X=X,M=eXeX+?I, (XRcn); Result: VV=SVD(M); =diag(1,.,n); V=v1, ,vn; i= Pi k=1k/Pn k=1k; Input: E 0, ES I,fM M, rank 1, momentum 0.1 1for i = 1 : n do 2vi= Power Iteration (fM,vi); e i=v i f Mvi/(v i vi); f M =fM fMviv i ; 3if i? or| e ii|/i0.1 or i(10.0001) then 4break; 5else 6rank=i,e=f1, , e i. 7end 8end 9truncate eigenvector matrix: e V v1, ,vrank; 10compute subspace: S e V(e) 1 2eV; 11compute ZCA transformation: X SeX ; 12update running mean:E=momentum + (1 momentum) E; 13update running subspace: ES=momentum S + (1 momentum) ES; Output: compute affi ne transformation: X = X + In Alg.1, the operation:vi= Power Iteration(fM,vi) has no computation involved in the forward pass. It only serves to savefM,vithat will be used to compute the gradients during the backward pass. Furthermore, compared with standard batch normalization 16, we replace the running variance by a running subspaceESbut keep the learnable parameters,. The running meanEand running subspace ESare used in the testing phase. 3Experiments We now demonstrate the effectiveness of our Eigendecomposition layer by using it to perform ZCA whitening and PCA denoising. ZCA whitening has been shown to improve classifi cation performance 5 MethodsError Matrix Dimension d = 4d = 8d = 16d = 32d = 64 SVD Min4.59- Mean4.54 0.08- Success Rate46.7%0%0%0%0% PI Min4.446.28- Mean4.990.51- Success Rate100%6.7%0%0%0% Ours Min4.594.434.404.464.44 Mean4.710.114.620.184.630.144.640.154.59 0.09 Success Rate100%100%100%100%100% Table 2: Errors and success rates using ResNet 18 with standard SVD, Power Iteration (PI), and our method on CIFAR10. d is the size of the feature groups we process individually. Methods Error BNd = 64d = 32d = 16d = 8d = 4 ResNet18 Min21.6821.0421.3621.1421.1521.03 Mean 21.850.14 21.390.23 21.580.27 21.450.25 21.560.35 21.510.28 ResNet50 Min20.7919.2819.2419.7820.1520.66 Mean 21.620.65 19.940.44 19.540.23 19.920.12 20.590.58 20.980.31 Table 3: Errors rates using ResNet18 or ResNet50 with Batch Normalization (BN) or our method on CIFAR100. d is the size of the feature groups we process individually. over batch normalization 2. We will demonstrate that we can deliver a further boost by making it possible to handle larger covariance matrices within the network. PCA denoising has been used for image denoising 17 but not in a deep learning context, presumably because it requires performing ED on large matrices. We will show that our approach solves this problem. In all the experiments below, we use either Resnet18 or Resnet50 19 as our backbone. We retain their original architectures but introduce an additional layer between the fi rst convolutional layer and the fi rst pooling layer. For both ZCA and PCA, the new layer computes the covariance matrix of the feature vectors, eigendecomposes it, and uses the eigenvalues and eigenvectors as described below. As discusse
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- GB/T 32575-2026发电工程数据移交
- 护理工作中的护理评价课件
- 护理技术与患者沟通技巧
- 2025年家庭健康管理师 床垫周期数据的个性化指导服务
- 护理评估中的重症监护
- 压缩机装配调试工诚信道德考核试卷含答案
- 油气管道保护工达标知识考核试卷含答案
- 办公设备与耗材再制造工岗前风险评估与管理考核试卷含答案
- 2026年新科教版高中高一历史上册第一单元先秦政治制度特征卷含答案
- 摊商岗前安全专项考核试卷含答案
- 2026陕西氢能产业发展有限公司(榆林)所属单位社会招聘27人笔试历年参考题库附带答案详解
- GB/T 25000.51-2016系统与软件工程系统与软件质量要求和评价(SQuaRE)第51部分:就绪可用软件产品(RUSP)的质量要求和测试细则
- GB/T 14406-2011通用门式起重机
- 大一《有机化学》题库Word版
- 【自学考试资料】2110考期古文史二全书笔记汇总
- 英语课题结题报告范文
- 支气管哮喘内科学课件
- tax3型机车安全信息综合监测装置用户手册v3
- 低压电工安全培训
- 2022同等学力计算机综合真题无答案解析
- 精神病学课件:精神活性物质所致精神障碍
评论
0/150
提交评论