Screening Sinkhorn Algorithm for Regularized Optimal Transport

Mokhtar Z. Alaya, LITIS EA4108, University of Rouen Normandy (mokhtarzahdi.alaya)
Maxime Berar, LITIS EA4108, University of Rouen Normandy (maxime.berar@univ-rouen.fr)
Gilles Gasso, LITIS EA4108, INSA, University of Rouen Normandy (gilles.gasso@insa-rouen.fr)
Alain Rakotomamonjy, LITIS EA4108, University of Rouen Normandy and Criteo AI Lab, Criteo Paris (alain.rakoto@insa-rouen.fr)

Abstract

We introduce in this paper a novel strategy for efficiently approximating the Sinkhorn distance between two discrete measures. After identifying negligible components of the dual solution of the regularized Sinkhorn problem, we propose to screen those components by directly setting them at that value before entering the Sinkhorn problem. This allows us to solve a smaller Sinkhorn problem while ensuring approximation with provable guarantees. More formally, the approach is based on a new formulation of the dual of the Sinkhorn divergence problem and on the KKT optimality conditions of this problem, which enable identification of the dual components to be screened. This new analysis leads to the SCREENKHORN algorithm. We illustrate the efficiency of SCREENKHORN on complex tasks such as dimensionality reduction and domain adaptation involving regularized optimal transport.

1 Introduction

Computing optimal transport (OT) distances between pairs of probability measures or histograms, such as the earth mover's distance [39, 34] and the Monge-Kantorovich or Wasserstein distance [38], is currently attracting increasing interest in machine learning [37, 28, 4, 22], statistics [18, 32, 14, 6, 17], and computer vision [8, 34, 36], among other applications [27, 33]. In many of these problems, OT exploits the geometric features of the objects at hand in the underlying spaces when comparing probability measures. This effectively leads to improved performance over methods that are oblivious to the geometry, such as the chi-squared distance or the Kullback-Leibler divergence. Unfortunately, this advantage comes at the price of an enormous computational cost for solving the OT problem, which can be prohibitive in large-scale applications. For instance, the OT between two histograms with supports of equal size $n$ can be formulated as a linear program that generally requires super-$O(n^{2.5})$ arithmetic operations [29], which is problematic when $n$ becomes large.

A remedy to the heavy computational burden of OT lies in a prevalent approach referred to as regularized OT [11], which operates by adding an entropic regularization penalty to the original problem. Such a regularization guarantees a unique solution, since the objective function is strongly convex, as well as greater computational stability. More importantly, this regularized OT can be solved efficiently with celebrated matrix-scaling algorithms, such as Sinkhorn's fixed-point iteration method [35, 26, 23].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Several works have considered further improvements in the resolution of this regularized OT problem. A greedy version of the Sinkhorn algorithm, called Greenkhorn [3], selects and updates the columns and rows that most violate the polytope constraints. Another approach, based on a low-rank approximation of the
cost matrix using the Nyström method, induces the Nys-Sink algorithm [2]. Other classical optimization algorithms have been considered for approximating the OT, for instance accelerated gradient descent [40, 13, 30], quasi-Newton methods [7, 12] and stochastic gradient descent [20, 1].

In this paper, we propose a novel technique for accelerating the Sinkhorn algorithm when computing the regularized OT distance between discrete measures. Our idea is strongly related to the screening strategy used when solving a Lasso problem in sparse supervised learning [21]. Based on the fact that a transport plan resulting from an OT problem is sparse or has a large number of negligible values [7], our objective is to identify the dual variables of an approximate Sinkhorn problem that are smaller than a predefined threshold, and that can thus be safely removed before optimization while not altering too much the solution of the problem. Within this global context, our contributions are the following:

- From a methodological point of view, we propose a new formulation of the dual of the Sinkhorn divergence problem by imposing variables to be larger than a threshold. This formulation allows us to introduce sufficient conditions, computable beforehand, for a variable to strictly satisfy its constraint, leading then to a "screened" version of the dual of the Sinkhorn divergence.
- We provide a theoretical analysis of the solution of the "screened" Sinkhorn divergence, showing that its objective value and the marginal constraint satisfaction are properly controlled as the number of screened variables decreases.
- From an algorithmic standpoint, we use a constrained L-BFGS-B algorithm [31, 9], but provide a careful analysis of the lower and upper bounds of the dual variables, resulting in a well-posed and efficient algorithm denoted as SCREENKHORN.
- Our empirical analysis depicts how the approach behaves in a simple Sinkhorn divergence computation context. When considered in complex machine learning pipelines, we show that SCREENKHORN can lead to strong gains in efficiency while not compromising on accuracy.

The remainder of the paper is organized as follows. In Section 2 we briefly review the basic setup of regularized discrete OT. Section 3 contains our main contribution, that is, the SCREENKHORN algorithm. Section 4 is devoted to theoretical guarantees for marginal violations of SCREENKHORN. In Section
5 we present numerical results for the proposed algorithm, compared with the state-of-the-art Sinkhorn algorithm as implemented in [16]. The proofs of the theoretical results, as well as additional empirical results, are postponed to the supplementary material.

Notation. For any positive matrix $T \in \mathbb{R}^{n \times m}$, we define its entropy as $H(T) = -\sum_{i,j} T_{ij} \log(T_{ij})$. Let $r(T) = T\mathbf{1}_m \in \mathbb{R}^n$ and $c(T) = T^\top\mathbf{1}_n \in \mathbb{R}^m$ denote the row and column sums of $T$, respectively. The coordinates $r_i(T)$ and $c_j(T)$ denote the $i$-th row sum and the $j$-th column sum of $T$, respectively. The scalar product between two matrices denotes the usual inner product, that is, $\langle T, W \rangle = \mathrm{tr}(T^\top W) = \sum_{i,j} T_{ij} W_{ij}$, where $T^\top$ is the transpose of $T$. We write $\mathbf{1}$ (resp. $\mathbf{0}$) for the vector having all coordinates equal to one (resp. zero), and $\mathrm{diag}(w) \in \mathbb{R}^{n \times n}$ for the diagonal matrix with entries $w_1, \ldots, w_n$ when $w \in \mathbb{R}^n$. For a set of indices $L = \{i_1, \ldots, i_k\} \subseteq \{1, \ldots, n\}$ satisfying $i_1 < \cdots < i_k$, we write $w_L = (w_{i_1}, \ldots, w_{i_k})^\top \in \mathbb{R}^k$ and its complementary $w_{L^c} \in \mathbb{R}^{n-k}$. The notation is similar for matrices: given another subset of indices $S = \{j_1, \ldots, j_l\} \subseteq \{1, \ldots, m\}$ with $j_1 < \cdots < j_l$, $T_{(L,S)}$ denotes the submatrix of $T$ obtained by keeping the rows indexed by $L$ and the columns indexed by $S$.

2 Regularized discrete OT

Given two discrete probability measures $\mu$ and $\nu$ supported on $n$ and $m$ points respectively, and a cost matrix $C \in \mathbb{R}_+^{n \times m}$, the regularized OT problem reads
$$\mathcal{S}_\eta(\mu,\nu) = \min_{P \in \Pi(\mu,\nu)} \big\{ \langle C, P \rangle - \eta H(P) \big\}, \qquad (1)$$
where $\Pi(\mu,\nu) = \{P \in \mathbb{R}_+^{n \times m} : P\mathbf{1}_m = \mu,\; P^\top\mathbf{1}_n = \nu\}$ and $\eta > 0$ is a regularization parameter. We refer to $\mathcal{S}_\eta(\mu,\nu)$ as the Sinkhorn divergence [11].

Dual of Sinkhorn divergence. Below we provide the derivation of the dual problem for the regularized OT problem (1). Towards this end, we begin by writing its Lagrangian dual function:
$$\mathcal{L}(P, w, z) = \langle C, P \rangle + \eta\,\langle \log P, P \rangle + \langle w, P\mathbf{1}_m - \mu \rangle + \langle z, P^\top\mathbf{1}_n - \nu \rangle.$$
The dual of the Sinkhorn divergence can be derived by solving $\min_{P \in \mathbb{R}_+^{n \times m}} \mathcal{L}(P, w, z)$. It is easy to check that the objective function $P \mapsto \mathcal{L}(P, w, z)$ is strongly convex and differentiable. Hence, one can solve the latter minimization by setting $\nabla_P \mathcal{L}(P, w, z)$ to $\mathbf{0}_{n \times m}$. Therefore, we get $P^\star_{ij} = \exp\big(-\tfrac{1}{\eta}(w_i + z_j + C_{ij}) - 1\big)$ for all $i = 1, \ldots, n$ and $j = 1, \ldots, m$. Plugging in this solution, and applying the change of variables $u = -w/\eta - 1/2$ and $v = -z/\eta - 1/2$, the dual problem is given by
$$\min_{u \in \mathbb{R}^n, v \in \mathbb{R}^m} \Big\{ \Psi(u,v) := \mathbf{1}_n^\top B(u,v)\mathbf{1}_m - \langle u, \mu \rangle - \langle v, \nu \rangle \Big\}, \qquad (2)$$
where $B(u,v) := \mathrm{diag}(e^u)\, K\, \mathrm{diag}(e^v)$ and $K := e^{-C/\eta}$ stands for the Gibbs kernel associated to the cost matrix $C$. We refer to problem (2) as the dual of the Sinkhorn divergence. The optimal solution $P^\star$ of the primal problem (1) then takes the form $P^\star = \mathrm{diag}(e^{u^\star})\, K\, \mathrm{diag}(e^{v^\star})$, where the couple $(u^\star, v^\star)$ satisfies $(u^\star, v^\star) = \mathrm{argmin}_{u \in \mathbb{R}^n, v \in \mathbb{R}^m} \Psi(u,v)$. Note that the matrices $\mathrm{diag}(e^{u^\star})$ and $\mathrm{diag}(e^{v^\star})$ are unique up to a constant factor [35]. Moreover, $P^\star$ can be computed efficiently by iterative Bregman projections [5], referred to as Sinkhorn iterations; the resulting method is referred to as the SINKHORN algorithm, which has recently been proven to achieve a near-$O(n^2)$ complexity [3].
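To make the iterations concrete, here is a minimal self-contained sketch of plain Sinkhorn matrix scaling (our own illustration, not the implementation of [16]; the function name, the random cost matrix and all parameter values are assumptions for the example):

```python
import numpy as np

def sinkhorn(C, mu, nu, eta=0.5, n_iter=1000):
    """Plain Sinkhorn iterations for entropic-regularized OT.

    Returns the plan P = diag(a) @ K @ diag(b), with a = e^u, b = e^v
    and K = exp(-C / eta) the Gibbs kernel of problem (2).
    """
    K = np.exp(-C / eta)
    a = np.ones(len(mu))
    b = np.ones(len(nu))
    for _ in range(n_iter):
        a = mu / (K @ b)        # enforce the row marginals P 1_m = mu
        b = nu / (K.T @ a)      # enforce the column marginals P^T 1_n = nu
    return a[:, None] * K * b[None, :]

# Tiny example: two uniform histograms on a random cost matrix.
rng = np.random.default_rng(0)
C = rng.random((5, 4))
mu = np.full(5, 1 / 5)
nu = np.full(4, 1 / 4)
P = sinkhorn(C, mu, nu)
```

After convergence, `P.sum(axis=1)` and `P.sum(axis=0)` match $\mu$ and $\nu$, which is exactly the fixed point of the first-order conditions of (2).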
3 Screened dual of Sinkhorn divergence

Figure 1: Plots of $(e^{u^\star}, e^{v^\star})$, where $(u^\star, v^\star)$ is the pair solution of the dual of Sinkhorn divergence (2), and of the thresholds $\varepsilon_u$ and $\varepsilon_v$.

Motivation. The key idea of our approach is motivated by the so-called static screening test [21] in supervised learning, which is a method able to safely identify inactive features, i.e., features that have zero components in the solution vector. These inactive features can then be removed from the optimization problem to reduce its scale. Before diving into the detailed algorithmic analysis, let us present a brief illustration of how we adapt the static screening test to the dual of the Sinkhorn divergence. Towards this end, we define the convex set $\mathcal{C}^r_\alpha \subset \mathbb{R}^r$, for $r \in \mathbb{N}$ and $\alpha > 0$, by $\mathcal{C}^r_\alpha = \{w \in \mathbb{R}^r : e^{w_i} \geq \alpha \text{ for all } i\}$. In Figure 1, we plot $(e^{u^\star}, e^{v^\star})$, where $(u^\star, v^\star)$ is the pair solution of the dual of the Sinkhorn divergence (2), in the particular case of: $n = m = 500$, $\eta = 1$, $\mu = \nu = \frac{1}{n}\mathbf{1}_n$, $x_i \sim \mathcal{N}\big((0,0)^\top, \begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\big)$, $y_j \sim \mathcal{N}\big((3,3)^\top, \begin{pmatrix}1 & 0.8\\ 0.8 & 1\end{pmatrix}\big)$, with the cost matrix $C$ given by the pairwise Euclidean distance, i.e., $C_{ij} = \|x_i - y_j\|_2$. We also plot two lines corresponding to $e^{u^\star} = \varepsilon_u$ and $e^{v^\star} = \varepsilon_v$ for some $\varepsilon_u > 0$ and $\varepsilon_v > 0$ chosen arbitrarily, playing the role of thresholds for selecting the indices to be discarded. If we are able to identify these indices before solving the problem, they can be fixed at the thresholds and then removed from the optimization procedure, yielding an approximate solution.

Static screening test. Based on this idea, we define a so-called approximate dual of the Sinkhorn divergence
$$\min_{u \in \mathcal{C}^n_{\varepsilon\kappa^{-1}},\, v \in \mathcal{C}^m_{\varepsilon\kappa}} \Big\{ \Psi_\varepsilon(u,v) := \mathbf{1}_n^\top B(u,v)\mathbf{1}_m - \langle u, \kappa\mu \rangle - \langle v, \nu/\kappa \rangle \Big\}, \qquad (3)$$
which is simply a dual of Sinkhorn divergence with lower-bounded variables, the bounds being $\alpha_u = \varepsilon\kappa^{-1}$ and $\alpha_v = \varepsilon\kappa$, with $\varepsilon \geq 0$ and $\kappa > 0$ fixed numeric constants whose values will become clear later. The new formulation (3) has the form of a $(\kappa\mu, \nu/\kappa)$-scaling problem under constraints on the variables $u$ and $v$. Those constraints make the problem significantly different from the standard scaling problems [24]. We further emphasize that $\kappa$ plays a key role in our screening strategy. Indeed, without $\kappa$, $e^u$ and $e^v$ can have inversely related scales, which may lead, for instance, to $e^u$ being too large and $e^v$ being too small, a situation in which the screening test would apply only to the coefficients of $e^u$ or of $e^v$, and not to both of them. Moreover, it is clear that the approximate dual of the Sinkhorn divergence coincides with the dual of the Sinkhorn divergence (2) when $\varepsilon = 0$ and $\kappa = 1$. Intuitively, our hope is to gain efficiency in solving problem (3) compared to the original one in Equation (2) by avoiding the optimization of variables smaller than the threshold and by identifying those that make the constraints active. More formally, the core of the static screening test aims at locating two subsets of indices $(I, J)$ in $\{1, \ldots, n\} \times \{1, \ldots, m\}$ satisfying: $e^{u^\star_i} > \varepsilon\kappa^{-1}$ and $e^{v^\star_j} > \varepsilon\kappa$ for all $(i,j) \in I \times J$, and $e^{u^\star_{i'}} = \varepsilon\kappa^{-1}$ and $e^{v^\star_{j'}} = \varepsilon\kappa$ for all $(i',j') \in I^c \times J^c$, namely $(u^\star, v^\star) \in \mathcal{C}^n_{\varepsilon\kappa^{-1}} \times \mathcal{C}^m_{\varepsilon\kappa}$. The following key result states sufficient conditions for identifying the variables in $I$ and $J$.

Lemma 1. Let $(u^\star, v^\star)$ be an optimal solution of problem (3). Define
$$I_{\varepsilon,\kappa} = \Big\{ i = 1, \ldots, n : \mu_i \geq \frac{\varepsilon^2}{\kappa}\, r_i(K) \Big\}, \qquad J_{\varepsilon,\kappa} = \Big\{ j = 1, \ldots, m : \nu_j \geq \varepsilon^2\kappa\, c_j(K) \Big\}. \qquad (4)$$
Then one has $e^{u^\star_i} = \varepsilon\kappa^{-1}$ for all $i \in I^c_{\varepsilon,\kappa}$ and $e^{v^\star_j} = \varepsilon\kappa$ for all $j \in J^c_{\varepsilon,\kappa}$.

The proof of Lemma 1 is postponed to the supplementary material. It is worth noting that the first-order optimality conditions applied to $(u^\star, v^\star)$ ensure that if $e^{u^\star_i} > \varepsilon\kappa^{-1}$ then $e^{u^\star_i}(K e^{v^\star})_i = \kappa\mu_i$, and if $e^{v^\star_j} > \varepsilon\kappa$ then $e^{v^\star_j}(K^\top e^{u^\star})_j = \kappa^{-1}\nu_j$; these correspond to the Sinkhorn marginal conditions [33] up to the scaling factor $\kappa$.
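Lemma 1 makes the screening step a cheap preprocessing test. As a minimal sketch (our own illustration, not the paper's code; the helper name `screened_sets` and the toy data are assumptions), the sets $I_{\varepsilon,\kappa}$ and $J_{\varepsilon,\kappa}$ reduce to vectorized comparisons against the row and column sums of $K$:

```python
import numpy as np

def screened_sets(K, mu, nu, eps, kappa):
    """Active index sets of the Lemma-1 test.

    I keeps rows with    mu_i >= (eps^2 / kappa) * r_i(K);
    J keeps columns with nu_j >=  eps^2 * kappa  * c_j(K).
    The complements are fixed at the bounds eps/kappa and eps*kappa.
    """
    r = K.sum(axis=1)                # row sums r(K)
    c = K.sum(axis=0)                # column sums c(K)
    I = np.where(mu >= (eps**2 / kappa) * r)[0]
    J = np.where(nu >= (eps**2 * kappa) * c)[0]
    return I, J

# Tiny example on a uniform kernel: with eps = 0 nothing is screened.
K = np.full((4, 3), 0.5)
mu = np.full(4, 1 / 4)
nu = np.full(3, 1 / 3)
I, J = screened_sets(K, mu, nu, eps=0.0, kappa=1.0)
```

With $\varepsilon = 0$ both tests are vacuous and all points stay active, consistent with problem (3) reducing to problem (2) when $\varepsilon = 0$ and $\kappa = 1$.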
Screening with a fixed number budget of points. The approximate dual of the Sinkhorn divergence is defined with respect to $\varepsilon$ and $\kappa$. As those parameters are difficult to interpret, we exhibit their relation to a fixed number budget of points from the supports of $\mu$ and $\nu$. In the sequel, we denote by $n_b \in \{1, \ldots, n\}$ and $m_b \in \{1, \ldots, m\}$ the numbers of points that are going to be optimized in problem (3), i.e., the points for which we cannot guarantee that $e^{u^\star_i} = \varepsilon\kappa^{-1}$ and $e^{v^\star_j} = \varepsilon\kappa$.

Figure 2: Plots of $\varepsilon$ and $\kappa$ as a function of the number budget of points for a screening test with $n_b = m_b$ and the parameters $\mu, \nu, C$ set as in Figure 1. $(\varepsilon, \kappa)$ tends to $(0, 1)$ as $(n_b, m_b)$ tends to $(n, m)$.

Let us define $\xi \in \mathbb{R}^n$ and $\zeta \in \mathbb{R}^m$ to be the decreasingly ordered versions of $\mu \oslash r(K)$ and $\nu \oslash c(K)$ respectively, that is, $\xi_1 \geq \xi_2 \geq \cdots \geq \xi_n$ and $\zeta_1 \geq \zeta_2 \geq \cdots \geq \zeta_m$. To keep only an $n_b$-budget and an $m_b$-budget of points, the parameters $\varepsilon$ and $\kappa$ must satisfy $\varepsilon^2\kappa^{-1} = \xi_{n_b}$ and $\varepsilon^2\kappa = \zeta_{m_b}$. Hence
$$\varepsilon = (\xi_{n_b}\,\zeta_{m_b})^{1/4} \quad\text{and}\quad \kappa = \sqrt{\frac{\zeta_{m_b}}{\xi_{n_b}}}. \qquad (5)$$
This guarantees that $|I_{\varepsilon,\kappa}| = n_b$ and $|J_{\varepsilon,\kappa}| = m_b$ by construction. In addition, $(n_b, m_b)$ tends to the full number budget of points $(n, m)$ when the couple of parameters $(\varepsilon, \kappa)$ converges to $(0, 1)$. In Figure 2 we plot these convergences; accordingly, the objective in problem (3) converges to the objective of the dual of the Sinkhorn divergence (2).

We are now in position to formulate the optimization problem related to the screened dual of Sinkhorn. Indeed, using the above analyses, any solution $(u, v)$ of problem (3) satisfies $e^{u_i} > \varepsilon\kappa^{-1}$ and $e^{v_j} > \varepsilon\kappa$ for all $(i,j) \in (I_{\varepsilon,\kappa}, J_{\varepsilon,\kappa})$, and $e^{u_i} = \varepsilon\kappa^{-1}$ and $e^{v_j} = \varepsilon\kappa$ for all $(i,j) \in (I^c_{\varepsilon,\kappa}, J^c_{\varepsilon,\kappa})$. Hence, we can restrict problem (3) to the variables in $I_{\varepsilon,\kappa}$ and $J_{\varepsilon,\kappa}$. This boils down to restricting the feasibility constraints $\mathcal{C}^n_{\varepsilon\kappa^{-1}} \times \mathcal{C}^m_{\varepsilon\kappa}$ to the screened domain $\mathcal{U}^{sc} \times \mathcal{V}^{sc}$, defined by
$$\mathcal{U}^{sc} = \{u \in \mathbb{R}^{n_b} : e^{u_{I_{\varepsilon,\kappa}}} \succeq \varepsilon\kappa^{-1}\mathbf{1}_{n_b}\} \quad\text{and}\quad \mathcal{V}^{sc} = \{v \in \mathbb{R}^{m_b} : e^{v_{J_{\varepsilon,\kappa}}} \succeq \varepsilon\kappa\,\mathbf{1}_{m_b}\},$$
where the vector comparison $\succeq$ has to be understood elementwise. Then, replacing in Equation (3) the variables belonging to $(I^c_{\varepsilon,\kappa}, J^c_{\varepsilon,\kappa})$ by $\varepsilon\kappa^{-1}$ and $\varepsilon\kappa$, we derive the screened dual of Sinkhorn divergence problem
$$\min_{u \in \mathcal{U}^{sc},\, v \in \mathcal{V}^{sc}} \Psi_{\varepsilon,\kappa}(u,v), \qquad (6)$$
where
$$\Psi_{\varepsilon,\kappa}(u,v) = (e^{u_{I_{\varepsilon,\kappa}}})^\top K_{(I_{\varepsilon,\kappa}, J_{\varepsilon,\kappa})}\, e^{v_{J_{\varepsilon,\kappa}}} + \varepsilon\kappa\,(e^{u_{I_{\varepsilon,\kappa}}})^\top K_{(I_{\varepsilon,\kappa}, J^c_{\varepsilon,\kappa})}\mathbf{1}_{m-m_b} + \varepsilon\kappa^{-1}\,\mathbf{1}_{n-n_b}^\top K_{(I^c_{\varepsilon,\kappa}, J_{\varepsilon,\kappa})}\, e^{v_{J_{\varepsilon,\kappa}}} - \kappa\langle \mu_{I_{\varepsilon,\kappa}}, u_{I_{\varepsilon,\kappa}} \rangle - \kappa^{-1}\langle \nu_{J_{\varepsilon,\kappa}}, v_{J_{\varepsilon,\kappa}} \rangle + \Xi_{\varepsilon,\kappa},$$
with the constant $\Xi_{\varepsilon,\kappa} = \varepsilon^2 \sum_{i \in I^c_{\varepsilon,\kappa}} \sum_{j \in J^c_{\varepsilon,\kappa}} K_{ij} - \kappa\log(\varepsilon\kappa^{-1}) \sum_{i \in I^c_{\varepsilon,\kappa}} \mu_i - \kappa^{-1}\log(\varepsilon\kappa) \sum_{j \in J^c_{\varepsilon,\kappa}} \nu_j$.

The above problem uses only the restricted parts $K_{(I_{\varepsilon,\kappa}, J_{\varepsilon,\kappa})}$, $K_{(I_{\varepsilon,\kappa}, J^c_{\varepsilon,\kappa})}$ and $K_{(I^c_{\varepsilon,\kappa}, J_{\varepsilon,\kappa})}$ of the Gibbs kernel $K$ for calculating the objective function $\Psi_{\varepsilon,\kappa}$. Hence, a gradient descent scheme will also need only those rows/columns of $K$. This is in contrast to the Sinkhorn algorithm, which performs alternating updates of all rows and columns of $K$.

In summary, SCREENKHORN consists of two steps: the first one is a screening pre-processing providing the active sets $I_{\varepsilon,\kappa}$, $J_{\varepsilon,\kappa}$; the second one consists in solving Equation (6) using a constrained L-BFGS-B [9] for the stacked variable $\theta = (u_{I_{\varepsilon,\kappa}}, v_{J_{\varepsilon,\kappa}})$. Pseudocode of our proposed algorithm is shown in Algorithm 1. Note that in practice we initialize the L-BFGS-B algorithm with the output of a method called RESTRICTEDSINKHORN (see Algorithm 2 in the supplementary), which is a Sinkhorn-like algorithm applied to the active dual variables $\theta = (u_{I_{\varepsilon,\kappa}}, v_{J_{\varepsilon,\kappa}})$. While simple and efficient, the solution of this RESTRICTEDSINKHORN algorithm does not satisfy the lower-bound constraints of problem (6), but it provides a good candidate solution.
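To make the solver step concrete, the following is a minimal sketch (our own, not the authors' implementation, which optimizes the screened objective $\Psi_{\varepsilon,\kappa}$ of Equation (6)) of minimizing the simpler dual objective of (2) under elementwise lower bounds with SciPy's L-BFGS-B; the gradient follows from $\nabla_u \Psi = e^u \odot (K e^v) - \mu$ and $\nabla_v \Psi = e^v \odot (K^\top e^u) - \nu$:

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual_lbfgsb(K, mu, nu, lb_u, lb_v):
    """Minimize Psi(u,v) = 1^T diag(e^u) K diag(e^v) 1 - <u,mu> - <v,nu>
    subject to elementwise lower bounds u >= lb_u and v >= lb_v."""
    n, m = K.shape

    def psi_and_grad(theta):
        u, v = theta[:n], theta[n:]
        a, b = np.exp(u), np.exp(v)
        Kb, Ka = K @ b, K.T @ a
        val = a @ Kb - u @ mu - v @ nu
        grad = np.concatenate([a * Kb - mu, b * Ka - nu])
        return val, grad

    bounds = [(lb_u, None)] * n + [(lb_v, None)] * m
    res = minimize(psi_and_grad, np.zeros(n + m), jac=True,
                   method="L-BFGS-B", bounds=bounds)
    u, v = res.x[:n], res.x[n:]
    return np.exp(u)[:, None] * K * np.exp(v)[None, :]  # transport plan

# Tiny demo with inactive bounds, i.e., the plain dual (2).
rng = np.random.default_rng(1)
C = rng.random((5, 4))
K = np.exp(-C / 0.5)
mu = np.full(5, 1 / 5)
nu = np.full(4, 1 / 4)
P = solve_dual_lbfgsb(K, mu, nu, lb_u=-np.inf, lb_v=-np.inf)
```

Positive thresholds would enter through `lb_u = log(eps/kappa)` and `lb_v = log(eps*kappa)`; with inactive bounds the recovered plan satisfies the Sinkhorn marginal conditions.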
Also note that L-BFGS-B handles box constraints on the variables, but it becomes more efficient when these box bounds are carefully determined for problem (6). The following proposition (proof in the supplementary material) expresses these bounds, which are pre-calculated in the initialization step of SCREENKHORN.

Proposition 1. Let $(u^{sc}, v^{sc})$ be an optimal pair solution of problem (6) and $K_{\min} = \min_{i \in I_{\varepsilon,\kappa},\, j \in J_{\varepsilon,\kappa}} K_{ij}$. Then one has
$$\frac{\min_{i \in I_{\varepsilon,\kappa}} \mu_i}{\varepsilon(m - m_b) + \frac{\max_{j \in J_{\varepsilon,\kappa}} \nu_j}{\varepsilon\kappa\, n K_{\min}}\, m_b} \;\leq\; e^{u^{sc}_i} \;\leq\; \frac{\max_{i \in I_{\varepsilon,\kappa}} \mu_i}{\varepsilon\, m K_{\min}}, \qquad (7)$$
and
$$\frac{\min_{j \in J_{\varepsilon,\kappa}} \nu_j}{\varepsilon(n - n_b) + \frac{\kappa \max_{i \in I_{\varepsilon,\kappa}} \mu_i}{\varepsilon\, m K_{\min}}\, n_b} \;\leq\; e^{v^{sc}_j} \;\leq\; \frac{\max_{j \in J_{\varepsilon,\kappa}} \nu_j}{\varepsilon\, n K_{\min}}, \qquad (8)$$
for all $i \in I_{\varepsilon,\kappa}$ and $j \in J_{\varepsilon,\kappa}$.

4 Theoretical analysis and guarantees

This section is devoted to establishing theoretical guarantees for the SCREENKHORN algorithm. We first define the screened marginals $\mu^{sc} = B(u^{sc}, v^{sc})\mathbf{1}_m$ and $\nu^{sc} = B(u^{sc}, v^{sc})^\top\mathbf{1}_n$. Our first theoretical result, Proposition 2, gives an upper bound on the screened marginal violations with respect to the $\ell_1$-norm.

Proposition 2. Let $(u^{sc}, v^{sc})$ be an optimal pair solution of problem (6). Then one has
$$\|\mu^{sc} - \mu\|_1^2 = O\bigg( n_b\, c_{\bar\kappa} + (n - n_b)\bigg( \|C\|_\infty + \frac{m_b}{nm\, c_{\bar\kappa}\, K_{\min}^{3/2}} + \frac{m - m_b}{nm\, K_{\min}} + \log\Big(\frac{nm}{m_b\, c_{\bar\kappa}^{5/2}}\Big) \bigg) \bigg) \qquad (9)$$
and
$$\|\nu^{sc} - \nu\|_1^2 = O\bigg( m_b\, c_{\bar\kappa^{-1}} + (m - m_b)\bigg( \|C\|_\infty + \frac{n_b}{nm\, c_{\bar\kappa^{-1}}\, K_{\min}^{3/2}} + \frac{n - n_b}{nm\, K_{\min}} + \log\Big(\frac{nm}{n_b\, c_{\bar\kappa^{-1}}^{5/2}}\Big) \bigg) \bigg), \qquad (10)$$
where $c_z = z - \log z - 1$ for $z > 0$ and $\bar\kappa = \kappa\,\underline{\nu}/\underline{\mu}$, with $\underline{\mu} = \min_{i \in I_{\varepsilon,\kappa}} \mu_i$ and $\underline{\nu} = \min_{j \in J_{\varepsilon,\kappa}} \nu_j$.

Algorithm 1: SCREENKHORN$(C, n_b, m_b)$

Step 1: screening pre-processing
1. $\xi \leftarrow \mathrm{sort}(\mu \oslash r(K))$, $\zeta \leftarrow \mathrm{sort}(\nu \oslash c(K))$ (decreasing order);
2. $\varepsilon \leftarrow (\xi_{n_b}\zeta_{m_b})^{1/4}$, $\kappa \leftarrow \sqrt{\zeta_{m_b}/\xi_{n_b}}$;
3. $I_{\varepsilon,\kappa} \leftarrow \{i = 1, \ldots, n : \mu_i \geq \varepsilon^2\kappa^{-1} r_i(K)\}$, $J_{\varepsilon,\kappa} \leftarrow \{j = 1, \ldots, m : \nu_j \geq \varepsilon^2\kappa\, c_j(K)\}$;
4. $\underline{\mu} \leftarrow \min_{i \in I_{\varepsilon,\kappa}} \mu_i$, $\bar{\mu} \leftarrow \max_{i \in I_{\varepsilon,\kappa}} \mu_i$, $\underline{\nu} \leftarrow \min_{j \in J_{\varepsilon,\kappa}} \nu_j$, $\bar{\nu} \leftarrow \max_{j \in J_{\varepsilon,\kappa}} \nu_j$;
5. $\underline{u} \leftarrow \log\Big(\underline{\mu} \big/ \big(\varepsilon(m - m_b) + \tfrac{\bar{\nu}}{\varepsilon\kappa\, n K_{\min}} m_b\big)\Big)$, $\bar{u} \leftarrow \log\big(\bar{\mu}/(\varepsilon\, m K_{\min})\big)$;
6. $\underline{v} \leftarrow \log\Big(\underline{\nu} \big/ \big(\varepsilon(n - n_b) + \tfrac{\kappa\bar{\mu}}{\varepsilon\, m K_{\min}} n_b\big)\Big)$, $\bar{v} \leftarrow \log\big(\bar{\nu}/(\varepsilon\, n K_{\min})\big)$;
7. $\underline{\theta} \leftarrow \mathrm{stack}(\underline{u}\mathbf{1}_{n_b}, \underline{v}\mathbf{1}_{m_b})$, $\bar{\theta} \leftarrow \mathrm{stack}(\bar{u}\mathbf{1}_{n_b}, \bar{v}\mathbf{1}_{m_b})$;

Step 2: L-BFGS-B solver on the screened variables
8. $u^{(0)} \leftarrow \log(\varepsilon\kappa^{-1})\mathbf{1}_{n_b}$, $v^{(0)} \leftarrow \log(\varepsilon\kappa)\mathbf{1}_{m_b}$;
9. $(\tilde{u}, \tilde{v}) \leftarrow \mathrm{RESTRICTEDSINKHORN}(u^{(0)}, v^{(0)})$;
10. $\theta^{(0)} \leftarrow \mathrm{stack}(\tilde{u}, \tilde{v})$;
11. $\theta \leftarrow \text{L-BFGS-B}(\theta^{(0)}, \underline{\theta}, \bar{\theta})$;
12. $u \leftarrow (\theta_1, \ldots, \theta_{n_b})$, $v \leftarrow (\theta_{n_b+1}, \ldots, \theta_{n_b+m_b})$;
13. $u^{sc}_i \leftarrow u_i$ if $i \in I_{\varepsilon,\kappa}$, and $u^{sc}_i \leftarrow \log(\varepsilon\kappa^{-1})$ if $i \in I^c_{\varepsilon,\kappa}$;
14. $v^{sc}_j \leftarrow v_j$ if $j \in J_{\varepsilon,\kappa}$, and $v^{sc}_j \leftarrow \log(\varepsilon\kappa)$ if $j \in J^c_{\varepsilon,\kappa}$.
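Since Algorithm 2 is only given in the supplementary material, the following is a plausible minimal sketch of the Sinkhorn-like warm start of step 9, restricted to the active kernel block $K_{(I,J)}$ (our own reading; the actual RESTRICTEDSINKHORN also accounts for the mass of the screened rows and columns, which this simplified version ignores):

```python
import numpy as np

def restricted_sinkhorn(K_IJ, mu_I, nu_J, n_iter=5):
    """Sinkhorn-like iterations restricted to the active rows I and
    columns J; used only as a warm start, so a few iterations suffice.
    Returns the stacked dual iterate theta = (u_I, v_J)."""
    a = np.ones(K_IJ.shape[0])
    b = np.ones(K_IJ.shape[1])
    for _ in range(n_iter):
        a = mu_I / (K_IJ @ b)
        b = nu_J / (K_IJ.T @ a)
    return np.concatenate([np.log(a), np.log(b)])

# Tiny demo on a uniform 3x3 restricted kernel.
K_IJ = np.full((3, 3), 0.5)
theta0 = restricted_sinkhorn(K_IJ, np.full(3, 1 / 3), np.full(3, 1 / 3))
```

The output `theta0` is then clipped into the box $[\underline{\theta}, \bar{\theta}]$ by L-BFGS-B itself, since, as noted above, this warm start need not satisfy the lower-bound constraints of problem (6).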