Learning Nonsymmetric Determinantal Point Processes

Mike Gartrell, Criteo AI Lab (m.gartrell)
Victor-Emmanuel Brunel, ENSAE ParisTech (victor.emmanuel.brunel@ensae.fr)
Elvis Dohmatob, Criteo AI Lab (e.dohmatob)
Syrine Krichene, Criteo AI Lab (syrinekrichene)
Currently at Google.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Determinantal point processes (DPPs) have attracted substantial attention as an elegant probabilistic model that captures the balance between quality and diversity within sets. DPPs are conventionally parameterized by a positive semi-definite kernel matrix, and this symmetric kernel encodes only repulsive interactions between items. These so-called symmetric DPPs have significant expressive power, and have been successfully applied to a variety of machine learning tasks, including recommendation systems, information retrieval, and automatic summarization, among many others. Efficient algorithms for learning symmetric DPPs and sampling from these models have been reasonably well studied. However, relatively little attention has been given to nonsymmetric DPPs, which relax the symmetric constraint on the kernel. Nonsymmetric DPPs allow for both repulsive and attractive item interactions, which can significantly improve modeling power, resulting in a model that may better fit some applications. We present a method that enables a tractable algorithm, based on maximum likelihood estimation, for learning nonsymmetric DPPs from data composed of observed subsets. Our method imposes a particular decomposition of the nonsymmetric kernel that enables such tractable learning algorithms, which we analyze both theoretically and experimentally. We evaluate our model on synthetic and real-world datasets, demonstrating improved predictive performance compared to symmetric DPPs, which have previously shown strong performance on modeling tasks associated with these datasets.

1 Introduction

Determinantal point processes (DPPs) have attracted growing attention from the machine learning community as an elegant probabilistic model for the relationship between items within observed subsets, drawn from a large collection of items.
DPPs have been well studied for their theoretical properties [1, 4, 9, 13, 18, 20, 21], and have been applied to numerous machine learning applications, including document summarization [7, 24], recommender systems [11], object retrieval [1], sensor placement [17], information retrieval [19], and minibatch selection [29]. Efficient algorithms for DPP learning [10, 12, 14, 25, 26] and sampling [2, 22, 27] have been reasonably well studied.

DPPs are conventionally parameterized by a positive semi-definite (PSD) kernel matrix, and due to this symmetric kernel, they are able to encode only repulsive interactions between items. Despite this limitation, symmetric DPPs have significant expressive power, and have proven effective in the aforementioned applications. However, the ability to encode only repulsive interactions, or negative correlations between pairs of items, does have important limitations in some settings. For example, consider the case of a recommender system for a shopping website, where the task is to provide good recommendations for items to complete a user's shopping basket prior to checkout. For models that can only encode negative correlations, such as the symmetric DPP, it is impossible to directly encode positive interactions between items; e.g., a purchased basket containing a video game console would be more likely to also contain a game controller. One way to resolve this limitation is to consider nonsymmetric DPPs, which relax the symmetric constraint on the kernel. Nonsymmetric DPPs allow the model to encode both repulsive and attractive item interactions, which can significantly improve modeling power. With one notable exception [5], little attention has been given to nonsymmetric DPPs within the machine learning community. We present a method for learning fully nonsymmetric DPP kernels from data composed of observed subsets, where we leverage a low-rank decomposition of the nonsymmetric kernel that enables a tractable learning algorithm based on maximum likelihood estimation (MLE).
Contributions. Our work makes the following contributions:

- We present a decomposition of the nonsymmetric DPP kernel that enables a tractable MLE-based learning algorithm. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs.
- We present a general framework for the theoretical analysis of the properties of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs, which shows that this estimator has particular statistical guarantees regarding consistency.
- Through an extensive experimental evaluation on several synthetic and real-world datasets, we highlight the significant improvements in modeling power that nonsymmetric DPPs provide in comparison to symmetric DPPs. We see that nonsymmetric DPPs are more effective at recovering correlation structure within data, particularly for data that contains large disjoint collections of items.

Unlike previous work on signed DPPs [5], our work does not make the very limiting assumption that the correlation kernel of the DPP is symmetric in the absolute values of its entries, which gives our model much more flexibility. Moreover, our learning algorithm, based on maximum likelihood estimation, allows us to leverage a low-rank assumption on the kernel, while the method of moments used in [5] does not seem to. Finally, the learning algorithm in [5] has computational complexity of $O(M^6)$, where $M$ is the size of the ground set (e.g., the item catalog), making it computationally infeasible for most practical scenarios. In contrast, our learning algorithm has a substantially lower time complexity of $O(M^3)$, which allows our approach to be used on many real-world datasets.
2 Background

A DPP models a distribution over subsets of a finite ground set $\mathcal{Y}$ that is parametrized by a matrix $L \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$, such that for any $J \subseteq \mathcal{Y}$,

$$\Pr(J) \propto \det(L_J), \tag{1}$$

where $L_J = [L_{ij}]_{i,j \in J}$ is the submatrix of $L$ indexed by $J$. Since the normalization constant for Eq. (1) follows from the observation that $\sum_{J \subseteq \mathcal{Y}} \det(L_J) = \det(L + I)$, we have, for all $J \subseteq \mathcal{Y}$,

$$P_L(J) = \frac{\det(L_J)}{\det(L + I)}. \tag{2}$$
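As a concrete illustration of Eqs. (1) and (2), the following minimal numpy sketch (ours, not from the paper; the helper name `dpp_prob` is hypothetical) computes $P_L(J)$ on a small ground set and verifies the normalization identity $\sum_{J \subseteq \mathcal{Y}} \det(L_J) = \det(L + I)$ by brute-force enumeration.

```python
import numpy as np
from itertools import combinations

def dpp_prob(L, J):
    """P_L(J) = det(L_J) / det(L + I), as in Eq. (2)."""
    M = L.shape[0]
    return np.linalg.det(L[np.ix_(J, J)]) / np.linalg.det(L + np.eye(M))

rng = np.random.default_rng(0)
M = 4
G = rng.standard_normal((M, M))
L = G @ G.T  # a PSD, hence admissible, kernel

# The normalization identity implies that the probabilities of all
# 2^M subsets sum to one.
total = sum(dpp_prob(L, list(J))
            for r in range(M + 1) for J in combinations(range(M), r))
print(np.isclose(total, 1.0))  # True
```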
Without loss of generality, we will assume that $\mathcal{Y} = \{1, 2, \ldots, M\}$, which we also denote by $[M]$, where $M \geq 1$ is the cardinality of $\mathcal{Y}$. It is common to assume that $L$ is a positive semi-definite matrix in order to ensure that $P_L$ defines a probability distribution on the power set of $[M]$ [20]. More generally, any matrix $L$ whose principal minors $\det(L_J)$, $J \subseteq [M]$, are nonnegative is admissible for defining a probability distribution as in (2) [5]; such matrices are called $P_0$-matrices. Recall that any matrix $L$ can be decomposed uniquely as the sum of a symmetric matrix $S$ and a skew-symmetric matrix $A$: namely, $S = \frac{L + L^\top}{2}$, whereas $A = \frac{L - L^\top}{2}$. The following lemma gives a simple sufficient condition on $S$ for $L$ to be a $P_0$-matrix.

Lemma 1. Let $L \in \mathbb{R}^{M \times M}$ be an arbitrary matrix. If $L + L^\top$ is PSD, then $L$ is a $P_0$-matrix.

An important consequence is that a matrix of the form $D + A$, where $D$ is diagonal with positive diagonal entries and $A$ is skew-symmetric, is a $P_0$-matrix. Such a matrix would only capture nonnegative correlations, as explained in the next section.
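Lemma 1 is easy to exercise numerically on small examples. The sketch below (our illustration, not from the paper) builds $L = D + A$ with positive diagonal $D$ and skew-symmetric $A$, and checks the $P_0$ property directly by enumerating all principal minors.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
M = 5
# L = D + A with positive diagonal D and skew-symmetric A, so that
# L + L^T = 2D is PSD and Lemma 1 applies.
D = np.diag(rng.uniform(0.5, 2.0, size=M))
G = rng.standard_normal((M, M))
A = G - G.T
L = D + A

# Verify the P0 property: every principal minor is nonnegative.
print(all(np.linalg.det(L[np.ix_(J, J)]) >= -1e-10
          for r in range(1, M + 1)
          for J in combinations(range(M), r)))  # True
```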
2.1 Capturing Positive and Negative Correlations

When DPPs are used to model real data, they are often formulated in terms of the $L$ matrix as described above, called an $L$-ensemble. However, DPPs can be alternatively represented in terms of the $M \times M$ matrix $K$, where $K = I - (L + I)^{-1}$. Using the $K$ representation,

$$\Pr(J \subseteq Y) = \det(K_J), \tag{3}$$

where $Y$ is a random subset drawn from $P$. $K$ is called the marginal kernel; since here we are defining marginal probabilities, which don't need to sum to 1, no normalization constant is needed.
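The relationship between $L$ and $K$ can also be checked numerically. The following sketch (ours) computes $K = I - (L + I)^{-1}$ and confirms Eq. (3) on a small ground set by comparing $\det(K_J)$ against the brute-force marginal probability, i.e., the total probability of all supersets of $J$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
M = 4
G = rng.standard_normal((M, M))
L = G @ G.T
K = np.eye(M) - np.linalg.inv(L + np.eye(M))  # marginal kernel

def p_L(J):
    """P_L(J) from Eq. (2)."""
    return np.linalg.det(L[np.ix_(J, J)]) / np.linalg.det(L + np.eye(M))

# Pr(J ⊆ Y) is the total probability of all supersets of J; by Eq. (3)
# it should equal det(K_J).
J = [0, 2]
others = [i for i in range(M) if i not in J]
marginal = sum(p_L(J + list(extra))
               for r in range(len(others) + 1)
               for extra in combinations(others, r))
print(np.isclose(marginal, np.linalg.det(K[np.ix_(J, J)])))  # True
```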
DPPs are conventionally parameterized by a PSD $K$ or $L$ matrix, which is symmetric. However, $K$ and $L$ need not be symmetric. As shown in [5], $K$ is admissible if and only if $L$ is a $P_0$-matrix, that is, all of its principal minors are nonnegative. The class of $P_0$-matrices is much larger, and allows us to accommodate nonsymmetric $K$ and $L$ matrices. To enforce the $P_0$ constraint on $L$ during learning, we impose the decomposition of $L$ described in Section 4. Since we see as a consequence of Lemma 1 that the sum of a PSD matrix and a skew-symmetric matrix is a $P_0$-matrix, this allows us to support nonsymmetric kernels while ensuring that $L$ is a $P_0$-matrix. As we will see in the following, there are significant advantages to accommodating nonsymmetric kernels in terms of modeling power.

As shown in [20], the eigenvalues of $K$ are bounded above by one, while $L$ need only be a PSD or $P_0$-matrix. Furthermore, $K$ gives the marginal probabilities of subsets, while $L$ directly models the atomic probabilities of each observed subset of $\mathcal{Y}$. For these reasons, most work on learning DPPs from data uses the $L$ representation of a DPP.

If $J = \{i\}$ is a singleton set, then $\Pr(i \in Y) = K_{ii}$. The diagonal entries of $K$ therefore directly correspond to the marginal inclusion probabilities for each element of $\mathcal{Y}$.
If $J = \{i, j\}$ is a set containing two elements, then we have

$$\Pr(i, j \in Y) = \begin{vmatrix} K_{ii} & K_{ij} \\ K_{ji} & K_{jj} \end{vmatrix} = K_{ii} K_{jj} - K_{ij} K_{ji}. \tag{4}$$

Therefore, the off-diagonal elements determine the correlations between pairs of items; that is, $\mathrm{cov}(\mathbf{1}_{i \in Y}, \mathbf{1}_{j \in Y}) = -K_{ij} K_{ji}$.

For a symmetric $K$, the signs and magnitudes of $K_{ij}$ and $K_{ji}$ are the same, resulting in $\mathrm{cov}(\mathbf{1}_{i \in Y}, \mathbf{1}_{j \in Y}) = -K_{ij}^2 \leq 0$. We see that in this case the off-diagonal elements represent negative correlations between pairs of items, where a larger value of $K_{ij}$ leads to a lower probability of $i$ and $j$ co-occurring, while a smaller value of $K_{ij}$ indicates a higher co-occurrence probability. If $K_{ij} = 0$, then there is no correlation between this pair of items. Since the $-K_{ij}^2$ term is always nonpositive, the symmetric model is able to capture only nonpositive correlations between items. In fact, symmetric DPPs induce a strong negative dependence between items, called negative association [3].

For a nonsymmetric $K$, the signs and magnitudes of $K_{ij}$ and $K_{ji}$ may differ; when the signs differ, $\mathrm{cov}(\mathbf{1}_{i \in Y}, \mathbf{1}_{j \in Y}) = -K_{ij} K_{ji} \geq 0$. In this case, the off-diagonal elements represent positive correlations between pairs of items, where a larger value of $-K_{ij} K_{ji}$ leads to a higher probability of $i$ and $j$ co-occurring, while a smaller value indicates a lower co-occurrence probability. Of course, the signs of the off-diagonal elements for some pairs $(i, j)$ may be the same in a nonsymmetric $K$, which allows the model to also capture negative correlations. Therefore, a nonsymmetric $K$ can capture both negative and positive correlations between pairs of items.
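A minimal $2 \times 2$ example (ours) makes the sign analysis concrete: choosing $K_{ij}$ and $K_{ji}$ with opposite signs yields a positive covariance that no symmetric kernel can produce.

```python
import numpy as np

# A 2x2 marginal kernel whose off-diagonal entries have opposite signs.
K = np.array([[0.5, 0.3],
              [-0.3, 0.5]])

# cov(1_{i in Y}, 1_{j in Y}) = -K_ij K_ji = 0.09 > 0: a positive
# correlation, impossible for symmetric K (where cov = -K_ij^2 <= 0).
print(-K[0, 1] * K[1, 0])  # 0.09

# Pairwise marginal from Eq. (4): the items co-occur more often than
# independence would predict (0.34 vs. 0.5 * 0.5 = 0.25).
print(np.linalg.det(K))  # 0.34

# This K is admissible: L = K (I - K)^{-1} has the P0 form D + A
# (positive diagonal plus skew-symmetric part).
L = K @ np.linalg.inv(np.eye(2) - K)
print(np.round(L, 3))
```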
3 General guarantees in maximum likelihood estimation for DPPs

In this section we define the log-likelihood function and study the Fisher information of the model. The Fisher information controls whether the maximum likelihood estimator, computed on $n$ iid samples, will be $\sqrt{n}$-consistent. When the matrix $L^*$ is not invertible (i.e., if it is only a $P_0$-matrix and not a $P$-matrix, that is, a matrix all of whose principal minors are positive), the support of $P_{L^*}$, defined as the collection of all subsets $J \subseteq [M]$ such that $P_{L^*}(J) \neq 0$, depends on $L^*$, and the Fisher information will not be defined in general. Hence, we will assume in this section that $L^*$ is invertible and that we only maximize the log-likelihood over classes of invertible matrices $L$.

Consider a subset $\Omega$ of the set of all $P$-matrices of size $M$. Given a collection of $n$ observed subsets $Y_1, \ldots, Y_n$ composed of items from $\mathcal{Y} = [M]$, our learning task is to fit a DPP kernel $L$ based on this data.
For all $L \in \Omega$, the log-likelihood is defined as

$$\hat{f}_n(L) = \frac{1}{n} \sum_{i=1}^{n} \log P_L(Y_i) = \sum_{J \subseteq [M]} \hat{p}_J \log P_L(J) = \sum_{J \subseteq [M]} \hat{p}_J \log \det(L_J) - \log \det(L + I), \tag{5}$$

where $\hat{p}_J$ is the proportion of observed samples that equal $J$. Now, assume that $Y_1, \ldots, Y_n$ are iid copies of a DPP with kernel $L^* \in \Omega$. For all $L \in \Omega$, the population log-likelihood is defined as the expectation of $\hat{f}_n(L)$, i.e.,

$$f(L) = \mathbb{E}[\log P_L(Y_1)] = \sum_{J \subseteq [M]} p^*_J \log \det(L_J) - \log \det(L + I), \tag{6}$$

where $p^*_J = \mathbb{E}[\hat{p}_J] = P_{L^*}(J)$. The maximum likelihood estimator (MLE) is defined as a maximizer $\hat{L}$ of $\hat{f}_n(L)$ over the parameter space $\Omega$.
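For reference, the empirical log-likelihood $\hat{f}_n$ of Eq. (5) translates directly into code. The sketch below (ours; `samples` is a hypothetical list of observed index lists) uses `slogdet` for numerical stability and assumes that every observed subset has a positive principal minor, as is the case over classes of $P$-matrices.

```python
import numpy as np

def log_likelihood(L, samples):
    """Empirical log-likelihood f_n(L) from Eq. (5); `samples` is a list
    of observed subsets, each given as a list of item indices."""
    M = L.shape[0]
    log_norm = np.linalg.slogdet(L + np.eye(M))[1]
    total = 0.0
    for J in samples:
        # For a P-matrix kernel, det(L_J) > 0, so the sign component
        # of slogdet can be ignored.
        total += np.linalg.slogdet(L[np.ix_(J, J)])[1] - log_norm
    return total / len(samples)
```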
Since $\hat{L}$ can be viewed as a perturbed version of $L^*$, it is convenient to introduce the space $\mathcal{H}$, defined as the linear subspace of $\mathbb{R}^{M \times M}$ spanned by $\Omega$, and to define the successive derivatives of $\hat{f}_n$ and $f$ as multilinear forms on $\mathcal{H}$. As we will see later on, the complexity of the model can be captured in the size of the space $\mathcal{H}$. The following lemma provides a few examples. We say that a matrix $L$ is a signed matrix if for all $i \neq j$, $L_{i,j} = \epsilon_{i,j} L_{j,i}$ for some $\epsilon_{i,j} \in \{-1, 1\}$.

Lemma 2.
1. If $\Omega$ is the set of all positive definite matrices, it is easy to see that $\mathcal{H}$ is the set of all symmetric matrices.
2. If $\Omega$ is the set of all $P$-matrices, then $\mathcal{H} = \mathbb{R}^{M \times M}$.
3. If $\Omega$ is the collection of signed $P$-matrices, then $\mathcal{H} = \mathbb{R}^{M \times M}$.
4. If $\Omega$ is the set of $P$-matrices of the form $S + A$, where $S$ is a symmetric matrix and $A$ is a skew-symmetric matrix (i.e., $A^\top = -A$), then $\mathcal{H} = \mathbb{R}^{M \times M}$.
5. If $\Omega$ is the set of signed $P$-matrices with known sign pattern (i.e., there exists $(\epsilon_{i,j})_{i<j} \in \{-1, 1\}$ such that for all $L \in \Omega$ and all $i < j$, $L_{j,i} = \epsilon_{i,j} L_{i,j}$), then $\mathcal{H}$ is the collection of all signed matrices with that same sign pattern.

In particular, if $\Omega$ is the set of all $P$-matrices of the form $D + A$, where $D$ is diagonal and $A$ is skew-symmetric, then $\mathcal{H}$ is the collection of all matrices that are the sum of a diagonal and a skew-symmetric matrix.
It is easy to see that the population log-likelihood $f$ is infinitely many times differentiable on the relative interior of $\Omega$, and that for all $L$ in the relative interior of $\Omega$ and $H \in \mathcal{H}$,

$$\mathrm{d}f(L)(H) = \sum_{J \subseteq [M]} p^*_J \, \mathrm{tr}\left(L_J^{-1} H_J\right) - \mathrm{tr}\left((I + L)^{-1} H\right) \tag{7}$$

and

$$\mathrm{d}^2 f(L)(H, H) = -\sum_{J \subseteq [M]} p^*_J \, \mathrm{tr}\left(\left(L_J^{-1} H_J\right)^2\right) + \mathrm{tr}\left(\left((I + L)^{-1} H\right)^2\right). \tag{8}$$
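Expression (7) can be sanity-checked against a finite-difference approximation of the directional derivative of $f$. The following sketch (ours; $p^*_J$ is computed by brute-force enumeration on a tiny ground set) does exactly this.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
M = 3
G = rng.standard_normal((M, M))
Lstar = G @ G.T + np.eye(M)  # ground-truth kernel L*, a P-matrix
I = np.eye(M)

# p*_J = P_{L*}(J) for every subset J, computed by enumeration via Eq. (2).
subsets = [J for r in range(M + 1) for J in combinations(range(M), r)]
Z = np.linalg.det(Lstar + I)
p = {J: np.linalg.det(Lstar[np.ix_(J, J)]) / Z for J in subsets}

def f(L):
    """Population log-likelihood of Eq. (6)."""
    return sum(pJ * (np.linalg.slogdet(L[np.ix_(J, J)])[1]
                     - np.linalg.slogdet(L + I)[1])
               for J, pJ in p.items())

L = Lstar + 0.05 * rng.standard_normal((M, M))  # evaluate away from L*
H = rng.standard_normal((M, M))                 # direction in H = R^{M x M}

# Analytic directional derivative df(L)(H) from Eq. (7); the empty set
# contributes nothing to the first sum.
df = sum(pJ * np.trace(np.linalg.inv(L[np.ix_(J, J)]) @ H[np.ix_(J, J)])
         for J, pJ in p.items() if len(J) > 0)
df -= np.trace(np.linalg.inv(I + L) @ H)

eps = 1e-6
fd = (f(L + eps * H) - f(L - eps * H)) / (2 * eps)
print(np.isclose(df, fd))  # True
```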
Hence, we have the following theorem. The case of symmetric kernels is studied in [6], and the following result is a straightforward extension to arbitrary parameter spaces; for completeness, we include the proof in the appendix. For a set $\Omega \subseteq \mathbb{R}^{N \times N}$, we call the relative interior of $\Omega$ its interior in the linear space spanned by $\Omega$.

Theorem 1. Let $\Omega$ be a set of $P$-matrices and let $L^*$ be in the relative interior of $\Omega$. Then, for all $H \in \mathcal{H}$, $\mathrm{d}f(L^*)(H) = 0$. Moreover, the Fisher information is the negative Hessian of $f$ at $L^*$ and is given by

$$-\mathrm{d}^2 f(L^*)(H, H) = \mathrm{Var}\left[\mathrm{tr}\left((L^*_Y)^{-1} H_Y\right)\right], \tag{9}$$

where $Y$ is a DPP with kernel $L^*$. It follows that the Fisher information is positive definite if and only if any $H \in \mathcal{H}$ that verifies

$$\mathrm{tr}\left((L^*_J)^{-1} H_J\right) = 0, \quad \forall J \subseteq [M] \tag{10}$$

must be $H = 0$.

When $\Omega$ is the space of symmetric and positive definite kernels, the Fisher information is definite if and only if $L^*$ is irreducible, i.e., it is not block-diagonal up to a permutation of its rows and columns [6]. In that case, it is shown that the MLE learns $L^*$ at the speed $n^{-1/2}$. In general, this property fails, and even irreducible kernels can induce a singular Fisher information.
Lemma 3. Let $\Omega$ be a subset of $P$-matrices.
1. Let $L^* \in \Omega$ and $H \in \mathcal{H}$ satisfy (10). Then, for all $i \in [M]$, $H_{i,i} = 0$.
2. Let $i, j \in [M]$ with $i \neq j$. Let $L^* \in \Omega$ be such that $L^*_{i,j} \neq 0$, and let $\mathcal{H}$ satisfy the following property: there exists $\epsilon \neq 0$ such that for all $H \in \mathcal{H}$, $H_{j,i} = \epsilon H_{i,j}$. Then, if $H \in \mathcal{H}$ satisfies (10), $H_{i,j} = H_{j,i} = 0$.
3. Let $L^* \in \Omega$ be block diagonal. Then, any $H \in \mathcal{H}$ supported outside of the diagonal blocks of $L^*$ satisfies (10).

In particular, this lemma implies that if $\Omega$ is a class of signed $P$-matrices with prescribed sign pattern (i.e., $L_{i,j} = \epsilon_{i,j} L_{j,i}$ for all $i \neq j$ and all $L \in \Omega$, where the $\epsilon_{i,j}$'s are $\pm 1$ and do not depend on $L$), then if $L^*$ lies in the relative interior of $\Omega$ and has no zero entries, the Fisher information is definite. In the symmetric case, it is shown in [6] that the only matrices $H$ satisfying (10) must be supported off the diagonal blocks of $L^*$, i.e., the third part of Lemma 3 is an equivalence. In the appendix, we provide a few very simple counterexamples that show that this equivalence is no longer valid in the nonsymmetric case.
4 Model

To add support for positive correlations to the DPP, we consider nonsymmetric $L$ matrices. In particular, our approach involves incorporating a skew-symmetric perturbation to the PSD $L$. Recall that any matrix $L$ can be uniquely decomposed as $L = S + A$, where $S$ is symmetric and $A$ is skew-symmetric. We impose a decomposition on $A$ as $A = B C^\top - C B^\top$, where $B$ and $C$ are low-rank $M \times D'$ matrices, and we use a low-rank factorization of $S$, $S = V V^\top$, where $V$ is a low-rank $M \times D$ matrix, as described in [12]. This also allows us to enforce $S$ to be PSD and hence, by Lemma 1, $L$ to be a $P_0$-matrix.
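In code, the proposed decomposition amounts to a few matrix products. The sketch below (ours; the dimensions and the helper name `build_kernel` are illustrative) assembles $L = V V^\top + (B C^\top - C B^\top)$ and confirms that its symmetric part is $V V^\top$, which is PSD, so Lemma 1 applies.

```python
import numpy as np

def build_kernel(V, B, C):
    """L = V V^T + (B C^T - C B^T): a PSD symmetric part plus a
    skew-symmetric part, hence a P0-matrix by Lemma 1."""
    return V @ V.T + (B @ C.T - C @ B.T)

rng = np.random.default_rng(4)
M, D, Dprime = 100, 10, 10  # illustrative sizes with D, D' << M
V = rng.standard_normal((M, D))
B = rng.standard_normal((M, Dprime))
C = rng.standard_normal((M, Dprime))
L = build_kernel(V, B, C)

# The symmetric part of L is exactly V V^T, so L + L^T is PSD.
print(np.allclose((L + L.T) / 2, V @ V.T))  # True
```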
We define a regularization term, $R(V, B, C)$, as

$$R(V, B, C) = -\alpha \sum_{i=1}^{M} \frac{1}{\lambda_i} \|v_i\|_2^2 - \beta \sum_{i=1}^{M} \frac{1}{\lambda_i} \|b_i\|_2^2 - \gamma \sum_{i=1}^{M} \frac{1}{\lambda_i} \|c_i\|_2^2, \tag{11}$$

where $\lambda_i$ counts the number of occurrences of item $i$ in the training set; $v_i$, $b_i$, and $c_i$ are the corresponding row vectors of $V$, $B$, and $C$, respectively; and $\alpha, \beta, \gamma \geq 0$ are tunable hyperparameters. This regularization formulation is similar to that proposed in [12]. From the above, we have the full formulation of the regularized log-likelihood of our model:

$$\phi(V, B, C) = \sum_{i=1}^{n} \left[ \log \det\left(V_{Y_i} V_{Y_i}^\top + (B_{Y_i} C_{Y_i}^\top - C_{Y_i} B_{Y_i}^\top)\right) - \log \det\left(V V^\top + (B C^\top - C B^\top) + I\right) \right] + R(V, B, C). \tag{12}$$
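Because Eq. (12) is differentiable in $V$, $B$, and $C$, it can be maximized with any off-the-shelf autograd and optimizer stack. The following PyTorch sketch (ours; names such as `baskets` and `counts` are hypothetical placeholders for the observed subsets $Y_i$ and the item counts $\lambda_i$) implements the negative of the regularized log-likelihood for use with a gradient-based optimizer.

```python
import torch

def neg_objective(V, B, C, baskets, counts, alpha=0.1, beta=0.1, gamma=0.1):
    """Negative regularized log-likelihood -phi(V, B, C) from Eq. (12).
    `baskets`: list of LongTensors of item indices (the observed Y_i);
    `counts`: length-M tensor of item occurrence counts (the lambda_i)."""
    M = V.shape[0]
    L = V @ V.T + (B @ C.T - C @ B.T)
    # Normalization term, shared across all observed subsets.
    log_norm = torch.logdet(L + torch.eye(M))
    # logdet of a principal submatrix; requires det(L_J) > 0 for observed J.
    ll = sum(torch.logdet(L[J][:, J]) - log_norm for J in baskets)
    reg = (alpha * ((V ** 2).sum(1) / counts).sum()
           + beta * ((B ** 2).sum(1) / counts).sum()
           + gamma * ((C ** 2).sum(1) / counts).sum())
    return -(ll - reg)
```

Gradient steps can then be taken with, e.g., `torch.optim.Adam` over leaf tensors `V`, `B`, `C` created with `requires_grad=True`; each step costs $O(M^3)$ due to the log-determinant of the full kernel, matching the complexity discussion that follows.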
The computational complexity of evaluating Eq. (12) is dominated by computing the determinant in the second term (the normalization constant), which is $O(M^3)$. Furthermore, since $\frac{\partial}{\partial L_{ij}} \log \det(L) = \mathrm{tr}\left(L^{-1} \frac{\partial L}{\partial L_{ij}}\right)$, the computational complexity of computing the gradient of Eq. (12) during learning is dominated by computing the matrix inverse in the gradient of the second term, $(L + I)^{-1}$, which is $O(M^3)$. Therefore, we see that the low-rank decomposition of the kernel in our nonsymmetric model does not afford any improvement over a full-rank model in terms of computational complexity. However, our low-rank decomposition does provide savings in terms of the memory required to store model parameters, since our low-rank model has space complexity $O(MD + 2MD')$, while a full-rank version of this nonsymmetric model has space complexity $O(M^2 + 2M^2)$. When $D \ll M$ and $D' \ll M$, which is typical in many settings, this results in significant space savings.

5 Experiments