
Information Retrieval Models
PengBo, Oct 30, 2010

Review of the last lecture
- Basic index techniques: inverted index, dictionary.
- Query-likelihood example. Query: "revenue down".
  P(Q|d1) = (1/8 + 2/16)/2 × (1/8 + 1/16)/2 = 1/8 × 3/32 = 3/256
  P(Q|d2) = (1/8 + 2/16)/2 × (0 + 1/16)/2 = 1/8 × 1/32 = 1/256
  Ranking: d1 > d2

Alternative Models of Text Generation
- (Figure: a Writer generates a Doc from a Doc Model; a Searcher generates a Query from a Query Model.) Is this the same model?

Retrieval Using Language Models
- Three ways to relate the query model and the document model: (1) query likelihood, (2) document likelihood, (3) model comparison.

Query Likelihood
- Rank by P(Q|Dm).
- The main problem is estimating the document model, i.e. smoothing techniques instead of tf.idf weights.
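The query-likelihood ranking from the recap can be sketched in code. This is a minimal example assuming a 50/50 mixture of document and collection models (the mixture weight matches the worked numbers above); the term probabilities are the ones from the example.

```python
# Query-likelihood ranking with a 50/50 mixture of the document model and
# the collection model (Jelinek-Mercer smoothing with lambda = 0.5).

LAMBDA = 0.5

def query_likelihood(query, doc_probs, coll_probs, lam=LAMBDA):
    """P(Q|d) as a product of smoothed per-term probabilities."""
    p = 1.0
    for w in query:
        p *= lam * doc_probs.get(w, 0.0) + (1 - lam) * coll_probs[w]
    return p

# Term probabilities from the worked example: P(w|d) and P(w|collection).
coll = {"revenue": 2 / 16, "down": 1 / 16}
d1 = {"revenue": 1 / 8, "down": 1 / 8}
d2 = {"revenue": 1 / 8}          # "down" does not occur in d2

q = ["revenue", "down"]
p1 = query_likelihood(q, d1, coll)   # 3/256
p2 = query_likelihood(q, d2, coll)   # 1/256, so d1 ranks above d2
```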

Query Likelihood (cont.)
- Good retrieval effectiveness, e.g. UMass, BBN, Twente, CMU.
- Problem: relevance feedback, query expansion, and structured queries are difficult to handle.

Document Likelihood
- Rank by P(D|R)/P(D|NR).
- P(w|R) is estimated by P(w|Qm), where Qm is the query or relevance model; P(w|NR) is estimated by collection probabilities P(w).
- The problem is estimating the relevance model: treat the query as generated by a mixture of topic and background, and estimate the relevance model from related documents (query expansion).
- Relevance feedback is easily incorporated.
- Good retrieval results, e.g. UMass at SIGIR '01, but inconsistent with heterogeneous document collections.

Model Comparison
- Estimate both the query model and the document model, then compare the models, e.g. by the KL divergence D(Qm||Dm).
- This achieved better results than the previous two approaches.
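As an aside (not from the slides), a minimal sketch of the KL-divergence comparison D(Qm||Dm) between two unigram models; the toy query and document models below are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum over w of P(w) * log(P(w) / Q(w)), taken over the
    support of P. Assumes q[w] > 0 wherever p[w] > 0 (i.e. Q is smoothed)."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

# Toy query and document unigram models over a 3-word vocabulary.
query_model = {"revenue": 0.5, "down": 0.5, "quarter": 0.0}
doc_model   = {"revenue": 0.4, "down": 0.4, "quarter": 0.2}

# Documents are ranked by increasing divergence from the query model.
score = kl_divergence(query_model, doc_model)
```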

Language Models: Pro & Con
- A novel way of looking at the problem of text retrieval, based on probabilistic language modeling.
- Conceptually simple and explanatory.
- A formal mathematical model.
- Natural use of collection statistics, not heuristics (almost).
- LMs provide effective retrieval, and can be improved to the extent that the following conditions can be met: our language models are accurate representations of the data, and users have some sense of term distribution.

Comparison With Vector Space
- There is some connection to traditional tf.idf models:
- (unscaled) term frequency appears directly in the model;
- the probabilities do length normalization of term frequencies;
- the effect of mixing with overall collection frequencies is a little like idf: terms that are rare in the general collection but common in some documents will have a greater influence on the ranking.

Comparison With Vector Space
- Similarities: term weights based on frequency; terms often used as if they were independent; inverse document/collection frequency used; some form of length normalization used.
- Differences: based on probability rather than similarity; intuitions are probabilistic rather than geometric; the details of how document length and term, document, and collection frequency are used differ.

Summary of This Lecture
- Latent Semantic Indexing: singular value decomposition; matrix low-rank approximation.
- Language model: generative model; smoothed probabilities; mixture model.

Resources
- The Template Numerical Toolkit (TNT): /tnt/documentation.html
- The Lemur Toolkit for Language Modeling and Information Retrieval: /lemur/ (CMU/UMass LM and IR system in C(++), currently actively developed.)

Thank You! Q&A

Reading
[1] IIR, Ch. 12 and Ch. 18.
[2] A. Moffat, J. Zobel, and D. Hawking, "Recommended reading for IR research students," SIGIR Forum, vol. 39, pp. 3-14, 2005.

#2 Evaluation

Question (a): there cannot be two or more breakeven points.
Proof: Consider a retrieval I with relevant document set R. Suppose the current point is a breakeven point, with retrieved set A and retrieved relevant documents Ra. Then precision = |Ra|/|A| and recall = |Ra|/|R|; by the definition of a breakeven point, precision = recall, which gives |A| = |R|. Now suppose that after retrieving k more documents (k > 0) another breakeven point occurs, with retrieved set A' and retrieved relevant documents Ra'. Then precision = |Ra'|/|A'| and recall = |Ra'|/|R|, which gives |A'| = |R|. But |A'| = |A| + k with k > 0, while |A| = |R|, a contradiction. Therefore there cannot be two or more breakeven points.
Note: when no relevant document has been retrieved, recall and precision are both zero. Is that a breakeven point? If this case is counted, then there can be two or more breakeven points.

Matrix Low-rank Approximation for LSI

Eigenvalues & Eigenvectors
- For a square m×m matrix S, an eigenvalue λ and a (right) eigenvector v satisfy Sv = λv.
- How many eigenvalues are there at most?
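The defining equation Sv = λv can be checked numerically. A small sketch with an assumed example matrix (not taken from the slides):

```python
import numpy as np

# An assumed 2x2 example matrix, used only to illustrate the definition.
S = np.array([[2.0, 1.0],
              [0.0, 3.0]])

# An m x m matrix has at most m eigenvalues.
eigenvalues, eigenvectors = np.linalg.eig(S)

# Verify S v = lambda v for each eigenpair (eigenvectors are the columns).
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(S @ v, lam * v)
```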

Matrix-vector multiplication
- The example matrix S (not reproduced in this extract) has eigenvalues 3, 2, 0 with corresponding eigenvectors v1, v2, v3.
- Any vector (say x = 2v1 + 4v2 + 6v3) can be viewed as a combination of the eigenvectors.
- Thus a matrix-vector multiplication such as Sx can be rewritten in terms of the eigenvalues/vectors: Sx = S(2v1 + 4v2 + 6v3) = 2λ1v1 + 4λ2v2 + 6λ3v3 = 6v1 + 8v2.
- Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors.
- Suggestion: the effect of "small" eigenvalues is small.

Eigenvalues & Eigenvectors
- For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal.
- All eigenvalues of a real symmetric matrix are real.
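The eigenbasis view of Sx can be sketched as follows. Since the slides' 3×3 matrix is not reproduced in this extract, the code builds an assumed symmetric S with the stated eigenvalues 3, 2, 0:

```python
import numpy as np

# Construct a symmetric S with the stated eigenvalues 3, 2, 0.
# For simplicity the eigenvectors are taken to be the standard basis.
eigvals = np.array([3.0, 2.0, 0.0])
V = np.eye(3)                       # orthonormal eigenvectors as columns
S = V @ np.diag(eigvals) @ V.T

# x written as a combination of the eigenvectors: x = 2 v1 + 4 v2 + 6 v3.
coeffs = np.array([2.0, 4.0, 6.0])
x = V @ coeffs

# S x = sum_i (c_i * lambda_i) v_i: the action of S on x is determined by
# the eigenvalues, and the lambda = 0 component vanishes entirely.
assert np.allclose(S @ x, V @ (coeffs * eigvals))
```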

Example
- Let S be the 2×2 real, symmetric example matrix (elided in this extract). The eigenvalues are 1 and 3 (nonnegative, real). Plug these values in and solve for the eigenvectors; the eigenvectors are orthogonal (and real).

Eigen/diagonal Decomposition
- Let S be a square matrix with m linearly independent eigenvectors.
- Theorem: there exists an eigen decomposition S = U Λ U^-1, where the columns of U are eigenvectors of S and the diagonal elements of Λ are the eigenvalues of S.

Diagonal decomposition: why/how
- Let U have the eigenvectors as columns. Then SU can be written SU = UΛ.
- Thus SU = UΛ, i.e. U^-1 S U = Λ, and S = U Λ U^-1.

Diagonal decomposition: example
- Recall the 2×2 example: the eigenvalues are 1 and 3, and the eigenvectors form the columns of U. Inverting gives U^-1; then S = U Λ U^-1 (check: U U^-1 = I).

Example continued
- Divide U (and multiply U^-1) by the appropriate normalizing constants; then S = Q Λ Q^T with Q^-1 = Q^T. Why? Stay tuned…

Symmetric Eigen Decomposition
- Theorem: if S is a symmetric matrix, there exists a (unique) eigen decomposition S = Q Λ Q^T, where Q is orthogonal: Q^-1 = Q^T.
- The columns of Q are normalized eigenvectors.
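A numerical sketch of both decompositions. The matrix entries below are an assumption (the slide's matrix is elided in this extract), chosen as a real symmetric matrix whose eigenvalues are 1 and 3, matching the example above:

```python
import numpy as np

# A real symmetric matrix with eigenvalues 1 and 3 (assumed stand-in for
# the slide's elided 2x2 example).
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# General eigen decomposition: S = U Lambda U^-1.
w, U = np.linalg.eig(S)
assert np.allclose(S, U @ np.diag(w) @ np.linalg.inv(U))

# Symmetric case: S = Q Lambda Q^T with Q orthogonal (Q^-1 = Q^T);
# the columns of Q are normalized, mutually orthogonal eigenvectors.
lam, Q = np.linalg.eigh(S)
assert np.allclose(np.linalg.inv(Q), Q.T)
assert np.allclose(S, Q @ np.diag(lam) @ Q.T)
```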

Symmetric Eigen Decomposition (cont.)
- The columns of Q are orthogonal. (Everything is real.)

Time out!
- What do these matrices have to do with text? Recall the m×n term-document matrices. But everything so far needs square matrices, so…

Singular Value Decomposition
- For an m×n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) A = U Σ V^T.
- The columns of U are orthogonal eigenvectors of A A^T.
- The columns of V are orthogonal eigenvectors of A^T A.
- (Figure: illustration of SVD dimensions and sparseness.)

SVD example
- Let A be the 3×2 example matrix (elided in this extract); thus m = 3, n = 2, and its SVD is A = U Σ V^T.
- Typically, the singular values are arranged in decreasing order.

Low-rank Approximation
- SVD can be used to compute optimal low-rank approximations.
- Approximation problem: find A_k of rank k that minimizes the Frobenius norm ||A - X||_F over all rank-k matrices X, where A_k and X are both m×n matrices. Typically we want k << r.
- Solution via SVD: set the smallest r - k singular values to zero, A_k = U diag(σ_1, …, σ_k, 0, …, 0) V^T. In column notation this is a sum of k rank-1 matrices: A_k = σ_1 u_1 v_1^T + … + σ_k u_k v_k^T.

Approximation error
- How good (bad) is this approximation? It is the best possible, measured by the Frobenius norm of the error: ||A - A_k||_F = sqrt(σ_{k+1}^2 + … + σ_r^2), the minimum of ||A - X||_F over all rank-k matrices X.
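The truncated-SVD construction above can be sketched with numpy. The matrix entries are illustrative (the slides' 3×2 example matrix is elided in this extract); the final assertion checks the stated error bound for the rank-1 case, where the Frobenius error reduces to the single discarded singular value σ_2.

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A in the Frobenius norm (via SVD):
    keep the k largest singular values and zero out the rest."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# An illustrative m x n matrix with m=3, n=2 (the shape of the slides'
# example); it has rank r = 2.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

A1 = low_rank_approx(A, 1)

# With r = 2 and k = 1, the Frobenius error is exactly sigma_2.
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - A1, "fro"), s[1])
```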
