版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Information Retrieval Models,PengBo Oct 30, 2010,上次课回顾,Basic Index Techniques Inverted index Dictionary = Query: revenue down P(Q|d1) = (1/8 + 2/16)/2 x (1/8 + 1/16)/2 = 1/8 x 3/32 = 3/256 P(Q|d2) = (1/8 + 2/16)/2 x (0 + 1/16)/2 = 1/8 x 1/32 = 1/256 Ranking: d1 d2,Alternative Models of Text Generati
2、on,Query Model,Query,Doc Model,Doc,Searcher,Writer,Is this the same model?,Retrieval Using Language Models,Query Model,Query,Doc Model,Doc,1,2,3,Query likelihood (1),Document likelihood (2),Model comparison (3),Query Likelihood,P(Q|Dm) 主要问题是估计文档model i.e. smoothing techniques instead of tf.idf weigh
3、ts 检索效果不错 e.g. UMass, BBN, Twente, CMU 问题:处理relevance feedback, query expansion, structured queries困难,Document Likelihood,按P(D|R)/P(D|NR)排序 P(w|R) is estimated by P(w|Qm),Qm is the query or relevance model P(w|NR) is estimated by collection probabilities P(w) 问题是估计relevance model Treat query as gene
4、rated by mixture of topic and background Estimate relevance model from related documents (query expansion) Relevance feedback is easily incorporated Good retrieval results e.g. UMass at SIGIR 01 inconsistent with heterogeneous document collections,Model Comparison,估计query和document模型,进行模型比较 KL diverg
5、ence D(Qm|Dm) 取得了较前两方法更好的效果,Language models: pro & con,Novel way of looking at the problem of text retrieval based on probabilistic language modeling Conceptually simple and explanatory Formal mathematical model Natural use of collection statistics, not heuristics (almost) LMs provide effective retr
6、ieval and can be improved to the extent that the following conditions can be met Our language models are accurate representations of the data. Users have some sense of term distribution.,Comparison With Vector Space,和传统的tf.idf models有一定联系: (unscaled) term frequency is directly in model the probabili
7、ties do length normalization of term frequencies the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking,Comparison With Vector Space,相似点 Term weights based
8、on frequency Terms often used as if they were independent Inverse document/collection frequency used Some form of length normalization used 不同点 Based on probability rather than similarity Intuitions are probabilistic rather than geometric Details of use of document length and term, document, and col
9、lection frequency differ,本次课小结,Latent Semantic Indexing singular value decomposition Matrix Low-rank Approximation LanguageModel Generative model smooth probabilities Mixture model,Resources,The Template Numerical Toolkit (TNT)/tnt/documentation.html The Lemur Toolkit for Language
10、 Modeling and Information Retrieval. /lemur/ CMU/Umass LM and IR system in C(+), currently actively developed.,Thank You!,Q&A,阅读材料,1 IIR Ch12, Ch18 2 M. Alistair, Z. Justin, and H. David, Recommended reading for IR research students SIGIR Forum, vol. 39, pp. 3-14, 2005.,#2 Eval
11、uation,Question a 不能有两个或两个以上的breakeven point 证明:一次检索I,相关文档集为R,设当前为breakeven point,检出文档集为A,检出的相关文档集为Ra,则precision=|Ra|/|A|,recall=|Ra|/|R|,根据breakeven point的定义,precision=recall,推出|R|=|A|。假设再检出k (k0)个文档后,又出现一个breakeven point,则此时的precision=|Ra|/|A|,recall=|Ra|/|R|,推出|A|=|R|。由于|A|=|A|+k,k0,且|A|=|R|,推出矛盾
12、,所以不能有两个或两个以上的breakeven point,注意:当没有检出相关文档时,查全率和查准率都是零,这时是breakevenpoint吗?考虑到这种情况,则可以有两个或两个以上的breakeven point,Matrix Low-rank Approximation for LSI,Eigenvalues & Eigenvectors,Eigenvectors (for a square mm matrix S) How many eigenvalues are there at most?,eigenvalue,(right) eigenvector,Matrix-vector
13、multiplication,has eigenvalues 3, 2, 0 with corresponding eigenvectors,Any vector (say x= ) can be viewed as a combination of the eigenvectors: x = 2v1 + 4v2 + 6v3,Matrix vector multiplication,Thus a matrix-vector multiplication such as Sx (S, x as in the previous slide) can be rewritten in terms of
14、 the eigenvalues/vectors: Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors. Suggestion: the effect of “small” eigenvalues is small.,Eigenvalues & Eigenvectors,For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal,All eigenvalu
15、es of a real symmetric matrix are real.,Example,Let Then The eigenvalues are 1 and 3 (nonnegative, real). The eigenvectors are orthogonal (and real):,Real, symmetric.,Plug in these values and solve for eigenvectors.,Let be a square matrix with m linearly independent eigenvectors Theorem: Exists an e
16、igen decomposition Columns of U are eigenvectors of S Diagonal elements of are eigenvalues of,Eigen/diagonal Decomposition,Diagonal decomposition: why/how,Let U have the eigenvectors as columns:,Then, SU can be written,And S=UU1.,Thus SU=U, or U1SU=,Diagonal decomposition - example,Recall,The eigenv
17、ectors and form,Inverting, we have,Then, S=UU1 =,Recall UU1 =1.,Example continued,Lets divide U (and multiply U1) by,Then, S=,Q,(Q-1= QT ),Why? Stay tuned ,If is a symmetric matrix: Theorem: Exists a (unique) eigen decomposition where Q is orthogonal: Q-1= QT Columns of Q are normalized eigenvectors
18、 Columns are orthogonal. (everything is real),Symmetric Eigen Decomposition,Time out!,What do these matrices have to do with text? Recall m n term-document matrices But everything so far needs square matrices so ,Singular Value Decomposition,For an m n matrix A of rank r there exists a factorization
19、 (Singular Value Decomposition = SVD) as follows:,The columns of U are orthogonal eigenvectors of AAT.,The columns of V are orthogonal eigenvectors of ATA.,Singular Value Decomposition,Illustration of SVD dimensions and sparseness,SVD example,Let,Thus m=3, n=2. Its SVD is,Typically, the singular val
20、ues arranged in decreasing order.,SVD can be used to compute optimal low-rank approximations. Approximation problem: Find Ak of rank k such that Ak and X are both mn matrices. Typically, want k r.,Low-rank Approximation,Frobenius norm,Solution via SVD,Low-rank Approximation,set smallest r-k singular values to zero,column notation: sum of rank 1 matrices,Approximation error,How good (bad) is this approximation? Its the best possible, measur
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 企业年度预算目标设定方案
- 土工格栅铺设专项施工方案
- 矿区水资源利用与管理方案
- 施工材料二次利用方案
- 2026云南昆明市延安医院招聘编外人员备考题库及答案详解(新)
- 2026国检测试控股集团雄安有限公司招聘6人备考题库及完整答案详解一套
- 2026辽宁省钢结构产业协会招聘执行秘书长备考题库含答案详解(典型题)
- 10.1正确行使诉讼权利 课后练习题 统编版高中政治选择性必修二法律与生活
- 2026北京航空航天大学人工智能学院聘用编软件开发工程师、F岗招聘2人备考题库及一套参考答案详解
- 2026上半年四川内江职业技术学院考核招聘教师及专职辅导员10人备考题库参考答案详解
- 职业本科《大学英语》课程标准
- 《旅客运输心理学》高职全套教学课件
- 江苏省2024年中职职教高考文化统考英语试卷
- 创伤救护-止血、包扎、固定、搬运课件
- 盘扣式梁板立柱共用标准层梁模板
- 2024年建设银行合同标准版本(二篇)
- 头部CTA检查技术
- DB11T 489-2024 建筑基坑支护技术规程
- 常用电气图纸制图规范
- 盐城工业职业技术学院单招职业技能测试参考试题库(含答案)
- 直肠恶性肿瘤的个案护理
评论
0/150
提交评论