




United States Patent 5,920,859
Li
July 6, 1999

Hypertext document retrieval system and method

Abstract

A search engine for retrieving documents pertinent to a query indexes documents in accordance with hyperlinks pointing to those documents. The indexer traverses the hypertext database and finds hypertext information including the address of the document the hyperlinks point to and the anchor text of each hyperlink. The information is stored in an inverted index file, which may also be used to calculate document link vectors for each hyperlink pointing to a particular document. When a query is entered, the search engine finds all document vectors for documents having the query terms in their anchor text. A query vector is also calculated, and the dot product of the query vector and each document link vector is calculated. The dot products relating to a particular document are summed to determine the relevance ranking for each document.

Claims

I claim:

1. A method of indexing documents, the method comprising:
obtaining a list of hyperlinks pointing to each document, wherein each hyperlink includes one or more terms;
indexing each document with the terms in the hyperlinks pointing to that document, wherein a number of hyperlinks, each containing a particular term, may point to a document; and
indexing the number of hyperlinks containing the particular term pointing to the document with that document.

2. The method of claim 1 wherein:
a particular term may appear in hyperlinks pointing to a number of documents; and
the number of documents having the particular term in hyperlinks pointing to those documents is indexed with that term.

3. The method of claim 2 wherein the indexing comprises creating a file listing:
each term;
the number of documents having that term in hyperlinks pointing to those documents;
a document identifier for each document having that term in hyperlinks pointing to that document; and
the number of hyperlinks containing that term pointing to each identified document.

4. The method of claim 1 wherein:
a particular term may appear in hyperlinks pointing to a number of documents; and
the number of documents having the particular term in hyperlinks pointing to those documents is indexed with a document identifier for each document having the particular term in a hyperlink pointing to that document.

5. The method of claim 4 wherein each document having a particular term in a hyperlink pointing to that document is indexed with an inverse of the number of documents having the particular term in hyperlinks pointing to those documents.

6. The method of claim 1 wherein:
a term may appear a number of times in a hyperlink pointing to a document; and
the number of times each term appears in a hyperlink is indexed with the document pointed to by the hyperlink.

7. The method of claim 1 wherein the terms are stemmed words.

8. An apparatus comprising means for performing the method of claim 1.

9. A computer-readable memory device comprising a set of instructions for performing the method of claim 1.

10. A method of ranking documents based on the documents' relevance to a query, wherein the query comprises at least one term, and wherein hyperlinks contain terms and point to corresponding documents, the method comprising:
comparing the words in the query to the words in a hyperlink to obtain a relevance ranking for each hyperlink; and
summing the relevance rankings for each hyperlink pointing to a particular document to obtain a summed relevance score for that document.

11. The method of claim 10 wherein:
a number of hyperlinks, each containing a particular term, may point to a document; and
the number of hyperlinks containing the particular term pointing to the document is indexed with that document.

12. The method of claim 11 wherein:
a particular term may appear in hyperlinks pointing to a number of documents; and
the number of documents having a particular term in hyperlinks pointing to those documents is indexed with that term.

13. The method of claim 12 comprising the creation of a list wherein the list indexes:
each term;
the number of documents having hyperlinks pointing to those documents;
a document identifier for each document; and
the number of hyperlinks containing that term pointing to each document.

14. The method of claim 10 wherein:
a particular term may appear in hyperlinks pointing to a number of documents; and
the number of documents having the particular term in hyperlinks pointing to those documents is indexed with a document identifier for each document having the particular term in a hyperlink pointing to that document.

15. The method of claim 14 wherein each document having a particular term in a hyperlink pointing to that document is indexed with an inverse of the number of documents having the particular term in hyperlinks pointing to those documents.

16. The method of claim 10 wherein:
a term may appear a number of times in a hyperlink pointing to a document; and
the number of times each term appears in a hyperlink is indexed with the document pointed to by the hyperlink.

17. The method of claim 10 wherein the terms are stemmed words.

18. The method of claim 10 wherein:
the query is represented by a query vector wherein the query vector contains a dimension for each term in the query; and
each document is represented by document link vectors for each hyperlink pointing to the document, wherein each document link vector contains a dimension for each term in the corresponding hyperlink pointing to that document.

19. The method of claim 18 wherein comparing the words in the query to the words in the hyperlink comprises calculating the dot product of the query vector with the document link vector for that hyperlink.

20. The method of claim 19 wherein summing the relevance ranking for each hyperlink pointing to a document comprises summing the dot products obtained using the document link vectors for a particular document to obtain the summed relevance score for that document.

21. The method of claim 20 wherein the summed relevance scores for each document are compared to obtain a ranking of documents.

22. The method of claim 18 wherein the dimension for a term in a query vector is related to the inverse of the number of documents having a respective hyperlink containing that term pointing to those documents.

23. The method of claim 18 wherein the dimension for a term in a document link vector is related to the inverse of the number of documents having a respective hyperlink containing that term pointing to those documents.

24. An apparatus comprising means for performing the method of claim 10.

25. A computer-readable memory device comprising a set of instructions for performing the method of claim 10.

Description

FIELD OF INVENTION

The present invention relates to hypertext document retrieval, and more particularly to systems and methods of searching databases distributed over wide-area networks such as the World Wide Web.

BACKGROUND OF THE ART

A hypertext is a database system which provides a unique and non-sequential method of accessing information using nodes and links. Nodes, i.e. documents or files, contain text, graphics, audio, video, animation, images, etc., while links connect the nodes or documents to other nodes or documents. The most popular hypertext or hypermedia system is the World Wide Web, which links various nodes or documents together using hyperlinks, thereby allowing the non-linear organization of text on the web.

A hyperlink is a relationship between two anchors, called the head and the tail of the hyperlink. The head anchor is the destination node or document and the tail anchor is the document or node from which the link begins.
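
The head and tail terminology, together with the file layout recited in claim 3, can be illustrated with a minimal Python sketch. The `Hyperlink` record, the `build_anchor_index` function, and the sample file names are all hypothetical and chosen only for illustration. Terms are split on whitespace and lowercased; the stemming of claim 7 is left out. Each term is counted once per hyperlink, following claim 1.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperlink:
    tail: str         # document the link starts from (tail anchor)
    head: str         # document the link points to (head anchor)
    anchor_text: str  # terms carried by the link


def build_anchor_index(links):
    """Map each anchor term to {head document: number of links containing it}."""
    index = defaultdict(lambda: defaultdict(int))
    for link in links:
        # Count a term once per hyperlink, even if it repeats in the anchor text.
        for term in set(link.anchor_text.lower().split()):
            index[term][link.head] += 1
    return index


# Hypothetical sample links, for illustration only.
links = [
    Hyperlink("a.html", "java.html", "Java tutorial"),
    Hyperlink("b.html", "java.html", "Java tutorial for beginners"),
    Hyperlink("c.html", "coffee.html", "Java coffee"),
]
index = build_anchor_index(links)
for term in sorted(index):
    postings = dict(index[term])
    # term, number of documents, then link count per document identifier
    print(term, len(postings), postings)
```

Each printed row carries the four items listed in claim 3: the term, the number of documents reached by links containing it, the document identifiers, and the per-document link counts.
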
On the web, hyperlinks are generally identified by underscoring or highlighting certain text or graphics in a tail anchor document. When a user reviewing the tail document "clicks" on the highlighted or "anchor-text" material, the hyperlink automatically connects the user's computer with, or points to, the head anchor document for that particular hyperlink.

A hypertext system generally works well when a user has already found a tail document pertaining to the subject matter of interest to that user. The hyperlinks in the tail document are created by the author of the document, who generally will have reviewed the material in the head documents of the hyperlinks. Thus, a user clicking on a hyperlink has a high degree of certainty that the material in the head document has some pertinence to the anchor text in the tail document of the hyperlink.

As the popularity of the Internet and the Web has grown, finding relevant documents has become increasingly difficult. If a user is unable to find a first document pertaining to the subject matter of interest, the user will of course not be able to use hyperlinks to find additional pertinent documents. Moreover, the location of a single relevant document may not lead to other documents if the author of the relevant document has not created hyperlinks to other relevant web sites. The proliferation of information has, therefore, led to the development of various search engines which assist users in finding information. Numerous search engines such as Excite, Infoseek, and Yahoo! are now available to users of the Web.

Search engines usually take a user query as input and attempt to find documents related to that query. Queries are usually in the form of several words which describe the subject matter of interest to the user. Most search engines operate by comparing the query to an index of a document collection in order to determine if the content of one or more of those documents matches the query. Since most casual users of search engines do not want to type in long, specific queries and tend to search on popular topics, there may be thousands of documents that are at least tangentially related to the query. When a search engine has indexed a large document collection, such as the Web, it is particularly likely that a very large number of documents will be found that have some relevance to the query. Most search engines, therefore, output a list of documents to the user where the documents are ranked by their degree of pertinence to the query and/or where documents having a relatively low pertinence are not identified to the user. Thus, the way in which a search engine determines the relevance ranking is extremely important in order to limit the number of documents a user must review to satisfy that user's information needs.

Almost all ranking techniques of search engines depend on the frequency of query terms in a given document.
When other related factors are the same, the higher a term's frequency in a given document, the higher the relevance score of this document to a query including that term. Factors other than term frequency, such as document frequency, i.e. how many documents contain the term, may also be taken into account in determining a relevance score. Once the various factors such as term frequency or document frequency have been determined for a particular query, various models such as the vector space model, probabilistic model, fuzzy logic models, etc. are used to develop a numerical relevance ranking. See Harman, D., "Ranking Algorithms", Chapter 14, Information Retrieval (Prentice Hall, 1992).

For instance, in the vector space model, a user query Q is represented as a vector where each query term (qt) is represented as a dimension of a query vector. Documents in the database are also represented by vectors (D), with each term or key word (dt) in the document represented as a dimension in the vector. The relevance score is then calculated as the dot product of Q and D.

The calculation of the value of each dimension for vectors Q or D may be weighted in a variety of ways. The most popular term-weighting formula is:

$\mathrm{TF} \times \mathrm{IDF}_t$

where TF is the term frequency of a given term in a document or query, and $\mathrm{IDF}_t$ is the inverse document frequency of the term. The inverse document frequency is the inversion of how many documents in the whole document collection contain the term, i.e.: [equation EQU1 is not reproduced in this copy]. Using an inverse document frequency ensures that junk words such as "the", "of", "as", etc. do not have a high weight. In addition, when a query uses multiple terms, and one of those terms appears in many documents, using an IDF weighting gives a lower ranking to documents containing that term, and a higher ranking to documents containing other terms in the query.

There are normalized versions of term weighting, which take into account the length of a document including a particular term. The assumption made is that the more frequently a term appears in a document for a given amount of text, the more likely that document is relevant to a query including that term. That assumption may not be true, however, in many cases. For example, if the query is "Java tutorial", a document (call it J) which contains 100 lines, with each line consisting of just the phrase "Java tutorial", would get a very high relevance score and would be output by a search engine as one of the most relevant documents to the user. That document, however, would be useless to the user since it provides no information about a Java tutorial. What the user really needs is a good tutorial for the Java programming language such as found on Sun's Java tutorial site (http:/J/tutorial).
Unfortunately, the phrase "Java tutorial" does not occur 100 times on Sun's site, and therefore most search engines would incorrectly find Sun's site to be less pertinent, and thus have a lower relevance ranking, than Document J.

Documents such as Document J might not be included in a traditional database because each document in a traditional database is selected or authored for its content rather than the repetition of certain key words. On the Web, where anyone can be a publisher, there is no one to select or screen out documents such as J. In fact, some people intentionally draft their documents so that the documents will be retrieved on the top of a ranked list output by search engines that take into account term frequency or normalized term frequency
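
The Document J problem, and the contrast with the anchor-text ranking recited in claims 10 and 18 to 20, can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the patent's implementation. The function names, document texts, and link data are hypothetical, and the weight of each term is assumed to be 1 divided by the number of documents whose incoming hyperlinks contain that term, which is only one possible reading of "related to the inverse" in claims 22 and 23.

```python
from collections import Counter, defaultdict


def term_frequency_score(query, text):
    """Content-based score: total occurrences of the query terms in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query.lower().split())


def anchor_text_scores(query, links):
    """Per head document, sum the dot products of the query vector with one
    link vector per incoming hyperlink (claims 18 to 20)."""
    # Document frequency: distinct head documents whose anchor text has the term.
    heads_with_term = defaultdict(set)
    for head, anchor in links:
        for term in set(anchor.lower().split()):
            heads_with_term[term].add(head)
    # ASSUMPTION: weight = 1 / document frequency (claims 22 and 23 only say
    # the dimension is "related to" this inverse).
    weight = {t: 1.0 / len(h) for t, h in heads_with_term.items()}

    query_vec = {t: weight.get(t, 0.0) for t in set(query.lower().split())}
    scores = defaultdict(float)
    for head, anchor in links:
        link_vec = {t: weight[t] for t in set(anchor.lower().split())}
        scores[head] += sum(q * link_vec.get(t, 0.0) for t, q in query_vec.items())
    return dict(scores)


query = "Java tutorial"
# Hypothetical document contents.
doc_j_text = "java tutorial " * 100
sun_text = "a tutorial for the java programming language"
# Hypothetical (head document, anchor text) pairs.
links = [
    ("sun.html", "Java tutorial"),
    ("sun.html", "the Java tutorial"),
    ("sun.html", "Java"),
    ("doc_j.html", "click here"),
]

print("content score, Document J:", term_frequency_score(query, doc_j_text))
print("content score, Sun site:  ", term_frequency_score(query, sun_text))
print("anchor-text scores:       ", anchor_text_scores(query, links))
```

With this toy data the content-based score favors Document J (200 versus 2), while the anchor-text score favors the site that other authors actually link to with the query terms (5.0 versus 0.0). Repeating a phrase inside a document does not change what other authors write in their hyperlinks.
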