web文本挖掘-文档资料

上传人：回*** IP属地：广东上传时间：2022-03-06 格式：PPT 页数：89 大小：1.59MB 积分：20 举报 版权申诉

已阅读5页，还剩84页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、2022-3-612022-3-62WEB挖掘挖掘n万维网是目前一个巨大的、分布广泛的全球性信息服务中心，它涉及新闻、广告、消费信息、金融管理、教育、政府、电子商务和许多其它信息服务。WEB还包了丰富和动态的超链接信息，以及WEB页面的访问和使用信息，这为数据挖掘提供了丰富的资源。nWEB对有效的资源和知识发现还是具有极大的挑战性。2022-3-63The Web Has Many Other Rich Structures2022-3-64WEB对有效的资源和知识发现具有挑战性对有效的资源和知识发现具有挑战性n对有效的数据仓库和数据挖掘而言，对有效的数据仓库和数据挖掘而言，WEB似乎似乎太庞

2、大了。太庞大了。nWEB的数据量目前以百兆兆字节计算，而且仍然在迅速地增长。许多机构和社团都把各自大量的可访问信息置于网上。这使得几乎不可能去构造一个数据仓库来复制、存储或集成WEB上的所有数据。2022-3-65WEB对有效的资源和知识发现具有挑战性对有效的资源和知识发现具有挑战性nWEB页面的复杂性远比任何传统的文本文档复页面的复杂性远比任何传统的文本文档复杂得多杂得多nWeb页面缺乏同一的结构，它包含了远比任何一组书籍或其它文本文档多得多的风格和内容。Web可以看作一个巨大的数字图书馆；然而，这一图书馆中的大量文档并不根据任何有关排列次序加以组织。他没有分类索引，更没有按标题、作者、封面

3、页等索引。对在这样的图书馆中搜索希望得到的信息是极具挑战性的。2022-3-66WEB对有效的资源和知识发现具有挑战性对有效的资源和知识发现具有挑战性nWeb是一个动态性极强的信息源是一个动态性极强的信息源。nWeb不仅以极快的速度增长，而且其信息还在不断地发生着更新。新闻、股票市场、公司广告和web服务中心都在不断更新着各自的页面。链接信息和访问记录也在频繁地更新之中。nWeb面对的是一个广泛的行行色色的用户群体。面对的是一个广泛的行行色色的用户群体。n目前因特网上连接有约五千万台工作站，其用户群仍在不断地扩展中。各个用户可以有不同的背景、兴趣和使用目的。大部分用户并不了解信息网络结构，不清

4、楚搜索的高昂代价，极容易在“黑暗”的网络中迷失方向，也极容易在“跳跃式”访问中翻乱不已和在等待一段信息中失去耐心。2022-3-67WEB对有效的资源和知识发现具有挑战性对有效的资源和知识发现具有挑战性nWeb上的信息只有很小的一部分是相关的或有用的。上的信息只有很小的一部分是相关的或有用的。n据说99%的web信息对99%的用户是无用的。虽然这看起来不是很明显，但一个人只是关心web上的很小很小一部分信息确是事实，web所包含的其余信息对用户来说是不感兴趣的，而且会淹没所希望得到的搜索结果。2022-3-68 Finding Information On the Web两种获得两种获得web

5、信息的方法信息的方法:nBrowsing: From a starting point, navigate through hyperlinks to find desired documents.nYahoos category hierarchy facilitates browsing.nSearching: Submit a query to a search engine to find desired documents.nMany well-known search engines on the Web: Google, MSN, Yahoo, AltaVista, Fast,

6、Lycos, , etc.nSearching is the second most popular activities on the Web behind email.2022-3-69 浏览和查找方式的比较浏览和查找方式的比较nCategory hierarchy is built mostly manually while search engine databases can be created automatically.nSearch engines can index much more documents than a category hierarchy.nBrowsin

7、g is more accurate and more focused (less junk will be encountered) than searching.2022-3-610WEB挖掘挖掘n尽管传统的搜索引擎和新一代的搜索引擎Google等在一定程度上满足了人们信息检索的需要，但搜索引擎的查全率、查准率都不尽如人意。于是，人们想到了数据挖掘技术，将传统的数据挖掘同web结合起来进行web挖掘，从web文档和web活动中抽取用户感性趣的潜在的有用模式和隐藏信息，弥补了搜索引擎的不足。2022-3-611WEB挖掘及相关概念挖掘及相关概念nWeb挖掘时针对网络信息资源进行的，其中涉及一

8、些同传统的数据挖掘不同的知识和概念。比如IP地址、网页的HIML语言、WEB页面的URL地址和WEB服务器的日志记录等。2022-3-612WEB挖掘及相关概念挖掘及相关概念nWEB挖掘中用到的术语和概念nwww组织在1999年制定了一套规范的web范围相关术语。nIP地址/域名。nURL统一资源定位器n超级链接hyperlinkn超文本标记语言htmlnXml可扩展标记语言n代理服务器proxy servernWeb服务日志n搜索引擎n网络蜘蛛2022-3-613搜索引擎搜索引擎为什么查询很重要？为什么查询很重要？n信息就在你的指尖nFundamental & pervasiven在

9、线广告n查询是一个在线广告的分布渠道2022-3-614搜索引擎搜索引擎为什么查询很重要？为什么查询很重要？2022-3-615搜索引擎的发展搜索引擎的发展nWeb Search 1.0 Traditional Text RetrievalnWeb Search 2.0 Page-level Relevance RankingnWeb Search 3.0 Object-level Structured Search2022-3-616The Trend in Web Search1990Mid 199020042022-3-617Web Search 1.0: Traditional Tex

10、t RetrievalnRelevance ranking based on term distributionn Term frequency (TF) * Inverse document frequency (IDF)n Language modelsn 2022-3-618Web Search 1.0: Traditional Text Retrieval2022-3-619Web Page Has Richer Structure Than Plain Textn Different term types and formatsn Hyperlink structuren 2D vi

11、sual layout structure2022-3-620Web Search 1.0 Web Search 2.0nThe first major improvement in the history of Web searchn Link analysis PageRank & HITSn Relevance ranking = IR Score + PageRank2022-3-621Web Search 2.0 Web Search 3.02022-3-622Object Level Vertical Search (MSRA Libra) Object Level Ver

12、tical Search (MSRA Libra)2022-3-623Web Object Identification2022-3-624Object-level Link Analysis2022-3-6252022-3-626网络蜘蛛网络蜘蛛n网络蜘蛛即Web Spider，是一个很形象的名字。把互联网比喻成一个蜘蛛网，那么Spider就是在网上爬来爬去的蜘蛛。网络蜘蛛是通过网页的链接地址来寻找网页，从网站某一个页面（通常是首页）开始，读取网页的内容，找到在网页中的其它链接地址，然后通过这些链接地址寻找下一个网页，这样一直循环下去，直到把这个网站所有的网页都抓取完为止。如果把整个互联网当

13、成一个网站，那么网络蜘蛛就可以用这个原理把互联网上所有的网页都抓取下来。 2022-3-627The previous Web: Search used to be “crawl and index”2022-3-628 网络蜘蛛网络蜘蛛A Web crawler (also known as spider, robot) is a program for fetching web pages from the Web.Main idea:Place some initial URLs into a URL queue.Repeat the steps below until the queu

14、e is emptynTake the next URL from the queue and fetch the web page using HTTP.nExtract new URLs from the downloaded web page and add them to the queue.nA free Web crawler: /rcm/websphinx/ 2022-3-629网络蜘蛛网络蜘蛛n对于搜索引擎来说，要抓取互联网上所有的网页几乎是不可能的，从目前公布的数据来看，容量最大的搜索引擎也不过是抓取了整个网页数量的百分之四十左右。n这

15、其中的原因一方面是抓取技术的瓶颈，无法遍历所有的网页，有许多网页无法从其它网页的链接中找到；2022-3-630网络蜘蛛网络蜘蛛n另一个原因是存储技术和处理技术的问题，如果按照每个页面的平均大小为20K计算（包含图片），100亿网页的容量是1002000G字节，即使能够存储，下载也存在问题（按照一台机器每秒下载20K计算，需要340台机器不停的下载一年时间，才能把所有网页下载完毕）。n同时，由于数据量太大，在提供搜索时也会有效率方面的影响。因此，许多搜索引擎的网络蜘蛛只是抓取那些重要的网页，而在抓取的时候评价重要性主要的依据是某个网页的链接深度。 2022-3-631网络蜘蛛网络蜘蛛n在抓取网

16、页的时候，网络蜘蛛一般有两种策略：广度优先和深度优先（如下图所示）。广度优先是指网络蜘蛛会先抓取起始网页中链接的所有网页，然后再选择其中的一个链接网页，继续抓取在此网页中链接的所有网页。这是最常用的方式，因为这个方法可以让网络蜘蛛并行处理，提高其抓取速度。2022-3-632网络蜘蛛网络蜘蛛n深度优先是指网络蜘蛛会从起始页开始，一个链接一个链接跟踪下去，处理完这条线路之后再转入下一个起始页，继续跟踪链接。这个方法有个优点是网络蜘蛛在设计的时候比较容易。两种策略的区别，下图的说明会更加明确。 2022-3-633网络蜘蛛网络蜘蛛2022-3-634网络蜘蛛网络蜘蛛n由于不可能抓取所有的网页，有些

17、网络蜘蛛对一些不太重要的网站，设置了访问的层数。例如，在上图中，A为起始网页，属于0层，B、C、D、E、F属于第1层，G、H属于第2层，I属于第3层。n如果网络蜘蛛设置的访问层数为2的话，网页I是不会被访问到的。这也让有些网站上一部分网页能够在搜索引擎上搜索到，另外一部分不能被搜索到。对于网站设计者来说，扁平化的网站结构设计有助于搜索引擎抓取其更多的网页。 2022-3-635网络蜘蛛网络蜘蛛n网络蜘蛛在访问网站网页的时候，经常会遇到加密数据和网页权限的问题，有些网页是需要会员权限才能访问。n网站的所有者可以通过协议让网络蜘蛛不去抓取，但对于一些出售报告的网站，他们希望搜索引擎能搜索到他们的

18、报告，但又不能完全免费的让搜索者查看，这样就需要给网络蜘蛛提供相应的用户名和密码。n网络蜘蛛可以通过所给的权限对这些网页进行网页抓取，从而提供搜索。而当搜索者点击查看该网页的时候，同样需要搜索者提供相应的权限验证。 2022-3-636网站与网络蜘蛛 n网络蜘蛛需要抓取网页，不同于一般的访问，如果控制不好，则会引起网站服务器负担过重。今年4月，淘宝网（ http:/ ）就因为雅虎搜索引擎的网络蜘蛛抓取其数据引起淘宝网服务器的不稳定。n网站是否就无法和网络蜘蛛交流呢？其实不然，有多种方法可以让网站和网络蜘蛛进行交流。一方面让网站管理员了解网络蜘蛛都来自哪儿，做了些什么，另一方面也告诉网络蜘蛛哪些

19、网页不应该抓取，哪些网页应该更新。 2022-3-637网络蜘蛛网络蜘蛛n每个网络蜘蛛都有自己的名字，在抓取网页的时候，都会向网站标明自己的身份。网络蜘蛛在抓取网页的时候会发送一个请求，这个请求中就有一个字段为Useragent，用于标识此网络蜘蛛的身份。n例如:nGoogle网络蜘蛛的标识为GoogleBot，nBaidu网络蜘蛛的标识为BaiDuSpider，nYahoo网络蜘蛛的标识为Inktomi Slurp。n如果在网站上有访问日志记录，网站管理员就能知道，哪些搜索引擎的网络蜘蛛过来过，什么时候过来的，以及读了多少数据等等。如果网站管理员发现某个蜘蛛有问题，就通过其标识来和其所有者联

20、系。 2022-3-638网络蜘蛛网络蜘蛛2022-3-639网络蜘蛛网络蜘蛛n网络蜘蛛进入一个网站，一般会访问一个特殊的文本文件Robots.txt，这个文件一般放在网站服务器的根目录下，如： http:/ 。n网站管理员可以通过robots.txt来定义哪些目录网络蜘蛛不能访问，或者哪些目录对于某些特定的网络蜘蛛不能访问。2022-3-640网络蜘蛛网络蜘蛛n例如有些网站的可执行文件目录和临时文件目录不希望被搜索引擎搜索到，那么网站管理员就可以把这些目录定义为拒绝访问目录。n当然，Robots.txt只是一个协议，如果网络蜘蛛的设计者不遵循这个协议，网站管理员也无法阻止网络蜘蛛对于某些页面

21、的访问，但一般的网络蜘蛛都会遵循这些协议，而且网站管理员还可以通过其它方式来拒绝网络蜘蛛对某些网页的抓取。 2022-3-641网络蜘蛛网络蜘蛛n现在一般的网站都希望搜索引擎能更全面的抓取自己网站的网页，因为这样可以让更多的访问者能通过搜索引擎找到此网站。n为了让本网站的网页更全面被抓取到，网站管理员可以建立一个网站地图，即Site Map。n许多网络蜘蛛会把sitemap.htm文件作为一个网站网页爬取的入口，网站管理员可以把网站内部所有网页的链接放在这个文件里面，那么网络蜘蛛可以很方便的把整个网站抓取下来，避免遗漏某些网页，也会减小对网站服务器的负担。 2022-3-642更新周期更新周期

22、 n由于网站的内容经常在变化，因此网络蜘蛛也需不断的更新其抓取网页的内容，这就需要网络蜘蛛按照一定的周期去扫描网站，查看哪些页面是需要更新的页面，哪些页面是新增页面，哪些页面是已经过期的死链接。 2022-3-643更新周期更新周期 n搜索引擎的更新周期对搜索引擎搜索的查全率有很大影响。n如果更新周期太长，则总会有一部分新生成的网页搜索不到；n周期过短，技术实现会有一定难度，而且会对带宽、服务器的资源都有浪费。n搜索引擎的网络蜘蛛并不是所有的网站都采用同一个周期进行更新，对于一些重要的更新量大的网站，更新的周期短，如有些新闻网站，几个小时就更新一次；相反对于一些不重要的网站，更新的周期就长，可

23、能一两个月才更新一次。2022-3-644更新周期更新周期 n一般来说，网络蜘蛛在更新网站内容的时候，不用把网站网页重新抓取一遍，对于大部分的网页，只需要判断网页的属性（主要是日期），把得到的属性和上次抓取的属性相比较，如果一样则不用更新。 2022-3-645 网络蜘蛛网络蜘蛛一些具体的问题一些具体的问题:What initial URLs to use? Choice depends on type of search engines to be built.nFor general-purpose search engines, use URLs that are likely to r

24、each a large portion of the Web such as the Yahoo home page.1.For local search engines covering one or several organizations, use URLs of the home pages of these organizations. In addition, use appropriate domain constraint.2022-3-646 网络蜘蛛网络蜘蛛Several research issues about Web crawlers:nFocused-crawl

25、ing: Fetch web pages in a specified subject area such as movies and sports for creating domain-specific search engines.nEfficient re-crawling: Re-fetch web pages to keep web page index up-to-date.nDeep-crawling: Crawl documents in the deep Web.2022-3-647Deep webnStatic Web vs. Dynamic WebnCriterion:

26、 Frequency of changenVisible Web vs. Invisible WebnCriterion: can be “seen” by crawlersnSurface Web vs. Deep WebnCriterion: protocol & database support2022-3-648The previous Web: Search used to be “crawl and index”2022-3-649The current Web: Search must eventually resort to integration2022-3-650链

27、接分析链接分析n在过去几年之内，Google成为了全世界被使用的最多的搜索引擎。与其它搜索引擎比较，除高性能和易用以外，一个决定性的因素是它的优秀的搜索结果。搜索结果的这质量极大地来源于PageRank一个精密的排序网页文件等级的方式。 2022-3-651链接分析链接分析n从万维网的早期，搜索引擎开发不同的方法排序网页。实际上，任一个搜索引擎对网页的排序，是根据搜索的词组短语在页面中的出现次数，并用页面长度和html标签的重要性提示等进行权重修订。2022-3-652链接分析链接分析n为了得到更好的搜索结果，连接人气值的概念开始被开发了。根据这个概念，一个网页文件的入链数量通常表示此文件的重

28、要程度。因此，一般地，如果从其他网页链接到一个网页的数量越多，那么这个网页就越重要。链接人气值的概念通常可以避免那些只被创造出来欺骗搜索引擎并且没有任何实际意义的网页得到好的等级。2022-3-653链接分析链接分析n如此, 在PageRank概念中，文件的等级由与它连接那些文件的等级决定的。它们的等级再由与他们连接文件的等级决定。因此, 文件的PageRank由其他文件的PageRank总递归之和确定。因为，即使是在边缘的少量链接，任一个文件的等级都会影响些其他文件的等级，概言之，PageRank的等级是由整个网的连接结构决定的。虽然这种方法似乎是非常宽泛和复杂的, Page和Brin已经能

29、够通过一个微不足道的运算法则将它投入实践了。 2022-3-654链接分析链接分析Vector spread activation (Yuwono 97)nThe final ranking score of a page p is the sum of its regular similarity and a portion of the similarity of each page that points to p.nRationale: If a page is pointed to by many relevant pages, then the page is also likel

30、y to be relevant.Let sim(q, di) be the regular similarity between q and di; rs(q, di) be the ranking score of di with respect to q; link(j, i) = 1 if dj points to di, = 0 otherwise. rs(q, di) = sim(q, di) + link(j, i) sim(q, dj) = 0.2 is a constant parameter.2022-3-655 链接分析链接分析-PAGERANKPageRank cita

31、tion ranking (Page, Brin, et al. The PageRank Citation Ranking: Bring Order to the Web. Technical report, Stanford U. 98).nWeb can be viewed as a huge directed graph G(V, E), where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges).nEach page may have a number of ou

32、tgoing edges (forward links) and a number of incoming links (backlinks).nEach backlink of a page represents a citation to the page.nPageRank is a measure of global web page importance based on the backlinks of web pages.2022-3-656 链接分析链接分析-PAGERANKPageRank is based on the following basic ideas:nIf a

33、 page is linked to by many pages, then the page is likely to be important.nIf a page is linked to by important pages, then the page is likely to be important even though there arent too many pages linking to it.nThe importance of a page is divided evenly and propagated to the pages pointed to by it.

34、10552022-3-657 链接分析链接分析-PAGERANKPageRank DefinitionLet u be a web page, Fu be the set of pages u points to, Bu be the set of pages that point to u, Nu = |Fu| be the number of pages in Fu.The rank (importance) of a page u can be defined by: R(u) = ( R(v) / Nv ) v Bu2022-3-658 链接分析链接分析-PAGERANKPageRan

35、k is defined recursively and can be computed iteratively.nInitiate all page ranks to be 1/N, N is the number of vertices in the Web graph.nIn i-th iteration, the rank of a page is computed using the ranks of its parent pages in (i-1)th iteration. Repeat until all ranks converge.Let Ri(u) be the rank

36、 of page u in ith iteration and R0(u) be the initial rank of u. Ri(u) = ( Ri-1(v) / Nv ) v Bu2022-3-659链接分析链接分析-PAGERANKn r I 1=0* r i-1 1 + 1/2* r i-1 2 + 0* r i-1 3 + 0* r i-1 4 n r I 2=0* r i-1 1 + 1/2* r i-1 2 + 0* r i-1 3 + 0* r i-1 4 0 0 1RI=ri1ri2ri3ri4ri-1 1ri-1 2ri-1 3ri-1 4=2022-3-660 链接分析

37、链接分析-PAGERANKMatrix representation Let M be an N N matrix and muv be the entry at the u-th row and v-th column. muv = 1/Nv if page v has a link to page u muv = 0 if there is no link from v to u Let Ri be the N 1 rank vector for i-th iteration and R0 be the initial rank vector. Then Ri = M Ri-12022-3

38、-661 链接分析链接分析-PAGERANKExample: Consider the following partial Web graph: Ri = M = Ri+1 =ABCD0 0 1 ABCDA B C Dri1ri2ri3ri4ri-1 1ri-1 2ri-1 3ri-1 42022-3-662 链接分析链接分析-PAGERANKIf the ranks converge, i.e., there is a rank vector R such that R = M R, R is the eigenvector of matrix M with eigenvalue being

39、 1.Convergence is guaranteed only ifnM is aperiodic (the Web graph is not a big cycle). This is practically guaranteed for Web.nM is irreducible (the Web graph is strongly connected). This is usually not true.2022-3-663 链接分析链接分析-PAGERANKRank sink: A page or a group of pages is a rank sink if they ca

40、n receive rank propagation from its parents but cannot propagate rank to other pages.Rank sink causes the loss of total ranks.Example:ABCD(C, D) is a rank sink2022-3-664 链接分析链接分析-PAGERANKA solution to the non-irreducibility and rank sink problem.nConceptually add a link from each page v to every pag

41、e (include self).nIf v has no forward links originally, make all entries in the corresponding column in M be 1/N.nIf v has forward links originally, replace 1/Nv in the corresponding column by c 1/Nv and then add (1-c) 1/N to all entries, 0 c 0 = 0, otherwise where 0 w 1.nBoth sim(q, d) and R(d) nee

42、d to be normalized to between 0, 1.2022-3-670链接分析链接分析-PAGERANKnPageRank算法实质上是一种通过离线对整个互联网结构图进行幂迭代的方法.nPageRank所计算出的价值度的值实际上就是互联网结构图经过修改后的相邻矩阵的特征值.对这些值的计算有非常有效的方法(事实上,仅需要若干次的迭代计算即可以得到),因此能够很好地应用到整个互联网规模的实践中.n这种方法的另一个主要优点是所有的处理过程都是离线进行的,因此不会为在线的查询过程付出额外的代价.2022-3-671链接分析链接分析-PAGERANKn但是,PageRank算法也同样存

43、在一个显著的问题,即价值度的计算是不是针对查询的.n对于某个特定主题的查询,在返回结果中一些与主题无关的“强壮”网页将会排在较前的位置.n比如,PageRank会把网页网页排在链接分析链接分析-PAGERANKnPageRank defines the global importance of web pages but the importance is domain/topic independent.nWe often need to find important/authoritative pages which are relevant to a given query.nWhat

44、 are important web browser pages?nWhich pages are important game pages?nKleinberg (Kleinberg 98) proposed to use authority and hub scores to measure the importance of a web page with respect to a given query.2022-3-673 链接分析链接分析-HITSThe basic idea:nA page is a good authoritative page with respect to

45、a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query.nA page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query.nGood authoritative pages (authorities) and good hub pages (hub

46、s) reinforce each other.2022-3-674 链接分析链接分析-HITSnAuthorities and hubs related to the same query tend to form a bipartite subgraph of the web graph.nA web page can be a good authority and a good hub.hubsauthorities2022-3-675 链接分析链接分析-HITSMain steps of the algorithm for finding good authorities and hu

47、bs related to a query q.1.Submit q to a regular similarity-based search engine. Let S be the set of top n pages returned by the search engine. (S is called the root set and n is often in the low hundreds).2022-3-676 链接分析链接分析-HITSMain steps of the algorithm for finding good authorities and hubs relat

48、ed to a query q.Expand S into a large set T (base set):Add pages that are pointed to by any page in S.2.Add pages that point to any page in S. If a page has too many parent pages, only the first k parent pages will be used for some k.2022-3-677 链接分析链接分析-HITS3. Find the subgraph SG of the web graph t

49、hat is induced by T.ST2022-3-678 链接分析链接分析-HITSCompute the authority score and hub score of each web page in T based on the subgraph SG(V, E). Given a page p, let a(p) be the authority score of p h(p) be the hub score of p (p, q) be a directed edge in E from p to q. Two basic operations:nOperation I:

50、 Update each a(p) as the sum of all the hub scores of web pages that point to p.nOperation O: Update each h(p) as the sum of all the authority scores of web pages pointed to by p.2022-3-679 链接分析链接分析-HITSOperation I: for each page p: a(p) = h(q) q: (q, p) EOperation O: for each page p: h(p) = a(q) q:

51、 (p, q) Eq1q2q3pq3q2q1p2022-3-680 链接分析链接分析-HITSMatrix representation of operations I and O.Let A be the adjacency matrix of SG: entry (p, q) is 1 if p has a link to q, else the entry is 0.Let AT be the transpose of A.Let hi be the vector of hub scores after i iterations.Let ai be the vector of autho

52、rity scores after i iterations. Operation I: ai = AT hi-1 Operation O: hi = A ai2022-3-681 链接分析链接分析-HITS After each iteration of applying Operations I and O, normalize all authority and hub scores. Repeat until the scores for each page converge (the convergence is guaranteed).5. Sort pages in descen

53、ding authority scores.6. Display the top authority pages.Vqqapapa2)()()(Vqqhphph2)()()(2022-3-682 链接分析链接分析-HITSAlgorithm (summary) submit q to a search engine to obtain the root set S; expand S into the base set T; obtain the induced subgraph SG(V, E) using T; initialize a(p) = h(p) = 1 for all p in

54、 V; for each p in V until the scores converge apply Operation I; apply Operation O; normalize a(p) and h(p); return pages with top authority scores;2022-3-683 链接分析链接分析-HITSExample: Initialize all scores to 1.1st Iteration: I operation: a(q1) = 1, a(q2) = a(q3) = 0, a(p1) = 3, a(p2) = 2 O operation:

55、h(q1) = 5, h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0 Normalization: a(q1) = 0.267, a(q2) = a(q3) = 0, a(p1) = 0.802, a(p2) = 0.535, h(q1) = 0.645, h(q2) = 0.387, h(q3) = 0.645, h(p1) = 0.129, h(p2) = 0q1q2q3p1p22022-3-684 链接分析链接分析-HITSAfter 2 Iterations: a(q1) = 0.061, a(q2) = a(q3) = 0, a(p1) = 0.791, a(p2) = 0.609, h(q1) = 0.656, h(q2) = 0.371, h(q3) = 0.656, h(p1) = 0.029, h(p2) = 0After 5 Iter

人人文库> 全部分类> 专业文献 > 工程机械

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

web文本挖掘-文档资料

文档简介

温馨提示

最新文档

评论

web文本挖掘-文档资料

文档简介

温馨提示

最新文档

评论

相关文档