chapter05-part01-structure_网络结构挖掘.ppt_第1页
chapter05-part01-structure_网络结构挖掘.ppt_第2页
chapter05-part01-structure_网络结构挖掘.ppt_第3页
chapter05-part01-structure_网络结构挖掘.ppt_第4页
chapter05-part01-structure_网络结构挖掘.ppt_第5页
已阅读5页,还剩27页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、2008年6月14日,Web Structure Mining,朱廷劭(Zhu, Tingshao)Ph.D,First generation of search engines,Early days: keyword based searches Keywords: “web mining” Retrieves documents with “web” and mining” Later on: cope with synonymy problem polysemy problem stop words Common characteristic: Only information on t

2、he pagesis used,Modern search engines,Link structure is very important Adding a link: deliberate act Harder to fool systems using in-links Link is a “quality mark” Modern search engines use link structure as important source of information,The Web Structure,A study was conducted on a graph inferred

3、from two large Altavista crawls. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., andWiener, J. (2000). Graph structure in the web. In Proc. WWW Conference. The study confirmed the hypothesis that the number of in-links and out-links to a page approximately

4、follows a Zipf distribution (a particular case of a power-law),Some statistics,Only between 25% of the pages there is a connecting path BUT, If there is a path: Directed: average length between two people only chain of length 6! Small World Graphs High number of relatively small cliques Small diamet

5、er Internet (SCC) is a small world graph,Web structure mining,PageRank (Ranking web pages used by Google) HITS (Topic distillation) Algorithm in Cyber-community,PageRank,Intuition: solve the recursive equation: “a page is important if important pages link to it.” In high-falutin terms: importance =

6、the principal eigenvector of the stochastic matrix of the Web. A few fixups needed.,1. Page-Rank Method,Introduced by Brin and Page (1998) Mine hyperlink structure of web to produce global importance ranking of every web page Used in Google Search Engine Web search result is returned in the rank ord

7、er Treats link as like academic citation Assumption: Highly linked pages are more important than pages with a few links A page has a high rank if the sum of the ranks of its back-links is high,Page Rank: Computation,Assume: R(u): Rank of a web page u Fu: Set of pages which u points to Bu: Set of pag

8、es that points to u Nu: Number of links from u C: Normalization factor E(u): Vector of web pages as source of rank Page Rank Computation:,Page Rank: Implementation,Stanford WebBase project Complete crawling and indexing system of with current repository 24 million web pages (old data) Store each URL

9、 as unique integer and each hyperlink as integer IDs Remove dangling links by iterative procedures Make initial assignment of the ranks Propagate page ranks in iterative manner Upon convergence, add the dangling links back and recompute the rankings,Page Rank: Results,Google utilizes a number of fac

10、tors to rank the search results: proximity, anchor text, page rank The benefits of Page Rank are the greatest for underspecified queries, example: Stanford University query using Page Rank lists the university home page the first,Page Rank: Advantages,Global ranking of all web pages regardless of th

11、eir content, based solely on their location in web graph structure Higher quality search results central, important, and authoritative web pages are given preference Help find representative pages to display for a cluster center Other applications: traffic estimation, back-link predictor, user navig

12、ation, personalized page rank Mining structure of web graph is very useful for various information retrieval,HITS Method,Hyperlink Induced Topic Search Kleinberg, 1998 A simple approach by finding hubs and authorities View web as a directed graph Assumption: if document A has hyperlink to document B

13、, then the author of document A thinks that document B contains valuable information,Main Ideas,Concerned with the identification of the most authoritative, or definitive, Web pages on a broad-topic Focused on only one topic Viewing the Web as a graph A purely link structure-based computation, ignor

14、ing the textual content,HITS: Hubs and Authority,Hub: web page links to a collection of prominent sites on a common topic Authority: Pages that link to a collection of authoritative pages on a broad topic; web page pointed to by hubs Mutual Reinforcing Relationship: a good authority is a page that i

15、s pointed to by many good hubs, while a good hub is a page that points to many good authorities,Hub-Authority Relations,HITS: Two Main Steps,A sampling component, which constructs a focused collection of several thousand web pages likely to be rich in relevant authorities A weight-propagation compon

16、ent, which determines numerical estimates of hub and authority weights by an iterative procedure As the result, pages with highest weights are returned as hubs and authorities for the research topic,HITS: Root Set and Base Set,Using query term to collect a root set (S) of pages from index-based sear

17、ch engine (AltaVista) Expand root set to base set (T) by including all pages linked to by pages in root set and all pages that link to a page in root set (up to a designated size cut-off) Typical base set contains roughly 1000-5000 pages,Step 1: Constructing Subgraph,1.1 Creating a root set (S) - Gi

18、ven a query string on a broad topic - Collect the t highest-ranked pages for the query from a text-based search engine 1.2 Expanding to a base set (T) - Add the page pointing to a page in root set - Add the page pointed to by a page in root set,Root Set and Base Set (Contd),Step 2: Computing Hubs an

19、d Authorities,2.1 Associating weights - Authority weight xp - Hub weight yp - Set all values to a uniform constant initially 2.2 Updating weights,Updating Authority Weight,xp=yq1+yq2+yq3,Example,Updating Hub Weight,Example,yp=xq1+xq2+xq3,Flowchart,Results,All x- and y-values converge rapidly so that

20、 termination of the iteration is guaranteed It can be proved in mathematical approach Pages with the highest x-values are viewed as the best authorities, while pages with the highest y-values are regarded as the best hubs,Implementation,Search engine: AltaVista Root set: 200 pages Base set: 1000-500

21、0 pages Converging speed: Very rapid, less than 20 times Running time: About 30 minutes,HITS: Advantages,Weight computation is an intrinsic feature from collection of linked pages Provides a densely linked community of related authorities and hubs Pure link-based computation once the root set has be

22、en assembled, with no further regard to query terms Provides surprisingly good search result for a wide range of queries,Drawbacks,Limit On Narrow Topics Not enough authoritative pages Frequently returns resources for a more general topic adding a few edges can potentially change scores considerably Topic Drifting - Appear when hubs discuss multiple topics,To improve precision: - Combining content with link

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论