已阅读5页,还剩32页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
CS345DataMining,LinkAnalysisAlgorithmsPageRank,AnandRajaraman,JeffreyD.Ullman,LinkAnalysisAlgorithms,PageRankHubsandAuthoritiesTopic-SpecificPageRankSpamDetectionAlgorithmsOtherinterestingtopicswewontcoverDetectingduplicatesandmirrorsMiningforcommunities,Rankingwebpages,Webpagesarenotequally“important”Ihas23,400inlinkswww.joe-has1inlinkAreallinlinksequal?Recursivequestion!,Simplerecursiveformulation,EachlinksvoteisproportionaltotheimportanceofitssourcepageIfpagePwithimportancexhasnoutlinks,eachlinkgetsx/nvotesPagePsownimportanceisthesumofthevotesonitsinlinks,Simple“flow”model,Thewebin1839,y,a,m,y/2,y/2,a/2,a/2,m,y=y/2+a/2a=y/2+mm=a/2,Solvingtheflowequations,3equations,3unknowns,noconstantsNouniquesolutionAllsolutionsequivalentmoduloscalefactorAdditionalconstraintforcesuniquenessy+a+m=1y=2/5,a=2/5,m=1/5Gaussianeliminationmethodworksforsmallexamples,butweneedabettermethodforlargegraphs,Matrixformulation,MatrixMhasonerowandonecolumnforeachwebpageSupposepagejhasnoutlinksIfj!i,thenMij=1/nElseMij=0MisacolumnstochasticmatrixColumnssumto1SupposerisavectorwithoneentryperwebpageriistheimportancescoreofpageiCallittherankvector|r|=1,Example,Supposepagejlinksto3pages,includingi,r,Eigenvectorformulation,Theflowequationscanbewrittenr=MrSotherankvectorisaneigenvectorofthestochasticwebmatrixInfact,itsfirstorprincipaleigenvector,withcorrespondingeigenvalue1,Example,y1/21/20a1/201m01/20,yam,y=y/2+a/2a=y/2+mm=a/2,PowerIterationmethod,Simpleiterativescheme(akarelaxation)SupposethereareNwebpagesInitialize:r0=1/N,.,1/NTIterate:rk+1=MrkStopwhen|rk+1-rk|1|x|1=1iN|xi|istheL1normCanuseanyothervectornorme.g.,Euclidean,PowerIterationExample,y1/21/20a1/201m01/20,yam,ya=m,1/31/31/3,1/31/21/6,5/121/31/4,3/811/241/6,2/52/51/5,.,RandomWalkInterpretation,ImaginearandomwebsurferAtanytimet,surferisonsomepagePAttimet+1,thesurferfollowsanoutlinkfromPuniformlyatrandomEndsuponsomepageQlinkedfromPProcessrepeatsindefinitelyLetp(t)beavectorwhoseithcomponentistheprobabilitythatthesurferisatpageiattimetp(t)isaprobabilitydistributiononpages,Thestationarydistribution,Whereisthesurferattimet+1?Followsalinkuniformlyatrandomp(t+1)=Mp(t)Supposetherandomwalkreachesastatesuchthatp(t+1)=Mp(t)=p(t)Thenp(t)iscalledastationarydistributionfortherandomwalkOurrankvectorrsatisfiesr=MrSoitisastationarydistributionfortherandomsurfer,ExistenceandUniqueness,Acentralresultfromthetheoryofrandomwalks(akaMarkovprocesses):Forgraphsthatsatisfycertainconditions,thestationarydistributionisuniqueandeventuallywillbereachednomatterwhattheinitialprobabilitydistributionattimet=0.,Spidertraps,AgroupofpagesisaspidertrapiftherearenolinksfromwithinthegrouptooutsidethegroupRandomsurfergetstrappedSpidertrapsviolatetheconditionsneededfortherandomwalktheorem,Microsoftbecomesaspidertrap,Yahoo,Msoft,Amazon,y1/21/20a1/200m01/21,yam,ya=m,111,11/23/2,3/41/27/4,5/83/82,003,.,Randomteleports,TheGooglesolutionforspidertrapsAteachtimestep,therandomsurferhastwooptions:Withprobability,followalinkatrandomWithprobability1-,jumptosomepageuniformlyatrandomCommonvaluesforareintherange0.8to0.9Surferwillteleportoutofspidertrapwithinafewtimesteps,Randomteleports(=0.8),Yahoo,Msoft,Amazon,1/2,1/2,0.8*1/2,0.8*1/2,0.2*1/3,0.2*1/3,0.2*1/3,1/21/201/20001/21,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/1513/15,0.8,+0.2,Randomteleports(=0.8),Yahoo,Msoft,Amazon,1/21/201/20001/21,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/1513/15,0.8,+0.2,ya=m,111,1.000.601.40,0.840.601.56,0.7760.5361.688,7/115/1121/11,.,Matrixformulation,SupposethereareNpagesConsiderapagej,withsetofoutlinksO(j)WehaveMij=1/|O(j)|whenj!iandMij=0otherwiseTherandomteleportisequivalenttoaddingateleportlinkfromjtoeveryotherpagewithprobability(1-)/Nreducingtheprobabilityoffollowingeachoutlinkfrom1/|O(j)|to/|O(j)|Equivalent:taxeachpageafraction(1-)ofitsscoreandredistributeevenly,PageRank,ConstructtheNNmatrixAasfollowsAij=Mij+(1-)/NVerifythatAisastochasticmatrixThepagerankvectorristheprincipaleigenvectorofthismatrixsatisfyingr=ArEquivalently,risthestationarydistributionoftherandomwalkwithteleports,Deadends,Pageswithnooutlinksare“deadends”fortherandomsurferNowheretogoonnextstep,Microsoftbecomesadeadend,Yahoo,Msoft,Amazon,ya=m,111,10.60.6,0.7870.5470.387,0.6480.4300.333,000,.,1/21/201/20001/20,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/151/15,0.8,+0.2,Dealingwithdead-ends,TeleportFollowrandomteleportlinkswithprobability1.0fromdead-endsAdjustmatrixaccordinglyPruneandpropagatePreprocessthegraphtoeliminatedead-endsMightrequiremultiplepassesComputepagerankonreducedgraphApproximatevaluesfordeadendsbypropagatingvaluesfromreducedgraph,Computingpagerank,Keystepismatrix-vectormultiplicationrnew=AroldEasyifwehaveenoughmainmemorytoholdA,rold,rnewSayN=1billionpagesWeneed4bytesforeachentry(say)2billionentriesforvectors,approx8GBMatrixAhasN2entries1018isalargenumber!,Rearrangingtheequation,r=Ar,whereAij=Mij+(1-)/Nri=1jNAijrjri=1jNMij+(1-)/Nrj=1jNMijrj+(1-)/N1jNrj=1jNMijrj+(1-)/N,since|r|=1r=Mr+(1-)/NNwherexNisanN-vectorwithallentriesx,Sparsematrixformulation,Wecanrearrangethepagerankequation:r=Mr+(1-)/NN(1-)/NNisanN-vectorwithallentries(1-)/NMisasparsematrix!10linkspernode,approx10NentriesSoineachiteration,weneedto:Computernew=MroldAddaconstantvalue(1-)/Ntoeachentryinrnew,Sparsematrixencoding,EncodesparsematrixusingonlynonzeroentriesSpaceproportionalroughlytonumberoflinkssay10N,or4*10*1billion=40GBstillwontfitinmemory,butwillfitondisk,sourcenode,degree,destinationnodes,BasicAlgorithm,AssumewehaveenoughRAMtofitrnew,plussomeworkingmemoryStoreroldandmatrixMondiskBasicAlgorithm:Initialize:rold=1/NNIterate:Update:PerformasequentialscanofMandroldtoupdaternewWriteoutrnewtodiskasroldfornextiterationEveryfewiterations,compute|rnew-rold|andstopifitisbelowthresholdNeedtoreadinbothvectorsintomemory,Updatestep,src,degree,destination,0,1,2,3,4,5,6,0,1,2,3,4,5,6,rnew,rold,Initializeallentriesofrnewto(1-)/NForeachpagep(out-degreen):Readintomemory:p,n,dest1,destn,rold(p)forj=1.n:rnew(destj)+=*rold(p)/n,Analysis,Ineachiteration,wehaveto:ReadroldandMWriternewbacktodiskIOCost=2|r|+|M|Whatifwehadenoughmemorytofitbothrnewandrold?Whatifwecouldnotevenfitrnewinmemory
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 大学生的规划分析
- 2025年统计师之中级统计师工作实务题库附答案(典型题)
- 微生物专题知识讲座
- 2025汽车美容店承包合同范本
- 2025版设计合同范本:室内设计合同与建筑工程设计合同
- 2025建筑公司合同管理系统
- 2025物流公司承包合同模板
- 2025无固定期限合同并非等同于铁饭碗:打破传统就业观念的革新之路
- 2025企业合作合同赠予协议范本
- 2025年签订中外合作开发合同(有限责任)
- 葡萄生产教学课件
- 现代家政服务与管理专业教学标准(中等职业教育)2025修订
- 铁道交通运营管理专业职业生涯规划书5200字数
- (2025.06.12)领导干部任前应知应会党内法规和法律知识考试题库(2025年度)
- 2024陆上风电项目造价指标
- 监事会换届工作报告
- 2025年里牛蒡项目市场调查研究报告
- 第10课 相亲相爱一家人 课件-2024-2025学年道德与法治一年级下册统编版
- 工业4.0与智能制造
- 石英石台面供货合同协议
- 雀巢财务管理制度
评论
0/150
提交评论