Spark GraphX

Chapter 1. Spark GraphX overview

Spark GraphX is a distributed graph-processing framework built on Spark. A social network is a typical use case: the people in it are linked to one another by relationships, and those links form a graph. Steps in a graph pipeline include community detection and triangle counting. Graph computation is a clear trend, and frameworks are moving in this direction.

Open-source graph computation frameworks include Giraph. Frameworks built on the BSP model share one trait: they abstract a set of APIs to simplify programming. On their own they often cannot cover the whole pipeline and need support from Spark or Hadoop.

The traditional computation style described above is now available inside Spark GraphX as well. During processing the graph structure does not need to change.

A graph in Spark GraphX is a property graph: every vertex and every edge carries properties. For example, vertex 3 is rxin, a student (stu.), and vertex 5 is franklin. Each vertex has its own ID. In the edge table, the edge from 3 to 7 has the property Collaborators and the edge from 2 to 5 has the property Colleagues. More importantly, a property graph and its tables can be converted into each other.
Chapter 2. Spark GraphX design and implementation

A graph can be partitioned in two ways: by cutting along edges or by cutting along vertices. GraphX uses Vertex Cut, that is, it cuts along vertices. GraphX provides several predefined strategies for this partitioning. RandomVertexCut, for example, is implemented by computing a hash over the source vertex ID and the destination vertex ID. The result determines which edges each partition holds.
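The hash-based assignment described for RandomVertexCut can be illustrated with plain Scala collections. This is a minimal sketch under assumptions, not GraphX's implementation: the names EdgePartitioning, partitionOf and partitionsOfVertex are made up for illustration, and the hash is simply the hashCode of the (source, destination) pair.

```scala
object EdgePartitioning {
  type VertexId = Long
  final case class Edge(srcId: VertexId, dstId: VertexId)

  // Assign an edge to a partition from a hash over both endpoint IDs.
  def partitionOf(e: Edge, numParts: Int): Int = {
    val h = (e.srcId, e.dstId).hashCode()
    ((h % numParts) + numParts) % numParts // stays non-negative for negative hashes
  }

  // Group the edges by partition; every edge lands in exactly one partition.
  def partition(edges: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    edges.groupBy(partitionOf(_, numParts))

  // Partitions in which a vertex appears. Under a vertex cut this can be more than one.
  def partitionsOfVertex(v: VertexId, parts: Map[Int, Seq[Edge]]): Set[Int] =
    parts.collect {
      case (p, es) if es.exists(e => e.srcId == v || e.dstId == v) => p
    }.toSet
}
```

Each edge is stored once, while a vertex with many edges may be referenced from several partitions. That is the trade-off a vertex cut makes.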
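Spark GraphX benefits from an immutable internal structure and from pipelining, which bring significant advantages.

Chapter 3. Table operators and Graph operators

Chapter 4. vertices, edges and triplets

vertices, edges and triplets are the three most important concepts in Spark GraphX.

vertices hold the vertex properties and correspond to VertexRDD. From the source code, VertexRDD extends RDD[(VertexId, VD)]: the element type is a pair of VertexId and VD, where VD is the type of the vertex property.

edges correspond to EdgeRDD. An edge has three properties: the source vertex ID, the destination vertex ID and the edge property.

triplets have five properties: source vertex ID, source vertex property, edge property, destination vertex ID and destination vertex property. triplets are in effect vertices joined with edges, so a triplet carries the properties of the source vertex and the destination vertex together with the edge's own property. The element type of the triplets RDD is EdgeTriplet. In its source code, srcAttr holds the source vertex property and dstAttr holds the destination vertex property. Its parent class Edge provides srcId, dstId (the destination vertex ID) and attr (the edge property). EdgeTriplet therefore has five properties, whereas VertexRDD has only two: the Vertex ID and the vertex property.

Chapter 5. Building a graph

From the source code, building a graph first requires the vertices RDD, of type RDD[(VertexId, VD)]. The second argument is edges. The third is the default vertex property, which may be null. Note that the default vertex property is used only when it is needed (see the defaultUser example in Chapter 6). Two further parameters are edgeStorageLevel and vertexStorageLevel. They are given separately, which means that vertices and edges are not cached together. When a Graph is constructed, GraphImpl does the actual work.

Chapter 6. Hands-on: writing Graph code and working with vertices, edges and triplets

First start the Spark shell locally and import the contents of org.apache.spark.graphx. A further import is added to make later RDD operations convenient.

In the users RDD, each element contains an ID and a property, and the property consists of a name and an occupation. Each element of relationships contains the source vertex ID, the destination vertex ID and a property.

A very important object is defaultUser. It is used when relationships refers to a destination vertex that does not exist. For example, if in a relationship from 5 to 0 the vertex 0 does not exist, the edge points to defaultUser by default.

The concrete implementation is GraphImpl, through the apply method of GraphImpl.

Next we count the vertices whose occupation is postdoc. The result shows that only one User is a postdoc. Checking the property graph, it is the User at vertex 7.

A query on the edges that involves srcId returns only the edge from 5 to 3.

The element type of triplets here is EdgeTriplet[(String, String), String]. Why? The pair (String, String) is the vertex property type (name and occupation), and the final String is the edge property type.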
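The walkthrough in Chapter 6 can be mirrored with plain Scala collections. This is a minimal sketch, not the GraphX API. The user names jgonzal and istoica, the edge labels, and the defaultUser value are assumptions borrowed from the widely used GraphX example data, because the source listing is not preserved. Reading the edge query as "srcId greater than dstId" is an assumption as well.

```scala
object PropertyGraphSketch {
  type VertexId = Long
  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)
  final case class Triplet[VD, ED](srcId: VertexId, srcAttr: VD, attr: ED,
                                   dstId: VertexId, dstAttr: VD)

  // (name, occupation) per user. Entries not legible in the text are assumed.
  val users: Map[VertexId, (String, String)] = Map(
    3L -> ("rxin", "student"),
    7L -> ("jgonzal", "postdoc"),
    5L -> ("franklin", "prof"),
    2L -> ("istoica", "prof"))

  val relationships: Seq[Edge[String]] = Seq(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))

  // Stand-in for vertices that edges mention but users does not contain (assumed value).
  val defaultUser: (String, String) = ("John Doe", "Missing")

  // Join each edge with the properties of both endpoints: five fields per triplet.
  def triplets(edges: Seq[Edge[String]]): Seq[Triplet[(String, String), String]] =
    edges.map { e =>
      Triplet(e.srcId, users.getOrElse(e.srcId, defaultUser), e.attr,
              e.dstId, users.getOrElse(e.dstId, defaultUser))
    }

  def postdocCount: Int =
    users.count { case (_, (_, occupation)) => occupation == "postdoc" }

  // Assumed reading of the edge query: keep edges whose srcId exceeds dstId.
  def backwardEdges(edges: Seq[Edge[String]]): Seq[Edge[String]] =
    edges.filter(e => e.srcId > e.dstId)
}
```

Calling triplets on relationships extended with Edge(5L, 0L, "colleague") yields a triplet whose dstAttr is defaultUser, because vertex 0 has no entry in users.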
Chapter 7. Loading file data on a Spark cluster into a graph and operating on it in parallel

In most cases our data lives in files, for example log files. Spark GraphX provides a very convenient interface for reading data from a file and building a graph: GraphLoader.edgeListFile.

In the source code, pay attention to the third parameter of edgeListFile, canonicalOrientation, which concerns edge direction. If canonicalOrientation is true, edges are oriented so that the source vertex ID is smaller than the destination vertex ID. The edge property defaults to 1. The two vertex IDs on a line may be separated by spaces or by tabs. The source code also shows that vertexStorageLevel defaults to MEMORY_ONLY. These parameters can be adjusted as needed.

The data set is the Google web graph from the programming contest data, placed on hdfs. Each line contains a source web page ID on one side and a destination web page ID on the other. We then load it with Spark through hdfs.

At first we find that there is only one partition. After the load completes, web-Googel.txt is held in memory, and the storage view shows that the data has been cached under edges. Next we use minEdgePartitions = 4.

Next we check how many vertices the web-Googel.txt graph has.
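The edge-list format described above (two IDs per line, space or tab separated, edge property 1, optional canonical orientation) can be parsed as follows. This is a minimal sketch on in-memory lines, not GraphLoader itself. The object name EdgeListParsing is made up, and skipping lines that start with '#' is an assumption that matches the comment header of the web graph file.

```scala
object EdgeListParsing {
  type VertexId = Long

  // Parse "src<whitespace>dst" lines into (srcId, dstId, attr) with attr fixed to 1.
  // Blank lines and lines starting with '#' are skipped (assumed comment syntax).
  // With canonicalOrientation = true, edges are flipped so that srcId < dstId.
  def parse(lines: Seq[String],
            canonicalOrientation: Boolean = false): Seq[(VertexId, VertexId, Int)] =
    lines.map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#"))
      .map { line =>
        val fields = line.split("\\s+")
        require(fields.length >= 2, s"Invalid line: $line")
        val src = fields(0).toLong
        val dst = fields(1).toLong
        if (canonicalOrientation && src > dst) (dst, src, 1) else (src, dst, 1)
      }
}
```

For instance, parse(Seq("5 3"), canonicalOrientation = true) stores the edge as (3, 5, 1), while the default call keeps it as (5, 3, 1).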
We find that there are 5105039 edges in total, a scale that is ideal for experiments.

Chapter 8. Property Operators on a Spark cluster

This chapter focuses on the Property Operators.

Taking 10 vertex elements shows that every vertex has the property value 1, which is what the source code defines. We then change the value to 3 and look at the result.

Operating on edges works essentially the same way as above: we set every edge property to 2 and check the result.

Next we use mapTriplets in an example that recomputes the property value of every Edge, and check the result with tmp.triplets.take(10). Note that mapVertices, mapEdges and mapTriplets preserve the internal structural indices when they execute.

Chapter 9. Structural Operators on a Spark cluster

We then look at subgraph and verify its result. In the result, the source vertex ID and the destination vertex ID of every edge satisfy the edge condition. The number of vertices is still 875713, the same as before, so under this condition subgraph reduced only the edges.

Next we add a condition on the vertices, which reduces the data further. The vertex condition involves the ID value 1000000, and the result contains no vertex that violates it.
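The behaviour of the property and structural operators described above can be modelled on a tiny in-memory graph. This is a minimal sketch with plain collections, not GraphX's Graph class. The predicates used with it (srcId greater than dstId, ID at most 1000000) are hypothetical stand-ins for the conditions that are not legible in the source.

```scala
object SubgraphSketch {
  type VertexId = Long

  final case class Graph[VD](vertices: Map[VertexId, VD],
                             edges: Seq[(VertexId, VertexId)]) {
    // Property operator: only vertex values change, the edge structure is reused as is.
    def mapVertices[VD2](f: (VertexId, VD) => VD2): Graph[VD2] =
      Graph(vertices.map { case (id, attr) => id -> f(id, attr) }, edges)

    // Structural operator: keep vertices passing vpred, and edges that pass epred
    // and whose endpoints both survive the vertex filter.
    def subgraph(epred: (VertexId, VertexId) => Boolean = (_, _) => true,
                 vpred: (VertexId, VD) => Boolean = (_, _) => true): Graph[VD] = {
      val keptVertices = vertices.filter { case (id, attr) => vpred(id, attr) }
      val keptEdges = edges.filter { case (s, d) =>
        keptVertices.contains(s) && keptVertices.contains(d) && epred(s, d)
      }
      Graph(keptVertices, keptEdges)
    }
  }
}
```

An edge-only condition such as g.subgraph(epred = (s, d) => s > d) leaves the vertex set untouched, which matches the observation that the vertex count stays at 875713. A vertex condition removes the vertices and every edge attached to them.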
Chapter 10. Computing Degree on a Spark cluster

Degree is a concept from discrete mathematics. In Spark GraphX, Degree comes in several kinds, for example inDegrees. Degree is an operator of GraphOps, and the direction is specified with EdgeDirection.

Using tmp.take(10) we look at the first 10 results. We then use graph.inDegrees.reduce(max) to find the node with the largest in-degree: it is the vertex with ID 537039. If this vertex is a web page, it is a page of very high quality.

Chapter 11. Collecting Neighbors on a Spark cluster

There are two methods: collectNeighborIds and collectNeighbors. The source code of both shows that EdgeDirection.Both is not supported.

Chapter 12. Join Operators

Join Operators are very important operations. There are two of them: outerJoinVertices applies to all vertices, whereas joinVertices applies only to the vertices that have a matching entry. The results of both are inspected with .take(10) and .take(20).

Chapter 13. Map Reduce Triplets on a Spark cluster

Suppose we treat the vertex ID as the user's age and compute an aggregate for every user from the users connected to it.
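The map-then-reduce pattern of this chapter can be sketched with plain collections: each triplet emits an Iterator of (target vertex ID, message) pairs, and messages addressed to the same vertex are merged pairwise. This is a minimal sketch, not the GraphX operator. The concrete aggregate (counting followers who are older than the user, with the vertex ID standing in for age) is an assumption, because the source sentence is cut off.

```scala
object MapReduceTripletsSketch {
  type VertexId = Long
  final case class Triplet[VD](srcId: VertexId, srcAttr: VD, dstId: VertexId, dstAttr: VD)

  // Map phase: each triplet returns an Iterator of (targetVertexId, message).
  // Reduce phase: messages for the same vertex are merged with reduceFunc.
  def mapReduceTriplets[VD, A](triplets: Seq[Triplet[VD]],
                               mapFunc: Triplet[VD] => Iterator[(VertexId, A)],
                               reduceFunc: (A, A) => A): Map[VertexId, A] =
    triplets.flatMap(t => mapFunc(t).toList)
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceFunc) }

  // Hypothetical use: the vertex ID doubles as the age; count older followers per user.
  def olderFollowerCounts(follows: Seq[(VertexId, VertexId)]): Map[VertexId, Int] = {
    val triplets = follows.map { case (s, d) => Triplet(s, s, d, d) }
    mapReduceTriplets[Long, Int](
      triplets,
      t => if (t.srcAttr > t.dstAttr) Iterator((t.dstId, 1)) else Iterator.empty,
      (a, b) => a + b)
  }
}
```

Vertices that receive no message do not appear in the result, which is why the map function may return an empty Iterator.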
One point to note in the source listing is that the message-sending function returns an Iterator object.

Chapter 14. Pregel API

In GraphX a Graph is not cached automatically. It has to be cached by hand. In every iteration, for the sake of speed, the updated graph must be cached manually, and after each iteration the data that is no longer used must be released while the data still in use is kept. Iterative computations of this kind are very well suited to Pregel.

Next we use the Pregel API for an example that resembles the familiar route navigation in a car. First we define sourceId.

Chapter 15. A widely used Spark GraphX application

The code is as shown in the source. We can use the example that ships with Spark. Another way to use it: we again work with the .txt data set. Note that one argument has to be passed in when calling it.
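The Pregel example from Chapter 14 starts from a sourceId and is compared to route navigation, so the sketch below assumes it computes single-source shortest paths. It imitates the three Pregel roles (send message, merge messages, vertex program) with plain collections. It is not the GraphX Pregel operator, and the object and parameter names are made up. Edge weights are assumed to be non-negative.

```scala
object PregelStyleSssp {
  type VertexId = Long

  // Repeat "send candidate distances along edges, keep the minimum per vertex"
  // until no vertex improves. Edge weights must be non-negative.
  def shortestPaths(edges: Seq[(VertexId, VertexId, Double)],
                    sourceId: VertexId): Map[VertexId, Double] = {
    val vertices = edges.flatMap { case (s, d, _) => Seq(s, d) }.toSet + sourceId
    var dist: Map[VertexId, Double] =
      vertices.map(v => v -> (if (v == sourceId) 0.0 else Double.PositiveInfinity)).toMap
    var active = true
    while (active) {
      // Send: an edge emits a message only if it offers a shorter path.
      val messages = edges.flatMap { case (s, d, w) =>
        if (dist(s) + w < dist(d)) List((d, dist(s) + w)) else Nil
      }
      // Merge: keep the smallest candidate per destination vertex.
      val merged = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).min }
      // Vertex program: adopt the improved distance.
      dist = dist ++ merged
      active = merged.nonEmpty
    }
    dist
  }
}
```

Unreachable vertices keep the distance Double.PositiveInfinity. The loop stops as soon as a round produces no message, which mirrors how a Pregel computation ends when no vertex is active.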
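Chapter 16. Triangle counting

When the people a user follows also follow one another, triangles appear. The more triangles there are, the more stable the community and the closer its connections. If one person merely follows many other people, no such structure forms.

The source code shows very clearly that to run the Triangle computation, sourceId must be smaller than destId. So when we build the Graph we have to specify this in GraphLoader.edgeListFile.

Chapter 17. Taobao's large-scale use of Spark GraphX

This chapter covers data-mining computations on Taobao data, such as the degree distribution. For second-degree neighbours, messages are passed in two rounds: every vertex sends its ID as a message to its neighbouring vertices, and the neighbours forward it once more. The IDs that arrive at each vertex are then grouped and aggregated in parallel, which gives the second-degree neighbours of every vertex.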
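The two-round message passing described for second-degree neighbours can be sketched with plain collections. This is a minimal sketch under assumptions, not Taobao's implementation: the graph is treated as undirected, and a vertex and its direct neighbours are excluded from its own second-degree set. All names are made up.

```scala
object SecondDegreeNeighbors {
  type VertexId = Long

  // Undirected adjacency sets built from an edge list (assumption: direction is ignored).
  def adjacency(edges: Seq[(VertexId, VertexId)]): Map[VertexId, Set[VertexId]] =
    edges.flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // One round: every vertex receives the IDs currently held by each of its neighbours.
  def propagate(held: Map[VertexId, Set[VertexId]],
                adj: Map[VertexId, Set[VertexId]]): Map[VertexId, Set[VertexId]] =
    adj.map { case (v, nbrs) =>
      v -> nbrs.flatMap(n => held.getOrElse(n, Set.empty[VertexId]))
    }

  def secondDegreeCounts(edges: Seq[(VertexId, VertexId)]): Map[VertexId, Int] = {
    val adj = adjacency(edges)
    val own = adj.keys.map(v => v -> Set(v)).toMap
    val round1 = propagate(own, adj)    // each vertex now holds its neighbours' IDs
    val round2 = propagate(round1, adj) // each vertex now holds neighbours of neighbours
    // The set removes duplicates; drop the vertex itself and its direct neighbours.
    round2.map { case (v, ids) => v -> ((ids - v) -- adj(v)).size }
  }
}
```

On a very large graph the second round is where the message volume grows, so high-degree vertices are usually filtered out before it runs.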
Chapter 18. Mining Ecommerce Graph Data with Spark at Taobao

Notice: This article is a guest post from our friends at Taobao.

Taobao operates one of the world's largest e-commerce platforms. We collect hundreds of petabytes of data on this platform and use Spark to analyze it. Taobao probably runs some of the largest Spark jobs in the world. For example, some Spark jobs run for weeks to perform feature extraction on petabytes of image data. In this post, we share our experience with Spark and GraphX from prototype to production at the Taobao Data Mining team.

Every day, hundreds of millions of users and merchants interact on the Taobao marketplace. These interactions can be expressed as complicated, large scale graphs. Mining this data requires a distributed data processing engine that can support fast interactive queries as well as sophisticated algorithms.

Spark and GraphX embed a standard set of graph mining algorithms, including PageRank, triangle counting, connected components, and shortest path. The implementation of these algorithms focuses on reusability. Users can implement variants of these algorithms in order to exploit optimization opportunities for specific workloads. In our experience, the best way to learn GraphX is to read and understand the source code of these algorithms.

Taobao started with GraphX in Spark 0.9 and went into production in May 2014, around the time Spark 1.0 was released.
One thing to note is that GraphX is still evolving quickly. Although the user-facing APIs are relatively stable, the internals have seen fairly large refactoring and improvements from 0.8 to 1.0. Based on our experience, each minor version upgrade provided 10 to 20% performance improvements, even without modifying our application code.

Graph Inspection Platform

Graph-based structures model the many relationships between our users and the items in our store. Our business and product teams constantly need to make decisions based on the value and health of each relationship. Before Spark, they used intuition to estimate such properties, resulting in decisions which were not a good fit with reality. To solve this problem, we developed a platform
to quantify all these metrics in order to provide evidence and insights for product decisions.

This platform requires constantly re-iterating the set of metrics it provides to users, depending on product demand. The interactive nature of both Spark and GraphX proves very valuable in building this platform. Some of the metrics this platform measures are:

Degree Distribution: Degree distribution measures the distribution of vertex degrees (e.g. how many users have 50 friends). It also provides valuable information on the number of high degree vertices (so-called super vertices). Often our product infrastructure needs to handle super vertices in a special manner (because they have a high impact on propagation algorithms), and thus it is crucial to understand their distribution among our data. GraphX's VertexRDD provides built-in support for computing degrees.

Second Degree Neighbors: Modeling social relationships often requires measuring the second-degree neighbor distribution. For example, in a messaging platform we developed, the number of retweets correlates with the number of second degree neighbors (e.g. number of friends of friends).
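The degree distribution metric described above can be illustrated with plain Scala collections. This is a minimal sketch, not the platform's code: the object name, the function names and the threshold used to flag super vertices are all assumptions.

```scala
object DegreeDistribution {
  type VertexId = Long

  // Degree of each vertex, counting every edge endpoint (in- and out-edges together).
  def degrees(edges: Seq[(VertexId, VertexId)]): Map[VertexId, Int] =
    edges.flatMap { case (s, d) => Seq(s, d) }
      .groupBy(identity)
      .map { case (v, occurrences) => v -> occurrences.size }

  // Histogram: degree -> number of vertices that have that degree.
  def distribution(deg: Map[VertexId, Int]): Map[Int, Int] =
    deg.values.toSeq.groupBy(identity).map { case (d, vs) => d -> vs.size }

  // Vertices whose degree exceeds a chosen threshold (candidate super vertices).
  def superVertices(deg: Map[VertexId, Int], threshold: Int): Set[VertexId] =
    deg.collect { case (v, d) if d > threshold => v }.toSet
}
```

The histogram answers questions such as "how many users have 50 friends", and the threshold filter gives the set of vertices that later stages may want to treat separately.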
While GraphX does not yet provide built-in support for counting second degree neighbors, we implemented it using two rounds of propagation: the first round propagates each vertex's ID to its neighbors, and the second round propagates all IDs from neighbors to second degree neighbors. After the two rounds of propagation, each vertex calculates the number of second degree neighbors using a hash set.

One thing to note in this calculation is that we use the aforementioned degree distribution to remove super vertices from the second degree neighbor calculation. Otherwise, these super vertices would create too many messages, leading to high computation skew and high memory usage.

Connected Components: Connected components refer to sets of subgraphs that are "connected", i.e. there exists a path connecting any pair of vertices in the subgraph. Connected components are very useful for dividing a large graph into multiple, smaller graphs, and then running algorithms that are computationally too expensive to run on the large graph. This algorithm can also be adapted to discover tightly connected networks.

We are developing more metrics using both built-in functions provided by Spark and GraphX, as well as new ones implemented internally. This nurtures a new culture in which our products are no longer based on instinct and intuition, but rather on metrics mined from data.
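The connected components metric described above can be sketched by propagating the smallest vertex ID through each component. This is a minimal in-memory sketch, not the GraphX implementation. The graph is treated as undirected, and all names are made up.

```scala
object ConnectedComponentsSketch {
  type VertexId = Long

  // Label every vertex with the smallest vertex ID found in its component.
  def connectedComponents(vertices: Set[VertexId],
                          edges: Seq[(VertexId, VertexId)]): Map[VertexId, VertexId] = {
    val all = vertices ++ edges.flatMap { case (a, b) => Seq(a, b) }
    var labels: Map[VertexId, VertexId] = all.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        // Both endpoints adopt the smaller of their two current labels.
        val m = math.min(labels(a), labels(b))
        if (labels(a) != m || labels(b) != m) {
          labels = labels + (a -> m) + (b -> m)
          changed = true
        }
      }
    }
    labels
  }

  // Split the vertex set into one group per component.
  def components(labels: Map[VertexId, VertexId]): Map[VertexId, Set[VertexId]] =
    labels.groupBy(_._2).map { case (label, members) => label -> members.keySet }
}
```

Grouping by label is what makes it possible to run an expensive algorithm on each small component separately instead of on the whole graph.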
Multi-graph Merging

The Graph Inspection Platform provides us with different properties for modeling relationships. Each relationship structure has its own strengths and weaknesses. For example, some relationship structures provide more information in connected components, while other structures work better for interactions. We often make decisions based on several views of the same graph. Based on GraphX, we developed a multi-graph merging framework that creates "intersections" of multiple graphs.

The attached figure illustrates the algorithm to merge graph A and graph B to create graph C: edges are created in graph C if any of its vertices exist in graph A or graph B. The merging is implemented with an operator provided by GraphX. In addition to naively merging two graphs, the framework can also assign different weights to the input graphs. In practice, analysis pipelines often merge multiple graphs in a ...