
Pyramid: the distributed infrastructure
阳振坤, 2009-06-17

Why distributed system?
- Massive data size
  - Trillions of web pages indexed by Baidu/Google
  - Thousands of terabytes of logs
- Challenges from machine failure
  - Practically no fault-free machines
- Inflexibility of manually partitioning data among machines

Why NOT Hadoop?
- Baidu goes toe to toe with Google
- Huge gaps between Hadoop and Google on availability (fault tolerance), performance, scalability, and features
  - Single point of failure
  - Real-time fault tolerance, data/task migration
  - Concurrent appending
  - Automatic rebalance
  - Maintenance
  - Copy-on-write + instant snapshot/checkpoint
  - B-tree + compressed file namespace

Why will Pyramid beat Google?
- Standing on the shoulders of Google etc.
- No legacy burdens
- Better infrastructure
- Better design to fit modern machines and metadata
- Higher localization rate

What can Pyramid do for you?
- Scaling up to thousands of machines
  - 30 GB/s read and 3 GB/s write performance in a 100-machine cluster
- Automatic data partition among machines
- Transparent machine failure handling
  - A failed machine is automatically detected and removed
  - Data and tasks are migrated to other machines automatically and quickly
  - Machine failure is transparent to apps
- Automatic parallelization
  - No parallel or distributed system background is required
- Dynamic load balance among machines
- Low operation and maintenance cost

Pyramid's infrastructure
- DCS: Distributed Computing System
- DTS: Distributed Table System
- DFS: Distributed File System
- To simplify design and implementation
- For stage results
[Diagram: the three layers stacked as DCS / DTS / DFS]

DFS's goal
- Capable of storing thousands of TB of data
- High and sustainable aggregate I/O bandwidth
  - Hundreds of GB/s read performance
  - Tens of GB/s write performance
- Uninterruptible service
  - Built-in fault tolerance and high availability
- Automatic machine management
  - Plug and play

DFS's infrastructure
- Single master + many workers
- Files: divided into fixed-size chunks (256 MB) stored by workers
- Chunks: replicated on multiple workers
[Diagram: the DFS master and its workers, each worker holding several numbered chunk replicas]

Infrastructure of DTS
- Single DTS master + many workers
- Sorted and partitioned by row key
- Each partition is about 256 MB
- Partitions can be split or merged due to insertion or deletion
- B+ tree hierarchy

DTS's goal
- Up to trillions of rows and billions of columns
- Sorted and partitioned by row key
- Distributed among hundreds or thousands of machines
- Each machine serves a few to a few thousand partitions
- Queries are done by the corresponding machines

DCS's goal
- A restricted programming model (Map/Reduce)
  - Widely adaptable
  - Automatically parallel and transparently fault-tolerant
- A distributed execution framework
  - Partitioning a job into tasks
  - Scheduling tasks among machines
  - Dealing with machine failure

Benefits to engineers
- Automatic parallelization
- Transparent fault tolerance
- Both native C++ and script
- No experience with parallel or distributed systems required

Map & reduce prototype (sketched in code below)
- Map: (k1, v1) → list(k2, v2)
  - k1, v1 and k2, v2 are all binary strings
  - Usually (k2, v2) are drawn from a different domain than the input
  - Written by the user
- Reduce: (k2, list(v2)) → list(v)
  - Both k2 and v2 are strings
  - Usually v is drawn from the same domain as the input
  - Written by the user
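A minimal C++ sketch of what user code following this prototype could look like. The slides do not show the actual DCS SDK, so the names Bytes, Emitter, Mapper and Reducer below are illustrative assumptions, not the real interface.

```cpp
// Illustrative sketch only: class and method names are assumptions,
// not the actual DCS SDK.
#include <string>
#include <vector>

// All keys and values are opaque binary strings.
using Bytes = std::string;

// Receives (k2, v2) pairs emitted by map(), or values emitted by reduce().
struct Emitter {
    virtual void emit(const Bytes& key, const Bytes& value) = 0;
    virtual ~Emitter() = default;
};

// Map: (k1, v1) -> list(k2, v2), written by the user.
struct Mapper {
    virtual void map(const Bytes& k1, const Bytes& v1, Emitter& out) = 0;
    virtual ~Mapper() = default;
};

// Reduce: (k2, list(v2)) -> list(v), written by the user.
struct Reducer {
    virtual void reduce(const Bytes& k2, const std::vector<Bytes>& values,
                        Emitter& out) = 0;
    virtual ~Reducer() = default;
};
```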

How MapReduce works
- MapReduce transforms and rearranges data
[Diagram: each of the M map tasks reads its own input data and partitions its output into buckets 1..R]

MapReduce diagram
- Read data
- Map: extract something from each record
- Shuffle and sort
- Reduce: aggregate, summarize, filter, or transform
- Write the results

Scheduling & load balancing
- Job splitting
  - M map tasks
  - R reduce tasks
- Task assignment
  - The master obtains a machine pool from the underlying system
  - A task is assigned to a worker whenever it becomes idle
- Data shuffling begins before all map tasks are done
- Reduce tasks will not start until all map tasks are done

Dealing with stragglers
- A few stragglers are common
  - Bad disks
  - Resources consumed by other jobs
- Backup tasks
  - Starting point
  - Result arbitration
  - Commutative and associative property

Task control policy
- Enable/disable backup tasks
- Limit the number of map tasks per machine during reduce time

Countering against failed workers
- Done/undergoing map tasks on a failed worker
- Done/undergoing reduce tasks on a failed worker

Map worker of word counting (both word-count workers are sketched below)
- Retrieve input data from a DFS file
- Interpret the input data (text)
- Call map() to retrieve each word and emit(word, "1")
- Deliver each emitted pair to the corresponding reduce bucket: md5(key) mod R, where R is the number of reduce tasks
- Sort each bucket, combine duplicate <word, "1"> pairs into <word, count>, and emit
- Keep the final results on local disk as temporary files

Reduce worker of word counting
- Retrieve the corresponding bucket (a set of <word, count> pairs) from each map worker through the network (shuffle)
- Sort the whole retrieved data set by key
  - An external sort is often invoked
- Call reduce() to sum up the counts under the same word and emit <word, total>
- Process the emitted content by output_format_t
- Deliver the final results to output_target_t (a DFS file)
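Under the same assumed Mapper/Reducer interface sketched earlier, the user-level word-count functions might look like the following; the tokenization and the string/integer conversions are illustrative choices.

```cpp
// Word-count user functions, continuing the illustrative Mapper/Reducer
// sketch above (not the real DCS SDK).
#include <cctype>
#include <cstdlib>
#include <string>

struct WordCountMapper : Mapper {
    // k1: e.g. the byte offset of the line; v1: one line of text.
    void map(const Bytes& /*k1*/, const Bytes& v1, Emitter& out) override {
        std::string word;
        for (char c : v1) {
            if (std::isalnum(static_cast<unsigned char>(c))) {
                word.push_back(c);
            } else if (!word.empty()) {
                out.emit(word, "1");   // emit(word, "1") for every word
                word.clear();
            }
        }
        if (!word.empty()) out.emit(word, "1");
    }
};

struct WordCountReducer : Reducer {
    // k2: a word; values: all partial counts shuffled to this reduce task.
    void reduce(const Bytes& k2, const std::vector<Bytes>& values,
                Emitter& out) override {
        long total = 0;
        for (const Bytes& v : values) total += std::atol(v.c_str());
        out.emit(k2, std::to_string(total));   // emit <word, total>
    }
};
```

Because addition is commutative and associative, the same summing logic can also serve as the map-side combiner that collapses duplicate <word, "1"> pairs before they are written to local disk.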

Map worker of distributed sort
- Retrieve input data through input_source_t: e.g., a DFS file
- Interpret the input data according to input_format_t: e.g., text
- Call the user-defined map() function
  - map() retrieves every <key, value> pair and emits it
- Collect the pairs emitted by map()
- Deliver these pairs to the corresponding reduce buckets according to the partitioner_t (sketched below)
  - e.g., R-1 separator key strings, where R is the number of reduce tasks
- Keep the final results on local disk as temporary files
[Diagram: separator strings str1, str2, ..., str(R-1) divide the key space into buckets 1..R]

Reduce worker of distributed sort
- Retrieve the corresponding bucket (a set of <key, value> pairs) from each map worker through the network (shuffle)
- Sort the whole retrieved data set by key
  - An external sort is often invoked
- Call the user-defined reduce() function: emit(key, value)
- Collect the results emitted by reduce()
- Process the above results according to output_format_t
- Deliver the final results to output_target_t
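A minimal sketch of how a separator-based partitioner_t could map a key to one of the R reduce buckets, assuming the R-1 separator strings are kept sorted; the class name RangePartitioner and its interface are assumptions.

```cpp
// Illustrative range partitioner: R-1 sorted separator strings split the key
// space into R buckets, so the concatenated reduce outputs are globally sorted.
#include <algorithm>
#include <string>
#include <vector>

class RangePartitioner {
public:
    // separators must be sorted ascending and contain exactly R-1 strings.
    explicit RangePartitioner(std::vector<std::string> separators)
        : separators_(std::move(separators)) {}

    // Returns the bucket index in [0, R), where R = separators_.size() + 1.
    int bucket(const std::string& key) const {
        // Index of the first separator strictly greater than the key.
        auto it = std::upper_bound(separators_.begin(), separators_.end(), key);
        return static_cast<int>(it - separators_.begin());
    }

private:
    std::vector<std::string> separators_;
};
```

Keys below str1 fall into bucket 0 and keys at or above str(R-1) fall into bucket R-1, so writing the sorted output of reduce tasks 0..R-1 in order yields the fully sorted result.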

How map works
- Map collects its output by bucket
- Multi-way merging (sketched below)
- Kept as local temporary files
[Diagram: each map task merges its per-bucket runs into local temporary files of about 512 MB]

How reduce works
- Reduce shuffles data from each map worker
  - Each shuffle implies a disk read on a map machine
- Reduce sorts its input data
  - Multi-way merging
- Often assign the reduce worker more memory
[Diagram: reduce task 3 fetches bucket 3 (about 1 GB) from every map worker 1..M and merges the keys]
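A minimal sketch of the multi-way merging step, assuming each sorted run is already available as an in-memory sequence; a real worker would stream runs from disk or the network, but the priority-queue pattern is the same.

```cpp
// Illustrative k-way merge of sorted runs, as used when a map worker merges
// its per-bucket output runs or a reduce worker merges shuffled data.
#include <functional>
#include <queue>
#include <string>
#include <tuple>
#include <vector>

using Run = std::vector<std::string>;  // one sorted run of keys

std::vector<std::string> merge_runs(const std::vector<Run>& runs) {
    // Heap entry: (current key, run index, position within that run).
    using Entry = std::tuple<std::string, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);

    std::vector<std::string> merged;
    while (!heap.empty()) {
        auto [key, r, i] = heap.top();  // smallest remaining key
        heap.pop();
        merged.push_back(key);
        if (i + 1 < runs[r].size()) heap.emplace(runs[r][i + 1], r, i + 1);
    }
    return merged;
}
```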

Batch lookup from a table
- Assumptions
  - A large table (e.g., tens of GBs or more)
  - A relatively large volume of source keys (a few GBs or more)
- Target
  - For every key in the source, retrieve its value from the table
- Can MapReduce offer a solution to it?

One solution (sketched below)
- Input sources: the table and the source keys
- The map() function
  - For the table input, emit(key+0, value)
  - For the source-keys input, emit(key+1, null)
- The reduce() function
  - For <key+0, value>, store the key and value in member variables of the class
  - For <key+1, null>, compare the current key with the stored key; emit <key, value> if they are equal, emit <key, null> otherwise
- How to make these two items land in the same bucket?
  - Partition(key+x) = md5(key) mod R
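Continuing the assumed Mapper/Reducer sketch from above, this tagged-key lookup could look like the following; the one-byte tags '0'/'1', the empty-string encoding of null, and the use of std::hash in place of md5 are illustrative choices.

```cpp
// Illustrative batch-lookup ("tagged key") join, continuing the assumed
// Mapper/Reducer/Emitter/Bytes sketch above.
#include <functional>
#include <string>
#include <vector>

struct LookupMapper : Mapper {
    bool from_table = false;  // set per input source: table vs. source keys

    void map(const Bytes& key, const Bytes& value, Emitter& out) override {
        if (from_table)
            out.emit(key + "0", value);   // table record:   <key+0, value>
        else
            out.emit(key + "1", "");      // lookup request: <key+1, null>
    }
};

struct LookupReducer : Reducer {
    Bytes stored_key, stored_value;       // member variables carry state
                                          // across consecutive reduce() calls
    void reduce(const Bytes& tagged, const std::vector<Bytes>& values,
                Emitter& out) override {
        Bytes key = tagged.substr(0, tagged.size() - 1);
        char tag = tagged.back();
        if (tag == '0') {                 // the table side sorts first
            stored_key = key;
            stored_value = values.front();
        } else {                          // the source-key side: do the lookup
            for (size_t i = 0; i < values.size(); ++i)
                out.emit(key, key == stored_key ? stored_value : Bytes());
        }
    }
};

// Partitioner: hash only the untagged key so <key+0, ...> and <key+1, ...>
// land in the same bucket. The slides use md5(key) mod R; std::hash stands in.
int lookup_partition(const Bytes& tagged, int R) {
    Bytes key = tagged.substr(0, tagged.size() - 1);  // drop the 1-byte tag
    return static_cast<int>(std::hash<Bytes>{}(key) % R);
}
```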

A few more MapReduce examples
- Sobar log analysis (2008.11)
  - First DCS app
  - Source data retrieved by another DCS app first
- Linkuniq (2008.12)
  - 1 TB+ of input for a single reduce task before applying a combiner
- Webinfodb backup: retrieve and sort
  - Map worker: retrieve webpages from webinfodb
  - Reduce worker: gunzip and compress webpages with lsr
  - Partitioner: a list of URLs
- ASP log: retrieve and split by product
- PLSA
  - 30M docs, 700K words and 300/1000 topics

Challenges during development
- Complexity
  - 500,000+ lines of C++ code
  - Asynchronous RPC calls
- Fault tolerance
  - Failed/slow machines
  - Large clock skew
  - Transient DNS failures
  - Momentary network breaks
  - Undetected network transfer errors
  - Lost, disordered, delayed network packets

Roadmap to Pyramid
- Preparation
- DFS design → DFS development & test → DFS trial app
- DCS design → DCS development & test → DCS trial app
- DTS design → DTS development & test → DTS trial app
[Timeline: quarters from 2007 Q3 through 2009 Q4, with "Today" marked]

References
- Pyramid Wiki:
- DCS SDK:

Q & A

DFS Worker
- Data are divided into chunks (256 MB)
- Chunks are distributed among workers
- Chunks are replicated on multiple workers for reliability as well as performance
- Clients communicate with workers directly for data

DFS Master
- Keeps all metadata in memory
  - File namespace
  - Chunk namespace
- Logs all metadata mutations
- Sends/receives heartbeats/responses to/from all workers
- Detects machine failure and migrates chunks automatically
- Balances load between workers automatically
- Shadow masters to counter master failure
- Clients communicate with the master for metadata only

DFS Master failure
- Primary master + shadow master(s)
- A shadow master keeps itself synchronized by replaying the commit log produced by the primary master
- A shadow master lags the primary master, typically by fractions of a second
- A shadow master becomes the primary after the latter fails

DFS: Performance considerations
- Large chunk size
- Chunk replica distribution policy
- Concurrent appending
- Nearest network data transfer
- Master's batched commit log
- Master's in-memory metadata policy
- Master's copy-on-write metadata structure
- 10:1 read/write performance gap

DFS: 3 replicas
- Reliability and performance

DCS: The partition function (sketched below)
- Default partitioner
  - md5(key) mod R
- Metamorphosis: Hash(key) = f(key)
  - Hash(url) = f(site(url))
  - Hash(url) = f(domain(url))
  - Hash(key+x) = f(key)
- Other partitioners
  - Separators
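A minimal sketch of such a "metamorphosed" partitioner, which hashes a derived form of the key (here the site of a URL) so that related keys fall into the same reduce bucket; the site() extraction and the use of std::hash in place of md5 are illustrative assumptions.

```cpp
// Illustrative partitioners: the default hashes the whole key, the
// "metamorphosed" one hashes f(key) instead, e.g. Hash(url) = f(site(url)),
// so all pages from one site go to the same reduce bucket.
#include <functional>
#include <string>

// Very rough site() extraction: scheme://host/...  ->  host
std::string site(const std::string& url) {
    size_t start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    size_t end = url.find('/', start);
    return url.substr(start, end == std::string::npos ? std::string::npos
                                                      : end - start);
}

// Default partitioner (std::hash replaces the slides' md5).
int partition_default(const std::string& key, int R) {
    return static_cast<int>(std::hash<std::string>{}(key) % R);
}

// Metamorphosis: hash a function of the key rather than the key itself.
int partition_by_site(const std::string& url, int R) {
    return static_cast<int>(std::hash<std::string>{}(site(url)) % R);
}
```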

How a worker performs a mutation (sketched below)
- A worker always writes a mutation to a memory patch
- Reads and writes can proceed in parallel by adopting a copy-on-write technique
- The in-memory portion is written to disk and reset to empty when its size reaches a threshold
- The original tablet and its disk patches are immutable, to accelerate reading, writing and splitting
- The tablet is rew…
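A minimal sketch of this write path under the assumptions above: mutations go into an in-memory patch, the patch is flushed to an immutable disk patch at a size threshold, and reads consult the patches from newest to oldest. The class name and the flush mechanism are illustrative, not the actual DTS worker code.

```cpp
// Illustrative tablet write path: an in-memory patch absorbs mutations and is
// turned into an immutable patch when it reaches a size threshold.
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <vector>

class Tablet {
public:
    explicit Tablet(size_t flush_threshold_bytes)
        : threshold_(flush_threshold_bytes) {}

    // Every mutation goes to the in-memory patch first.
    void put(const std::string& key, const std::string& value) {
        mem_bytes_ += key.size() + value.size();
        mem_patch_[key] = value;
        if (mem_bytes_ >= threshold_) flush();
    }

    // Reads check the memory patch, then the immutable patches, newest first;
    // older patches and the base tablet are never modified.
    std::optional<std::string> get(const std::string& key) const {
        if (auto it = mem_patch_.find(key); it != mem_patch_.end())
            return it->second;
        for (auto p = disk_patches_.rbegin(); p != disk_patches_.rend(); ++p)
            if (auto it = (*p)->find(key); it != (*p)->end())
                return it->second;
        return std::nullopt;
    }

private:
    using Patch = std::map<std::string, std::string>;

    // Stand-in for writing the memory patch to disk as an immutable file.
    void flush() {
        disk_patches_.push_back(
            std::make_shared<const Patch>(std::move(mem_patch_)));
        mem_patch_.clear();
        mem_bytes_ = 0;
    }

    size_t threshold_;
    size_t mem_bytes_ = 0;
    Patch mem_patch_;
    std::vector<std::shared_ptr<const Patch>> disk_patches_;
};
```

The shared_ptr<const Patch> snapshots stand in for the copy-on-write idea: readers can keep using an older set of immutable patches while the writer installs a new one.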
