




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Pyramid:the distributed infrastructure阳振坤2009-06-17 Why distributed system?Massive data sizeTrillions of web pages indexed by Baidu/GoogleThousands of terabytes of logs.Challenges from machine failurePractically no fault-free machinesInflexibility of manual partitioning data among machinesWhy NOT Ha
2、doop?Baidu goes toe to toe with GoogleHuge gaps between Hadoop and Google on availability (fault tolerance), performance, scalability, featuresSingle point failureReal time fault tolerance, data/task migrationConcurrent appendingAutomatic rebalanceMaintenanceCopy-on-write + instant snapshot/checkpoi
3、ntB-tree + pressed file namespaceWhy will Pyramid beat Google?Standing on the shoulders of Google etcNo legacy burdensBetter infrastructureBetter design to fit modern machine metadataHigher localization rateWhat can Pyramid do for you?Scaling up to thousands of machines30GB/s and 3GB/s read/write pe
4、rformance in a 100-machine clusterAutomatic data partition among machinesTransparent machine failure handlingFailed machine is automatically detected and removedData and task are migrated to other machines automatically and quicklyMachine failure is transparent to appsAutomatic parallelizationNo par
5、allel or distributed system background are requiredDynamic load balance among machinesLow operation and maintenance costPyramids infrastructureDCS: Distributed Computing SystemDTS: Distributed Table SystemDFS: Distributed File SystemTo simplify design and implementationFor stage resultsDFSDTSDCSDFSs
6、 goalCapable of storing thousands of TB of dataHigh and sustainable aggregate IO bandwidthHundreds of GB/s read performanceTens of GB/s write performanceUninterruptible serviceBuilt-in fault tolerance and high availabilityAutomatic machine managementPlug and playDFSs infrastructureSingle master + ma
7、ny workersFiles: divided into fixed-size chunks (256MB) stored by workersChunks: replicated on multiple workersDFSDTSDCSDM333444555666777888000222111Infrastructure of DTSSingle DTS master + many workersSorted and partitioned by row keyEach partition is about 256MBPartitions can be split or merged du
8、e to insertion or deletionB+ tree hierarchyDTSs goalUp to trillions of rows and billions of columnsSorted and partitioned by row keyDistributed among hundreds or thousands of machinesEach machine serves a few to a few thousand partitions Queries are done by corresponding machinesDFSDTSDCSDCSs goalA
9、restricted programming model (Map/Reduce)Widely adaptableAutomatically parallel and transparently fault-tolerantA distributed execution frameworkPartitioning job to tasksScheduling tasks among machinesDealing with machine failureBenefits to engineersAutomatic parallelization Transparent fault-tolera
10、nceBoth native C+ and scriptNo experience on parallel or distributed system requiredDFSDTSDCSMap & reduce prototypeMap: (k1,v1) = list (k2, v2)k1, v1 and k2, v2 are all binary stringsUsually (k2,v2) are drawn from a different domain than the inputWritten by the userReduce: (k2, v2) = list(v)Both k2
11、and v2 are stringsUsually v is draw from the same domain as the inputWritten by the userHow MapReduce worksMapReduce transforms and rearranges dataMap 1:123RMap 2:123RMap 3:123R123R123R123R123RMap M:123RInput data for map #1Input data for map #2Input data for map #3Input data for map #MMapReduce dia
12、gramRead dataMap: extract something from each recordShuffle and SortReduce: aggregate, summarize, filter, or transformWrite the resultsScheduling & load balancingJob splittingM map tasksR Reduce tasksTask assignmentThe master obtains a machine pool from the underlying systemA task is assigned to a w
13、orker whenever it es idleData shuffling begins before all map tasks are doneReduce tasks will not start until all map tasks are doneDealing with stragglersA few stragglers are commonBad disksResource consumed by other jobsBackup taskStarting pointResult arbitrationCommutative and associative propert
14、yTask control policyEnable/disable backup tasksLimit map tasks number per machine during reduce timeCountering against failed workersDone/undergoing map tasks on failed worker Done/undergoing reduce tasks on failed workerMap 1:123RMap 2:123RMap 3:123R123R123R123R123RMap M:123RMap worker of word coun
15、tingRetrieve input data from DFS fileInterpret input data (text)Call map() to retrieve each word and emit(word, “1”)Deliver the emitted pair to corresponding reduce buckets: md5(key) mod R, R: number of reduce tasksSort each bucket and combine duplicated to and emitKeep final results to local disk a
16、s temporary fileReduce worker of word countingRetrieve corresponding bucket (set of ) from each map worker through network (shuffle) Sort whole retrieved data by keyExternal sort is often invokedCall reduce() to sum up under the same “word” and emit Process the emitted content by output_format_tDeli
17、ver final results to output_target_t (DFS file)Map worker of distributed sortRetrieve input data through input_source_t: e.g., DFS fileInterpret input data according to input_format_t: e.g., textCall user defined map() functionMap() retrieve every and emit themCollect pairs emitted by map()Deliver t
18、hese pairs to corresponding reduce bucket according to the partitioner_te.g., R-1 separator key strings, R: the number of reduce tasksKeep final results to local disk as temporary filestr1str2str(R-1)123RReduce worker of distributed sortRetrieve corresponding bucket (set of pairs) from each map work
19、er through network (shuffle) Sort whole retrieved data by keyExternal sort is often invokedCall user defined reduce() function: emit(key, value)Collect results emitted by reduce() Process the above results according to output_format_tDeliver final results to output_target_tstr1str2str(R-1)123RHow ma
20、p worksMap collects its output by bucketMulti-way mergingKept as local temporary files123R213R31232R123R123R512MB512MB512MBHow reduce worksReduce shuffles data from each map workerEach shuffle implies a disk read on a map machineReduce sorts its input dataMulti-way mergingOften assign reduce worker
21、more memoryMap 1:3Map 2:3Map 3:3333Map M:3A3,F3,B3,U3C3,K3,A3,A3,V3U3,C3,B3,L3V3,K3,L3,A31GBA3,F3,B3,U3,C3,K31GBA3,A3,V3,U3,C3,B31GBL3,1GBBatch lookup from a tableAssumptionsA large table (e.g., tens of GBs or more)A relatively large volume of source keys (a few GBs or more)TargetFor every key in th
22、e source, retrieve its value in the tableCan MapReduce offer a solution to it?One solutionInput sources: the table and the source keysThe map() functionFor the table input, emit(key+0, value)For the source keys input, emit(key+1, null)The reduce() functionFor , store key and value to member variable
23、s of the classFor , compare current key with stored key, emit if equal, emit otherwiseHow to make these two items in the same bucketPartition(key+x) = md5(key) mod RA few more MapReduce examplesSobar log analysis at 2008.11First DCS appRetrieve source data by another DCS app firstLinkuniq at 2008.12
24、1TB+ single reduce task input before applying combinerWebinfodb backup: retrieve and sortMap worker: retrieve webpage from webinfodbReduce worker: gunzip and press webpage by lsrPartitioner: a list of urlsASP log: retrieve and split by products PLSA30M docs, 700K words and 300/1000 topicsChallenges
25、during developmentComplexity500,000+ C+ code linesAsynchronous RPC callsFault toleranceFailed/slow machinesLarge clock skewTransient failed DNSMomentary network breakUndetected network transferring errorLost、disordered、delayed network packetsRoadmap to PyramidPreparationDFS designDFS development & t
26、estDCS designDCS development & testDTS designDTS development & test200720082009Q3Q4Q1Q2Q3Q4Q1Q2Q3Q4TodayDFS trial appDCS trial appDTS trial appReferencesReference:Pyramid Wiki: DCS SDK: Q & ADFS WorkerData are divided into chunks (256MB)Chunks are distributed among workersChunks are replicated on mu
27、ltiple workers for reliability as well as performanceClients communicates with workers for data directlyDFS MasterKeeps all metadata in memoryFile namespaceChunk namespaceLogs all metadata mutationsSends/receives heartbeat/response to/from all workersDetects machine failure and migrates chunks autom
28、aticallyBalances load between workers automaticallyShadow masters to counter against master failureClients communicate with the master for metadata onlyDFS Master failurePrimary master + shadow master(s)A shadow master keeps itself synchronized by replaying commit log produced by the primary masterA
29、 shadow master lags the primary master, typically fractions of a secondA shadow master es the primary after the latter failsDFS: Performance considerationsLarge chunk sizeChunk replicas distribution policyConcurrent appendingNearest network data transferringMasters batch commit logMasters in-memory
30、metadata policyMasters copy-on-write metadata structure10:1 read/write performance gapDFS: 3 replicasReliability and PerformanceDCS: The partition functionDefault partitionermd5(key) mod RMetamorphosis: Hash(f(key)Hash(url) = f(site(url)Hash(url) = f(domain(url)Hash(key+x) = f(key)Other partitioners
31、SeparatorsHow a worker performs a mutationA worker always writes a mutation to a memory patchReads and writes can be proceeded in parallel by adopting copy-on-write techniqueThe in-memory portion is written to disk and reset to empty when its size reaches a thresholdThe original tablet and its disk patches are immutable to accelerate reading, writing and splittingThe tablet is rew
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 陕西西安铁一中学2024年化学九年级第一学期期末达标测试试题含解析
- 黑龙江省齐齐哈尔市第二十一中学2024年数学七上期末监测试题含解析
- 广东省珠海市香洲区前山中学2024年物理八年级第一学期期末调研模拟试题含解析
- 天津市宝坻区名校2024年数学八年级第一学期期末质量检测模拟试题含解析
- 生物科技参股经营合同范本
- 生物大分子逐步沉淀技术的实验指南
- 2025至2030婴儿护肤品行业市场深度调研及供需格局及有效策略与实施路径评估报告
- 2025至2030中国自由潜水鳍行业市场深度研究及发展前景投资可行性分析报告
- 2025至2030中国自助旅游行业市场发展分析及竞争格局与投资发展报告
- 2025至2030中国自动血管贴标机及标本运输箱行业产业运行态势及投资规划深度研究报告
- 2025年校长职级考试题及答案
- 国家能源集团采购管理规定及实施办法知识试卷
- 2024年广州市南沙区社区专职招聘考试真题
- 山东医药技师学院招聘笔试真题2024
- (高清版)DB13(J)∕T 8556-2023 建设工程消耗量标准及计算规则(园林绿化工程)
- GC/T 1401-2022国家物资储备标志及使用规范
- QC小组活动记录【范本模板】
- JJF 1334-2012混凝土裂缝宽度及深度测量仪校准规范
- GB/T 3003-2017耐火纤维及制品
- GB/T 1094.1-2013电力变压器第1部分:总则
- 经济责任审计报告
评论
0/150
提交评论