Cloud Computing: Alibaba Expert Courseware 02 (Jiao Tong University), Aliyun Storage Fundamentals

Fundamentals of Storage Systems
Jason Wu, Senior Director, Aliyun Apsara

About Myself
  • Jason Wu (吴结生)
  • Alibaba Cloud Computing, Seattle (08/2014 to present)
  • Microsoft Azure Storage, Seattle (08/2008 to 08/2014)
  • [employer name missing in the source], San Jose (10/2004 to 08/2014)
  • Institute of Computing Technology and Research Center for Intelligent Computers (07/1997 to 08/1999)
  • Education:
      • The Ohio State University, PhD in Computer Science, 2004
      • USTC, Master in Computer Science, 1997
      • Anhui Normal University, Math and Computer Science, 1994

What Is Important to a Storage System
  • Durable
  • Reliable
  • Access anytime, anywhere
  • Unlimited size
  • Cheap
  • Pay as you use
  • Easy to manage

Pain Points from IT/Storage Managers/Professionals

Storage Trends
  • Data growth is around 70% per year, most of it unstructured
  • Scale-out rather than scale-up
  • Object is gaining a lot of traction, but file is not going away
  • NAS will stay a significant player
  • Analysts predict NAS will grow at a CAGR of 25.85% over 2014-2019
  • Unified Storage: NAS, SAN, and Object

Estimated Storage Capacity in 2016

FOBS: File and Object Based Storage (from IBM, LinuxCon 2014)

Cloud Storage
  • Fully managed storage service
  • Software-defined storage
  • Highly durable and available
  • Unlimited scalability
  • Pay as you use
  • Economy of scale
  • Secure
  • [Figure labels: EMC, HP, Dell, IBM, NetApp; 2017; EMC, AWS, Azure, HP, Dell, IBM, NetApp]

Aliyun Storage Services
  • Structured and unstructured storage services
  • Wide range of scenarios: block, file, object, table, queue, etc.
  • Strongest team in China in distributed storage systems; you are welcome to join us!
  • [Diagram: services on top of Pangu Distributed Storage: Object Storage Service, Network Attached Storage, Block Storage, Logging Service, Table Storage, Messaging Notification Service, Archive Storage]

Storage Services Built Upon Pangu System
  • [Diagram: Pangu Distributed File System (Paxos masters, M, and Chunk Servers, CS) exposes two interfaces, Append-Only File and Random-Access File. Services on top: OSS, OTS, MNS, LogHub, Block Storage, NAS, MaxCompute]

CAP Theorem

CAP Theorem History
  • Eric Brewer, UC Berkeley
  • CAP principle, 1995. Armando Fox and Eric Brewer, “Harvest, Yield and Scalable Tolerant Systems”, HotOS95
  • CAP conjecture, 2000. Symposium on Principles of Distributed Computing (PODC).
  • CAP theorem, 2002. Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer's conjecture, “Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services”, ACM SIGACT News, Volume 33, Issue 2 (2002), pp. 51-59.
  • CAP revisited, 2012. Brewer, “CAP Twelve Years Later: How the ‘Rules’ Have Changed”

CAP Theorem
  • Any networked shared-data system can have at most two of three desirable properties:
      • Consistency (C): having a single up-to-date copy of the data;
      • High availability (A) of that data; and
      • Tolerance to network partitions (P)
  • “2 out of 3” is misleading
      • It oversimplifies the tension among the three properties
      • CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare.
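The choice that CAP forces during a partition can be illustrated with a toy model. The following is a minimal sketch, not anything from the slides: the `TwoReplicaRegister` class, its `"CP"`/`"AP"` mode names, and the `partitioned` flag are all hypothetical names introduced for illustration. In "CP" mode the register refuses writes while partitioned (it stays consistent but is unavailable); in "AP" mode it accepts the write on one side and the replicas diverge.

```python
class TwoReplicaRegister:
    """Toy register replicated on two nodes (illustrative only)."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.values = [None, None]   # one slot per replica
        self.partitioned = False     # True while the replicas cannot talk

    def write(self, replica_id, value):
        """Return True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: update both copies synchronously.
            self.values = [value, value]
            return True
        if self.mode == "CP":
            # Give up availability to keep a single up-to-date copy.
            return False
        # AP: stay available, accept the write on the reachable side only.
        self.values[replica_id] = value
        return True

    def consistent(self):
        return self.values[0] == self.values[1]


cp = TwoReplicaRegister("CP")
ap = TwoReplicaRegister("AP")
for reg in (cp, ap):
    reg.write(0, "x")
    reg.partitioned = True

print(cp.write(0, "y"), cp.consistent())   # write refused, copies still agree
print(ap.write(0, "y"), ap.consistent())   # write accepted, copies diverge
```

In the AP case the divergence has to be repaired once the partition heals, which is the recovery work any partition-handling design has to plan for.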

CAP Theorem Examples
  • Local system: lock and isolation
  • Networked system
  • Need control
  • Affect partition tolerance, a1 (availability), a2 (consistency)

How to Use CAP Theorem to Guide Design
  • Choose between consistency and availability when partitions are present
  • Maximize combinations of consistency and availability that make sense for the specific application
  • A wide range of flexibility for handling partitions and recovering from them

Managing Partitions
  • Detect the start of a partition
  • Enter an explicit partition mode
      • Limit some operations, e.g. unique elements
      • Record the transaction and perform it later
      • Version vectors and causal consistency for availability
  • Initiate partition recovery when communication is restored
      • the state on both sides must become consistent, and
      • there must be compensation for the mistakes made during partition mode.

Durable and Reliable Storage

Durability
  • Duration of time the system is able to provide access to an entity
  • Durability deals with permanent data loss
  • 11 nines durability
      • 1 out of 100K files in 1 million years may be lost
  • AWS Simple Storage Service, Aliyun Object Storage Service
  • How to build a highly durable storage system?

Replicated Distributed Storage Systems
  • Replication modes
      • Synchronous replication
      • Asynchronous replication
      • Semi-synchronous replication
      • Point-in-time replication
  • Paxos state machine replication

Distributed State Machine
  • Fault-tolerance through replication
  • Need to ensure that replicas remain consistent
  • Replicas must process requests in the same order.

The Distributed Consensus Problem
  • In a distributed system, how can we:
      • Select a single action among many options?
      • How can this be done in a fault-tolerant way?
  • Simple solution: a single node acts as the “decider.”
      • But this is not fault tolerant. (What if the decider fails?)
  • A better solution: Paxos

Fault-Tolerant Consensus
  • Requirements: Safety
      • Only a value that has been proposed may be chosen.
      • Only a single value is chosen.
      • A process never learns that a value has been chosen until it actually has been.
  • Goals: Liveness

Assumptions
  • Stable storage
  • Failures: “Fail Stop” assumption. When a node fails:

      • It ceases to function entirely.
      • It may resume normal operation when restarted.
  • Messages
      • May be lost.
      • May be duplicated.
      • May be delayed (and thus reordered).
      • May not be corrupt.

Paxos Terms
  • Proposal: an alternative proposed by a proposer. Consists of a unique number and a proposed value, e.g. (42, B).
  • We say a value is chosen when consensus is reached on that value.
  • Proposer: suggests values for consideration by acceptors. Advocates for a client.
  • Acceptor: considers the values proposed by proposers. Renders an accept/reject decision.
  • Learner: learns the chosen value.
  • In practice, each node will usually play all three roles.
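The roles above can be made concrete with a small single-value Paxos sketch. This is a teaching sketch under stated assumptions, not the protocol as implemented in any Aliyun system: the `Acceptor` class and `propose` function are hypothetical names, messages are plain method calls (so nothing is lost or reordered), and "consensus" is taken to mean acceptance by a majority of acceptors, which is the standard Paxos rule.

```python
class Acceptor:
    """Renders accept/reject decisions for numbered proposals."""

    def __init__(self):
        self.promised = -1     # highest proposal number promised so far
        self.accepted = None   # (number, value) last accepted, or None

    def on_prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def on_accept(self, n, value):
        # Phase 2: accept unless a higher-numbered prepare was promised.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Proposer logic for one round; returns the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [prev for kind, prev in replies if kind == "promise"]
    if len(promises) < majority:
        return None
    # Safety: if any acceptor already accepted a value, adopt the one
    # with the highest proposal number instead of our own value.
    prior = [p for p in promises if p is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.on_accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 42, "B"))   # the proposal (42, B) from the slide
print(propose(acceptors, 43, "C"))   # a later proposer must re-propose B
print(propose(acceptors, 41, "D"))   # a stale proposal number is rejected
```

The second call shows the safety requirement at work: once B has been chosen, a later proposer that wanted C ends up proposing B again, so only a single value is ever chosen.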

Strong Majority
  • “Strong Majority” / “Quorum”: a set of acceptors consisting of more than half of all acceptors.
  • Any two quorums have a nonempty intersection.
  • Helps avoid the “split-brain” problem, in which acceptors' decisions are not in agreement.
  • A common node acts as “tie-breaker.”
  • In a system with 2F+1 acceptors, F acceptors can fail and we'll be OK.
  • [Figure: quorums in a system with seven acceptors, A1 to A7]

Pangu Distributed Storage
  • [Diagram: Paxos masters (M), Chunk Servers (CS), Pangu Distributed File System]
  • A distributed storage system similar to GFS (Google) and XStream (Azure)
  • Pangu Master
      • Paxos-based highly available and durable system
      • Manages all disks from each node as an integral disk pool
      • Data placement, replication, failure recovery
      • Manages file metadata and provides the directory concept
  • Chunkservers (CS)
      • Write to and read from chunks on CS disks
      • Report CS information to Pangu Master
  • Data is replicated to three machines
  • Placement policy: decided by Pangu Master based on CS info.
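Since Pangu Master is described as Paxos-based, the quorum properties stated earlier are worth checking directly. The sketch below brute-forces them for the seven-acceptor example; the function names (`quorums`, `all_quorums_intersect`, `tolerated_failures`) are illustrative and not taken from the slides.

```python
from itertools import combinations


def quorums(acceptors):
    """Every subset containing more than half of the acceptors."""
    n = len(acceptors)
    smallest = n // 2 + 1
    return [frozenset(c)
            for size in range(smallest, n + 1)
            for c in combinations(acceptors, size)]


def all_quorums_intersect(acceptors):
    """True if every pair of quorums shares at least one acceptor."""
    return all(q1 & q2 for q1, q2 in combinations(quorums(acceptors), 2))


def tolerated_failures(n):
    """Largest number of failed acceptors that still leaves a quorum alive."""
    return n - (n // 2 + 1)


seven = [f"A{i}" for i in range(1, 8)]
print(len(quorums(seven)))            # number of distinct quorums
print(all_quorums_intersect(seven))   # any two quorums overlap
print(tolerated_failures(7))          # F for 2F+1 = 7
```

The overlap is what prevents split-brain: two groups that each believe they hold a majority must share at least one acceptor, and that shared acceptor cannot have agreed to two conflicting decisions.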

Durability within a Single Availability Zone
  • [Diagram: failure domains: disk failures, machine failures, rack failures, switch failures, data center]
  • Designed for 11 9s durability
  • Three replicas synced to three different machines from three different racks
  • Automatic re-replication under failures
  • Automatic and periodic bit-rot/checksum verification
  • Data is lost when a data center is not recoverable

Failure Detection and Auto-Recovery
  • Data node failure detection
      • Heartbeat between Master and data nodes
      • Gossip protocols from clients for reasoning about the state of data nodes
  • Disk failure and bit-rot
      • Data nodes periodically check disk states (e.g. SMART system)
      • Periodic CRC verification
  • Re-replication when failures are detected
      • Master initiates automatic re-replication
  • Intelligent tail-latency improvement
      • Avoid slow/no-response data nodes automatically
      • Hedged requests to different data nodes to improve tail latency

Re-replication Policies
  • [Figure labels: 1TB, 4TB, 4TB, 4TB]
  • Re-replication priority is inverse to the number of replicas: the fewer replicas remain, the sooner the data is re-replicated
  • Master coordinates and schedules re-replication tasks
  • Optimize the resource and destination nodes to take advantage of parallelism
  • Accurate re-replication network bandwidth control, to minimize impact on live requests
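The priority rule above (fewest remaining replicas first) maps naturally onto a min-heap. The following is a rough sketch of how a master might order tasks, not Pangu's actual scheduler: `schedule_rereplication`, its arguments, and the least-loaded tie-breaking are assumptions made for illustration. It schedules one copy per under-replicated chunk per pass, picks a destination on a rack that holds no replica yet, and spreads work across nodes to exploit parallelism.

```python
import heapq


def schedule_rereplication(chunks, node_rack, target=3):
    """Plan one re-replication task per under-replicated chunk.

    chunks:    {chunk_id: set of live nodes holding a replica}
    node_rack: {live node: rack name}
    Returns a list of (chunk_id, source_node, destination_node),
    most endangered chunks first.
    """
    heap = [(len(nodes), cid) for cid, nodes in chunks.items()
            if 0 < len(nodes) < target]
    heapq.heapify(heap)                   # fewest replicas pops first
    load = {node: 0 for node in node_rack}
    tasks = []
    while heap:
        _, cid = heapq.heappop(heap)
        holders = chunks[cid]
        used_racks = {node_rack[n] for n in holders}
        candidates = [n for n in node_rack if node_rack[n] not in used_racks]
        if not candidates:
            continue                      # no rack-diverse destination available
        # Prefer the least-loaded source and destination.
        src = min(sorted(holders), key=lambda n: load[n])
        dst = min(sorted(candidates), key=lambda n: load[n])
        load[src] += 1
        load[dst] += 1
        tasks.append((cid, src, dst))
    return tasks


node_rack = {"n1": "r1", "n2": "r2", "n3": "r3", "n4": "r4"}
chunks = {"c1": {"n1", "n2"}, "c2": {"n3"}, "c3": {"n1", "n2", "n3"}}
for task in schedule_rereplication(chunks, node_rack):
    print(task)
```

Bandwidth control is left out of the sketch; a real scheduler would also cap the bytes per second each task may use so that live requests are not starved.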

Highly Scalable and Available Storage System

Object Storage Bucket
  • [Diagram: two sites, “AZ A or Region A” and “AZ B or Region B”. In each site, data access goes through SLB to the OSS Server, the Key-Value Partition Layer, and the Durable Storage Layer, with replication within the AZ. Cross Region/AZ geo replication links the two sites.]
  • Access an object in a bucket via its URL

Three Layer Architecture
  • [Diagram: Front-Ends Layer: OSSServers. KV Partition Layer: KVServers, KVMaster, Nvwa Lock Service. Pangu Distributed File System: Paxos masters (M), Chunk Servers (CS).]

Software Load Balance
  • [Diagram: Internet, SLB clusters (four LVS clusters), OSS server cluster]
  • Supports millions of concurrent connections and connection-creation requests
  • Current generation SLB:
      • Each machine has 2*10Gb cards
      • LVS supports 40Gbps traffic
      • 10 million concurrent connections
      • 2 million create-connection requests per second
  • One endpoint has multiple VIPs
  • Each VIP is mapped to an LVS cluster
  • 4 LVS clusters to support 160Gbps
  • Next generation SLB (in testing)
      • Each machine has 4*40Gb cards
      • Single LVS supports 600Gbps
  • Each OSS server supports
      • 2*10Gb traffic
      • 10K connections
  • Hundreds of OSS servers
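The bandwidth figures for the current-generation SLB can be cross-checked with simple arithmetic. The helper names below are hypothetical, and the second calculation rests on an assumption the slides do not state: that OSS servers could be driven at their full 2*10Gb line rate. Under that assumption it gives a lower bound on the number of OSS servers needed to absorb the aggregate SLB bandwidth.

```python
import math


def aggregate_gbps(clusters, gbps_per_cluster):
    """Total bandwidth when each VIP maps to its own LVS cluster."""
    return clusters * gbps_per_cluster


def min_servers_at_line_rate(total_gbps, gbps_per_server):
    """Lower bound on servers needed if each one ran at full line rate."""
    return math.ceil(total_gbps / gbps_per_server)


total = aggregate_gbps(clusters=4, gbps_per_cluster=40)
oss_floor = min_servers_at_line_rate(total, gbps_per_server=2 * 10)
print(total)       # matches the 160Gbps quoted for 4 LVS clusters
print(oss_floor)   # bandwidth-only floor on the OSS server count
```

The floor is far below the hundreds of OSS servers quoted, which suggests the fleet is sized by factors other than raw network bandwidth, such as connection counts and request processing.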

Storage Services Built Upon Pangu System
  • [Diagram: Pangu Distributed File System (Paxos masters, M, and Chunk Servers, CS) with two interfaces, Append-Only File and Random-Access File. Services: OSS, OTS, MNS, LogHub, Block Storage, NAS, MaxCompute]
  • Pangu manages 3.5EB of raw disk space, including object, block, and data for MaxCompute (Hadoop-like big data processing)
  • There is about 500PB of object storage

Partition Layer: Scalable Object Management System
  • Billions of objects across all buckets are stored
  • Efficiently enumerate, query, get, insert, and update
  • Deal with highly dynamic traffic such as hot objects, peak load, and traffic bursts
  • Need a scalable and high-performance object index system
      • Spread the object management across 1000s of servers
      • Dynamically load balance
      • Automatic partitioning based on load
      • Automatic partition movement based on load
  • A scalable Key-Value Store system
  • [Diagram: KV Partition Layer: KVServers, KVMaster, Nvwa Lock Service]

Scalable Object Partitioning
  • [Table: object index with columns BucketName, ObjectName, ObjectMeta; rows range from (aaaa, aaaa) to (zzzz, zzzz)]
  • Split the index into RangePartitions based on load
  • Split at object name boundaries
  • PartitionMap tracks index RangePartition assignment to partition servers
  • OSS servers cache the PartitionMap to route user requests
  • Each part of the index is assigned to only one Partition Server at a time
  • [Diagram: storage cluster with OSSServer, KVMaster, and KV servers KVS 1, KVS 2, KVS 3. Object index rows: (aaaa, aaaa), (harry, Pictures/sunrise), (harry, Pictures/sunset), (richard, Videos/soccer), (richard, Videos/tennis), (zzzz, zzzz). Object Index Partition Map: A-H: KVS1, H-R: KVS2, R-Z: KVS3.]
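A cached PartitionMap lookup amounts to a binary search over sorted range boundaries. The sketch below is one plausible shape for it, using the A-H / H-R / R-Z map from the diagram. Several details are assumptions not confirmed by the slides: the `PartitionMap` class and its methods are hypothetical, the index key is formed as `bucket/object` in lower case, and each range is treated as including its lower bound and excluding its upper bound.

```python
import bisect


class PartitionMap:
    """Maps sorted key ranges to the single server that owns each range."""

    def __init__(self, lower_bounds, servers):
        # lower_bounds[i] is the inclusive start of the range owned by servers[i].
        assert len(lower_bounds) == len(servers)
        assert lower_bounds == sorted(lower_bounds)
        self.lower_bounds = list(lower_bounds)
        self.servers = list(servers)

    def lookup(self, bucket, object_name):
        key = f"{bucket}/{object_name}".lower()
        i = bisect.bisect_right(self.lower_bounds, key) - 1
        if i < 0:
            raise KeyError(key)           # key sorts before the first range
        return self.servers[i]

    def split(self, boundary, new_server):
        """Split the range containing `boundary`; the upper part moves."""
        i = bisect.bisect_left(self.lower_bounds, boundary)
        if i < len(self.lower_bounds) and self.lower_bounds[i] == boundary:
            raise ValueError("boundary already exists")
        self.lower_bounds.insert(i, boundary)
        self.servers.insert(i, new_server)


pmap = PartitionMap(["a", "h", "r"], ["KVS1", "KVS2", "KVS3"])
print(pmap.lookup("aaaa", "aaaa"))
print(pmap.lookup("harry", "Pictures/sunset"))
print(pmap.lookup("richard", "Videos/tennis"))

pmap.split("m", "KVS4")                   # a hot H-R range is split at "m"
print(pmap.lookup("nancy", "Docs/report"))
```

Because every key falls into exactly one range, each part of the index has exactly one owner at a time, and a split only has to publish a new boundary for front ends to refresh their cached map.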

Object Index Range Partition - Log Structured Merge-Tree
  • [Diagram: partition memory data: Memory Table, Index Cache, Bloom Filters, Load Metrics. Persistent data on Pangu: Commit Log File, Partition Meta Log File, Object Index File (Checkpoint SSTables, Compacted SSTable), Object Data File (Object Data, Blob Data). Flows: Writes, Read/Query, Checkpoint.]

Write Object
  • [Diagram: same components, with OSS Server and KV Server. Step 1: data, step 2: index.]
  • Write data
      • High bandwidth from multiple object data files
      • Write data directly from OSS server to save network hops

Read Object
  • [Diagram: same components, with OSS Server and KV Server. Step 1: index, step 2: data.]
  • Read data
      • Big object: read from OSS server, saving a network hop
      • Small object: piggybacked from KV server

LSM-tree Based Index System

How to Build Fast Index?
  • Suppose a machine generates 1000 logs per second; build a system to store logs from thousands of machines
  • Query based on time, machine, event name, ...
  • Key-value storage system
      • Insert(k,v)
      • Delete(k)
      • v = Query(k)
      • v1, v2, ... = range-query(k1, k2, ...)

LSM: Log-Structured Merge-Tree

LSM-trees: Insert

LSM-trees: Lookup

Why LSM-trees
  • Good for hard drives
      • Batch and write sequentially
      • High sequential throughput
      • Sequential access up to 1000x faster than random
  • Not optimal for disk
      • Large write/read amplifications
      • Wastes device resources

I/O Amplifications in LSM trees

How to Reduce I/O Amplification?
  • Reduce write amplification by COW (copy-on-write)
  • Read amplification worsens for queries

Separating Keys from Values
  • WiscKey: Separating Keys from Values in SSD-conscious Storage. Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. 14th USENIX Conference on File and Storage Technologies (FAST), 2016
  • SSD is good at
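The Insert, Delete, and Query operations listed for the key-value store can be sketched on top of a toy LSM-tree. This is a minimal in-memory illustration under stated assumptions, not the index used by OSS: the `TinyLSM` class, its `memtable_limit` parameter, and the `TOMBSTONE` marker are hypothetical names; "SSTables" are plain sorted Python lists rather than files on Pangu; and there is no commit log, Bloom filter, or range query.

```python
import bisect

TOMBSTONE = object()   # marks a deleted key until compaction drops it


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}          # recent writes, unsorted
        self.sstables = []          # newest first; each a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def insert(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, k):
        self.insert(k, TOMBSTONE)   # deletes are writes too

    def _flush(self):
        # One batched, sequential write of the whole memory table.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def query(self, k):
        if k in self.memtable:
            v = self.memtable[k]
            return None if v is TOMBSTONE else v
        for table in self.sstables:             # newest table wins
            i = bisect.bisect_left(table, (k,))
            if i < len(table) and table[i][0] == k:
                v = table[i][1]
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Merge all tables, keeping only the newest version of each key.
        merged = {}
        for table in reversed(self.sstables):   # oldest first
            merged.update(table)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.sstables = [live] if live else []


db = TinyLSM(memtable_limit=2)
db.insert("a", 1)
db.insert("b", 2)      # memory table is full: flushed as one sorted table
db.insert("a", 3)
db.delete("b")         # second flush; the old table still holds a=1, b=2
print(db.query("a"), db.query("b"), len(db.sstables))
db.compact()
print(db.query("a"), db.query("b"), len(db.sstables))
```

The sketch also shows where the amplification comes from: a lookup may have to probe several tables (read amplification), and compaction rewrites data that was already written once (write amplification).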
