Fundamentals to Storage Systems
Jason Wu
Senior Director, Aliyun Apsara

About Myself
- Jason Wu (吴结生)
- Alibaba Cloud Computing, Seattle (08/2014-present)
- Microsoft Azure Storage, Seattle (08/2008-08/2014)
- San Jose (10/2004-08/2014)
- Institute of Computing Technology and Intelligent Computer Research Center (07/1997-08/1999)
- Education:
  - The Ohio State University, PhD in Computer Science, 2004
  - USTC, Master in Computer Science, 1997
  - Anhui Normal University, Math and Computer Science, 1994

What Is Important to a Storage System
- Durable
- Reliable
- Access anytime, anywhere
- Unlimited size
- Cheap
- Pay as you use
- Easy to manage

Pain Points from IT/Storage Managers/Professionals

Storage Trends
- Data growth is around 70% per year, most of it unstructured
- Scale-out rather than scale-up
- Object is gaining a lot of traction, but file is not going away
- NAS will stay a significant player
- Analysts predict NAS will grow at a CAGR of 25.85% over 2014-2019
- Unified Storage: NAS, SAN, and Object

Estimated storage capacity for 2016

FOBS: File and Object Based Storage
- from IBM, LinuxCon 2014

Cloud Storage
- Fully managed storage service
- Software-defined storage
- Highly durable and available
- Unlimited scalability
- Pay as you use
- Economies of scale
- Secure

[Figure: storage vendors. EMC, HP, Dell, IBM, NetApp; 2017: EMC, AWS, Azure, HP, Dell, IBM, NetApp]

Aliyun Storage Services
- Structured and unstructured storage services
- Wide range of scenarios: block, file, object, table, queue, etc.
- Strongest team in China in distributed storage systems
- Welcome to join us!

[Figure: Pangu Distributed Storage underneath Object Storage Service, Network Attached Storage, Block Storage, Logging Service, Table Storage, Messaging Notification Service, and Archive Storage]

Storage Services Built Upon Pangu System
[Figure: Pangu Distributed File System (Paxos-replicated masters M, Chunk Servers (CS)) exposing two interfaces, Append-Only File and Random-Access File, to the services OSS, OTS, MNS, LogHub, Block Storage, NAS, and MaxCompute]

CAP Theorem

CAP Theorem History
- Eric Brewer, UC Berkeley
- CAP principle, 1999. Armando Fox and Eric Brewer, "Harvest, Yield, and Scalable Tolerant Systems", HotOS '99
- CAP conjecture, 2000. Symposium on Principles of Distributed Computing (PODC)
- CAP theorem, 2002. Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer's conjecture: "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services", ACM SIGACT News, Volume 33, Issue 2 (2002), pp. 51-59
- CAP revisited, 2012. Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed"

CAP Theorem
- Any networked shared-data system can have at most two of three desirable properties:
  - Consistency (C): having a single up-to-date copy of the data
  - High availability (A) of that data
  - Tolerance to network partitions (P)
- "2 out of 3" is misleading
  - It oversimplifies the tension among the three properties
  - CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare

CAP Theorem Examples
- Local system: lock and isolation
- Networked system
  - Need control
  - Affect partition tolerance, a1 (availability), a2 (consistency)

How to Use CAP Theorem to Guide Design
- Choose between consistency and availability when partitions are present
- Maximize combinations of consistency and availability that make sense for the specific application
- A wide range of flexibility for handling partitions and recovering from them

Managing Partitions
- Detect the start of a partition
- Enter an explicit partition mode
  - Limit some operations, e.g. unique elements
  - Record the transaction and perform it later
  - Version vectors and causal consistency for availability
- Initiate partition recovery when communication is restored
  - The state on both sides must become consistent, and
  - there must be compensation for the mistakes made during partition mode

Durable and Reliable Storage

Durability
- Duration of time the system is able to provide access to an entity
- Durability deals with permanent data loss
- 11 nines durability
  - 1 out of 100K files in 1 million years may be lost
- AWS Simple Storage Service, Aliyun Object Storage Service
- How to build a highly durable storage system?

Replicated Distributed Storage Systems
- Replication modes
  - Synchronous replication
  - Asynchronous replication
  - Semi-synchronous replication
  - Point-in-time replication
- Paxos state machine replication

Distributed State Machine
- Fault tolerance through replication
- Need to ensure that replicas remain consistent
- Replicas must process requests in the same order

The Distributed Consensus Problem
- In a distributed system, how can we:
  - Select a single action among many options?
  - Do this in a fault-tolerant way?
- Simple solution: a single node acts as the "decider"
  - But this is not fault tolerant (what if the decider fails?)
- A better solution: Paxos

Fault-Tolerant Consensus
- Requirements: safety
  - Only a value that has been proposed may be chosen
  - Only a single value is chosen
  - A process never learns that a value has been chosen until it actually has been
- Goals: liveness

Assumptions
- Stable storage
- Failures: "fail stop" assumption
  - When a node fails, it ceases to function entirely
  - It may resume normal operation when restarted
- Messages
  - May be lost
  - May be duplicated
  - May be delayed (and thus reordered)
  - Are not corrupted

Proposal
- An alternative proposed by a proposer
- Consists of a unique number and a proposed value, e.g. (42, B)
- We say a value is chosen when consensus is reached on that value

Paxos Terms
- Proposer: suggests values for consideration by acceptors; advocates for a client
- Acceptor: considers the values proposed by proposers; renders an accept/reject decision
- Learner: learns the chosen value
- In practice, each node will usually play all three roles

Strong Majority
- "Strong majority" / "quorum": a set of acceptors consisting of more than half of all acceptors
- Any two quorums have a nonempty intersection
- Helps avoid the "split-brain" problem, in which acceptors' decisions are not in agreement
  - The common node acts as "tie-breaker"
- In a system with 2F+1 acceptors, F acceptors can fail and we'll be OK
[Figure: quorums in a system with seven acceptors, A1 to A7]

Pangu Distributed Storage
[Figure: Pangu Distributed File System with Paxos-replicated masters (M) and Chunk Servers (CS)]
- A distributed storage system similar to GFS (Google) and XStream (Azure)
- Pangu Master
  - Paxos-based, highly available and durable system
  - Manages all disks from each node as an integral disk pool
  - Data placement, replication, failure recovery
  - Manages file metadata and provides the directory concept
- Chunkservers (CS)
  - Write to and read from chunks on CS disks
  - Report CS information to Pangu Master
- Data is replicated to three machines
- Placement policy: decided by Pangu Master based on CS info
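
Pangu Master is described as Paxos-based, so the two claims on the Strong Majority slide (any two quorums intersect, and 2F+1 acceptors tolerate F failures) are worth checking concretely. The following is a minimal brute-force sketch for the seven-acceptor example. The function names are illustrative only and do not come from Pangu or from any Paxos library.

```python
from itertools import combinations


def quorum_size(n_acceptors: int) -> int:
    """Smallest number of acceptors that is more than half of n_acceptors."""
    return n_acceptors // 2 + 1


def tolerated_failures(n_acceptors: int) -> int:
    """Largest number of failed acceptors that still leaves a full quorum."""
    return n_acceptors - quorum_size(n_acceptors)


def all_quorums_intersect(acceptors: list[str]) -> bool:
    """Brute-force check that every pair of minimal quorums shares a node."""
    size = quorum_size(len(acceptors))
    quorums = [set(c) for c in combinations(acceptors, size)]
    return all(a & b for a, b in combinations(quorums, 2))


acceptors = [f"A{i}" for i in range(1, 8)]  # seven acceptors, as on the slide
print(quorum_size(len(acceptors)))          # 4
print(tolerated_failures(len(acceptors)))   # 3
print(all_quorums_intersect(acceptors))     # True
```

With seven acceptors a quorum needs four members, so two quorums together name eight members out of seven and must share at least one. That shared acceptor is the "tie-breaker" the slide refers to.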

Durability within a Single Availability Zone
[Figure: failure domains. Disk failures, machine failures, rack failures, switch failures, data center]
- Designed for 11 9s durability
- Three replicas synced to three different machines in three different racks
- Automatic re-replication under failures
- Automatic and periodic bit-rot/checksum verification
- Data is lost when a data center is not recoverable

Failure Detection and Auto-Recovery
- Data node failure detection
  - Heartbeat between Master and data nodes
  - Gossip protocols from clients for reasoning about the state of data nodes
- Disk failure and bit-rot
  - Data nodes periodically check disk states (e.g. SMART system)
  - Periodic CRC verification
- Re-replication when failures are detected
  - Master initiates automatic re-replication
- Intelligent tail-latency improvement
  - Avoid slow/no-response data nodes automatically
  - Hedged requests to different data nodes to improve tail latency

Re-replication Policies
[Figure labels: 1TB, 4TB, 4TB, 4TB]
- Re-replication priority is inverse to the number of replicas
- Master coordinates and schedules re-replication tasks
- Optimize the source and destination nodes to take advantage of parallelism
- Accurate re-replication network bandwidth control, minimize impact on the live requests
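
The first policy above, "re-replication priority is inverse to the number of replicas", amounts to a priority queue keyed on the surviving replica count. Below is a minimal sketch of that ordering only. The chunk ids, the `schedule_rereplication` name, and the choice to skip chunks with zero surviving replicas are assumptions made for illustration; the slides do not describe how the Pangu Master implements its scheduler.

```python
import heapq

TARGET_REPLICAS = 3  # from the slides: data is replicated to three machines


def schedule_rereplication(chunk_replicas: dict[str, int]) -> list[str]:
    """Return under-replicated chunk ids, fewest surviving replicas first."""
    heap = [
        (count, chunk_id)
        for chunk_id, count in chunk_replicas.items()
        # Assumption: a chunk with 0 replicas has no source left to copy from.
        if 0 < count < TARGET_REPLICAS
    ]
    heapq.heapify(heap)
    order = []
    while heap:
        _, chunk_id = heapq.heappop(heap)
        order.append(chunk_id)
    return order


live = {"c1": 3, "c2": 1, "c3": 2, "c4": 1, "c5": 0}
print(schedule_rereplication(live))  # ['c2', 'c4', 'c3']
```

Chunks down to a single replica are one failure away from data loss, so they are copied before chunks that still have two replicas.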

Highly Scalable and Available Storage System

Object Storage Bucket
- Access object in a bucket via the URL:
[Figure: AZ A or Region A and AZ B or Region B, each with SLB, OSS Server, Key-Value Partition Layer, and Durable Storage Layer (replication within AZ); data access enters through each SLB; cross Region/AZ geo replication between the two]

Three Layer Architecture
[Figure: Front-Ends Layer (OSS Servers); KV Partition Layer (KVMaster, KVServers, Nvwa Lock Service); Pangu Distributed File System (Paxos-replicated masters M, Chunk Servers (CS))]

Software Load Balance
[Figure: Internet, SLB clusters (four LVS clusters), OSS server cluster]
- Support millions of concurrent connections and connection-creation requests
- Current generation SLB:
  - Each machine has 2*10Gb cards
  - LVS supports 40Gbps traffic
  - 10 million concurrent connections
  - 2 million connection-creation requests per second
- One endpoint has multiple VIPs
  - Each VIP is mapped to an LVS cluster
  - 4 LVS clusters to support 160Gbps
- Next generation SLB (in testing)
  - Each machine has 4*40Gb cards
  - Single LVS supports 600Gbps
- Each OSS server supports
  - 2*10Gb traffic
  - 10K connections
- Hundreds of OSS servers

Storage Services Built Upon Pangu System
- Pangu manages 3.5EB of raw disk space, including object, block, and data for MaxCompute (Hadoop-like big data processing)
- There is about 500PB of object storage

Partition Layer: Scalable Object Management System
- Billions of objects across all buckets are stored
- Efficiently enumerate, query, get, insert, and update
- Deal with highly dynamic traffic such as hot objects, peak load, and traffic bursts
- Need a scalable and high-performance object index system
  - Spread the object management across 1000s of servers
  - Dynamically load balance
  - Automatic partitioning based on load
  - Automatic partition movement based on load
- A scalable key-value store system
[Figure: KV Partition Layer with KVMaster, KVServers, and Nvwa Lock Service]

Scalable Object Partitioning
- Split index into RangePartitions based on load
  - Split at object name boundaries
- PartitionMap tracks Index RangePartition assignment to partition servers
- OSS servers cache the PartitionMap to route user requests
- Each part of the index is assigned to only one Partition Server at a time
[Figure: object index table (BucketName, ObjectName, ObjectMeta) ordered from aaaa/aaaa to zzzz/zzzz, with sample rows harry Pictures/sunset, richard Videos/soccer, richard Videos/tennis; storage cluster with KVMaster, OSS Server, and KV servers KVS 1, KVS 2; PartitionMap: A-H: KVS1, H-R: KVS2, R-Z: KVS3]
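
The routing step described above (an OSS server consults its cached PartitionMap to find the one server that owns a key range) can be sketched as a binary search over sorted range boundaries. This is a hedged illustration, not the OSS implementation: the `PartitionMap` class, its methods, the `bucket/object` key format, and the lowercasing of keys are all assumptions. Only the A-H / H-R / R-Z assignment and the sample object names come from the slide.

```python
import bisect


class PartitionMap:
    """Maps a key to the single server that owns its range partition."""

    def __init__(self, boundaries: list[str], servers: list[str]):
        # boundaries[i] is the first key of partition i + 1, so there is
        # exactly one more server than there are boundaries.
        assert len(servers) == len(boundaries) + 1
        assert boundaries == sorted(boundaries)
        self.boundaries = boundaries
        self.servers = servers

    def lookup(self, bucket: str, object_name: str) -> str:
        """Return the server owning the range that contains this object."""
        key = f"{bucket}/{object_name}".lower()  # assumed key format
        return self.servers[bisect.bisect_right(self.boundaries, key)]

    def split(self, index: int, split_key: str, new_server: str) -> None:
        """Split partition `index` at split_key; the upper half moves to new_server."""
        self.boundaries.insert(index, split_key)
        self.servers.insert(index + 1, new_server)
        assert self.boundaries == sorted(self.boundaries)


# A-H: KVS1, H-R: KVS2, R-Z: KVS3, as in the slide's PartitionMap.
pm = PartitionMap(["h", "r"], ["KVS1", "KVS2", "KVS3"])
print(pm.lookup("harry", "Pictures/sunset"))   # KVS2
print(pm.lookup("richard", "Videos/soccer"))   # KVS3

pm.split(1, "m", "KVS4")                       # load-based split of H-R at "m"
print(pm.lookup("nina", "notes.txt"))          # KVS4
```

Because every key falls into exactly one interval, each part of the index has a single owner at any time, which matches the last bullet of the slide.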

Object Index Range Partition - Log Structured Merge-Tree
[Figure: partition memory data (Memory Table, Index Cache, Bloom Filters, Load Metrics) and persistent data on Pangu (Commit Log File, Partition Meta Log File, Object Index File made of checkpoint and compacted SSTables, Object Data File); arrows labeled Writes, Read/Query, Checkpoint]

Write Object
[Figure: OSS Server, KV Server (Memory Table, Index Cache, Bloom Filters), Commit Log File, Partition Meta Log File, Object Index File, Object Data Files; steps labeled 1. data, 2. index]
- Write data
  - High bandwidth from multiple object data files
  - Write data directly from the OSS server to save network hops

Read Object
[Figure: same components as Write Object; steps labeled 1. index, 2. data]
- Read data
  - Big object: read from the OSS server to save a network hop
  - Small object: piggybacked from the KV server

LSM-tree Based Index System

How to Build Fast Index?
- Suppose a machine generates 1000 logs per second; build a system to store logs from thousands of machines
- Query based on time, machine, event name, ...
- Key-value storage system
  - Insert(k,v)
  - Delete(k)
  - v = Query(k)
  - v1, v2, ... = range-query(k1, k2, ...)

LSM: Log-Structured Merge-Tree

LSM-trees: Insert

LSM-trees: Lookup

Why LSM-trees
- Good for hard drives
  - Batch and write sequentially
  - High sequential throughput
  - Sequential access up to 1000x faster than random
- Not optimal for disk
  - Large write/read amplification
  - Wastes device resources

I/O Amplification in LSM-trees

How to Reduce I/O Amplification?
- Reduce write amplification by COW (copy-on-write)
- Read amplification worsens for queries

Separating Keys from Values
- WiscKey: Separating Keys from Values in SSD-conscious Storage
- Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
- 14th USENIX Conference on File and Storage Technologies (FAST), 2016
- SSD is good at
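
The Insert, Delete, Query, and range-query interface from the "How to Build Fast Index?" slide can be illustrated with a toy in-memory LSM-tree. This is a sketch under stated assumptions, not the OSS index or WiscKey: the `ToyLSM` class, the `memtable_limit` parameter, the tombstone marker, and the `machine:sequence` key format are all hypothetical, and sorted Python lists stand in for on-disk SSTables.

```python
import bisect

_TOMBSTONE = object()  # marks a deleted key until compaction drops it


class ToyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict = {}
        self.runs: list[list[tuple]] = []  # newest first; each run sorted by key
        self.memtable_limit = memtable_limit

    def insert(self, key, value) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key) -> None:
        self.insert(key, _TOMBSTONE)

    def _flush(self) -> None:
        # The whole batch is sorted once and written as one sequential run.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def query(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is _TOMBSTONE else value
        for run in self.runs:  # newest to oldest: the source of read amplification
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                value = run[i][1]
                return None if value is _TOMBSTONE else value
        return None

    def range_query(self, lo, hi) -> list[tuple]:
        """Return (key, value) pairs with lo <= key <= hi; the newest version wins."""
        merged: dict = {}
        for run in reversed(self.runs):  # oldest first, so newer runs overwrite
            i = bisect.bisect_left(run, (lo,))
            while i < len(run) and run[i][0] <= hi:
                merged[run[i][0]] = run[i][1]
                i += 1
        for key, value in self.memtable.items():
            if lo <= key <= hi:
                merged[key] = value
        return sorted((k, v) for k, v in merged.items() if v is not _TOMBSTONE)

    def compact(self) -> None:
        """Merge all runs into one and drop tombstones (write amplification)."""
        merged: dict = {}
        for run in reversed(self.runs):
            merged.update(run)
        self.runs = [sorted((k, v) for k, v in merged.items() if v is not _TOMBSTONE)]


db = ToyLSM(memtable_limit=2)
db.insert("m1:0001", "boot")
db.insert("m1:0002", "login")          # second insert triggers a flush
db.insert("m2:0001", "boot")
db.delete("m1:0001")                   # tombstone; triggers another flush
print(db.query("m1:0001"))             # None
print(db.query("m1:0002"))             # login
print(db.range_query("m1:", "m1:~"))   # [('m1:0002', 'login')]
```

The sketch shows both sides of the trade-off listed under "Why LSM-trees": inserts only touch the memtable and are flushed as sorted batches, while a lookup may have to probe every run, and compaction rewrites data that was already written once.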