
Ozone: Next-Gen Storage for Data Lake

Agenda
- ABC of Data Lakes
- Overview of Apache Ozone
- Ozone architecture, design, and details
- Current status, work in progress, and release plan
- Ozone in Tencent

What is a Data Lake?
- A data lake is a centralized repository that lets you store all of your structured and unstructured data at any scale.
- You can store data as-is, without having to structure it first, and run different types of analytics on it — dashboards and visualizations, big data processing, real-time analytics, and machine learning — to guide better decisions.

Why a Data Lake?
(Diagram: databases, data warehouses, object stores, and data lakes serving BI, machine learning, data management, data IDE, and NoSQL workloads.)

Core Features of a Data Lake Solution
- A data lake solution includes Data Lake Storage, Data Lake Management, and Data Lake Analytics.
- Differences compared to a data warehouse: follows the natural structure of the data; mainly used for ad-hoc queries and heterogeneous analytics; not well suited to routine data modeling, data marts, and data governance.
- Data Lake Analytics: different patterns of analytic workloads to process different data.
- Data Lake Management: metadata governance to manage the lifecycle of heterogeneous data.
- Data Lake Storage: a storage system for different patterns of data, structured and unstructured.

Scenarios of a Data Lake
- Data Scientist: data exploration; ad-hoc queries to investigate the value of data.
- Data Analyst: data exploration; interactive queries on heterogeneous data.

(Diagram: data scientists run ad-hoc queries and analysts run interactive queries through Data Lake Analytics and BI, over object stores and NoSQL storage.)

Target Personas
- Data Analyst: creates different analysis models; creates layered data marts; data modeling; data visualization.
- Data Scientist: interactive data exploration; ML and DL; hyper-parameter tuning.
- Project Manager: project management; user management; coordination.
- Data Engineer: data collection and ingestion; ETL task scheduling; data preprocessing and data governance.

Themes on the Data Lake Storage Side
- Scalability, cloud, machine learning.

Retrospect of the HDFS Architecture
- The NameNode stores all metadata in memory, which gives low-latency metadata operations and easy scaling of IO: petabytes of data and many clients.
- Metadata in memory is both the strength and the weakness of HDFS.

Why Ozone?
- HDFS has scaling problems; some users hold a "make your HDFS healthy" day.
- Around 200 million files for regular users; companies with committers/core devs reach 400-600 million.
- New opportunities and challenges: cloud; streaming; small files are the norm.

Scaling Challenges
- When scaling a cluster up to 4K+ nodes with about 500M files: namespace metadata in the NameNode; block management in the NameNode; file operation concurrency; block report handling; 150K+ client RPCs; slow NameNode startup.
- Small files in HDFS make things worse!

What is Apache Ozone?
- An object store for big data that scales both in number of objects and in IOPS; the NameNode is no longer a bottleneck.
- A set of micro-services, each capable of doing its own job.
- Leverages the lessons learned from supporting HDFS across a large set of use cases.
- Apache YARN, MapReduce, Spark, and Hive are all tested and certified to work with Apache Ozone; no application changes are required.
- Supports K8s and CSI, and can run on K8s natively.
- A spiritual successor to HDFS.

Ozone Architecture Overview
(Diagram: the Ozone Manager takes over the NameNode's namespace role, the Storage Container Manager manages containers, and Datanodes replicate data via Apache Ratis.)
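The Ozone Manager in the overview above keeps a flat volume/bucket/key namespace in a sorted key-value store. The idea can be sketched with a toy model; this is purely illustrative and is not Ozone's actual RocksDB schema or API — all names here are made up for the sketch.

```python
# Toy model of a flat volume/bucket/key namespace backed by a sorted
# key-value store, in the spirit of OM's LSM-based metadata tables.
# Illustrative only; not Ozone's real on-disk layout.
import bisect

class FlatNamespace:
    def __init__(self):
        self._keys = []   # sorted list of "/volume/bucket/key" paths
        self._meta = {}   # path -> metadata (e.g., block list)

    def put(self, volume, bucket, key, metadata):
        path = f"/{volume}/{bucket}/{key}"
        if path not in self._meta:
            bisect.insort(self._keys, path)
        self._meta[path] = metadata

    def get(self, volume, bucket, key):
        return self._meta[f"/{volume}/{bucket}/{key}"]

    def list_bucket(self, volume, bucket):
        # In a sorted keyspace, everything under a bucket prefix is
        # contiguous, so listing a bucket is a cheap prefix scan.
        prefix = f"/{volume}/{bucket}/"
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for path in self._keys[start:]:
            if not path.startswith(prefix):
                break
            out.append(path)
        return out

ns = FlatNamespace()
ns.put("vol1", "logs", "2019/01/app.log", {"blocks": [1, 2]})
ns.put("vol1", "logs", "2019/02/app.log", {"blocks": [3]})
ns.put("vol1", "images", "cat.jpg", {"blocks": [4]})
print(ns.list_bucket("vol1", "logs"))
```

Because the namespace is flat, there is no in-memory directory tree to keep hot — which is one reason the metadata does not all have to live in memory.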

Ozone Manager (Namespace Layer)
- The namespace layer for Ozone: manages objects in a flat namespace of Volume/Bucket/Key.
- Metadata lives in an LSM-based K-V store (LevelDB/RocksDB).
- Benefits compared with the HDFS NameNode: easy to manage and scale; 1B keys tested in a single OM; scales independently of the block layer; easily sharded by bucket; no GC pressure; not all metadata has to sit in memory.

HDDS Storage Containers
- The HDDS datanode service is a plug-in service running in Datanodes.
- A container is the basic unit of replication (2-16 GB).
- Block metadata is fully distributed in LSM-based K-V stores; there is no centralized in-memory block map as in HDFS.

Open/Closed Containers
- Open container: keys and their blocks are mutable; writes go through Apache Ratis.
- Closed container: keys and their blocks are immutable; writes are not allowed.
- A container is closed when it is close to full, or on failure (OPEN → CLOSED).

HDDS Pipelines
- HDDS writes to containers using pipelines.
- HDDS uses an Apache Ratis-based pipeline, which still appears as a replicated stream (a leader DN replicating to follower DNs).
- Apache Ratis is a highly customizable Raft consensus protocol library that supports high-throughput data replication use cases like Ozone.

Ozone Write Path
- Create a file; blocks are allocated by OM/SCM.
- Blocks are written directly to datanodes — very similar to HDFS.
- When a file is closed, it becomes visible for others to use.

Ozone Read Path
- The client reads the block locations from OM, then reads data directly from the Datanodes: the same old HDFS protocol.
- Ozone relies on everything that is good in HDFS, including its source code.
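The write and read paths above can be sketched as a toy sequence: the client asks OM for block allocations, writes directly to datanodes, and commits the key; a reader looks the key up in OM and fetches from datanodes. All names here (`allocate_block`, `commit_key`, etc.) are invented for the sketch and are not the real OM/SCM RPC interface.

```python
# Toy sketch of the Ozone write and read paths described above.
# Illustrative only; not the real client protocol.
import itertools

class ToyDatanode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

class ToyOM:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.keys = {}            # key -> block ids, visible only after commit
        self._ids = itertools.count(1)

    def allocate_block(self):
        # OM/SCM hand the client a block id and a target datanode.
        return next(self._ids), self.datanodes[0]

    def commit_key(self, key, block_ids):
        # Like closing a file in HDFS: the key becomes visible to others.
        self.keys[key] = block_ids

    def lookup(self, key):
        return self.keys[key]

def write_key(om, key, chunks):
    block_ids = []
    for chunk in chunks:
        block_id, dn = om.allocate_block()
        dn.blocks[block_id] = chunk     # client writes directly to the DN
        block_ids.append(block_id)
    om.commit_key(key, block_ids)       # visible only once committed

def read_key(om, datanodes, key):
    # Client fetches block locations from OM, then reads DNs directly.
    data = b""
    for block_id in om.lookup(key):
        for dn in datanodes:
            if block_id in dn.blocks:
                data += dn.blocks[block_id]
                break
    return data

dns = [ToyDatanode("dn1")]
om = ToyOM(dns)
write_key(om, "/vol/bucket/file", [b"hello ", b"world"])
print(read_key(om, dns, "/vol/bucket/file"))
```

The key point the sketch captures is that metadata traffic goes to OM while data bytes flow directly between client and datanodes, exactly as in HDFS.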

SCM: Storage Container Manager
- Node management: handles node reports.
- Container management: handles container reports.
- Replication management: replicates closed containers instead of blocks; tracks under- and over-replicated containers.
- Pipeline management: creates and removes pipelines for open containers.
- Security: approves and issues X.509 certificates for OMs and DNs.

Heartbeats from Datanodes
- Heartbeats from datanodes contain many reports.
- SCM decodes these into different report types, and different handlers process them inside SCM.

Container State Manager
- Learns the state of containers via container reports.
- Maintains all state in two structures: a node-to-container map, which tracks which containers are on each node, and a container state map, which keeps all container states.
- If a container is lost, the Container State Manager detects it and queues replication requests to the Replica Manager.

Node State Manager
- Learns the state of nodes via node reports.
- The Node Manager keeps track of node liveness — for example, whether a node is live or dead.
- When a node's status changes, the Node Manager fires events such as DEAD_NODE and STALE_NODE.

Replication Manager
- The Container State Manager detects that a container replica is under- or over-replicated and posts events to the Replica Manager asking for replication.
- The Replica Manager maintains this state in replication-pending queues as well as in-flight state.

Pipeline Manager
- Keeps track of the open pipelines in the cluster and of their health.
- When a node goes offline, the Pipeline Manager closes the containers and pipelines on that node.

Ozone Deployment Options
- Multiple Ozone Managers + multiple SCMs; Ozone/HDDS protocols.

When should we use Ozone?
- If you have a scale issue, in files or throughput.
- If you need an archival store for HDFS, or a large data store.
- If you need S3 or a cloud-like presence on-prem.
- If you want to set up dedicated storage clusters.
- If you have lots of small files.
- If you are moving to K8s and need a big-data-capable file system.
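The SCM bookkeeping described above — a node-to-container map, DEAD_NODE events, and a replication queue fed by under-replication checks — can be sketched as a toy model. Class and method names are invented for the sketch and are not SCM's real interfaces.

```python
# Toy sketch of SCM's container bookkeeping: the node-to-container map
# plus the replication-pending queue described above. Illustrative only.
from collections import defaultdict

REPLICATION_FACTOR = 3

class ToyContainerStateManager:
    def __init__(self):
        self.node_to_containers = defaultdict(set)  # node -> container ids
        self.replication_queue = []                 # under-replicated containers

    def add_replica(self, node, container_id):
        self.node_to_containers[node].add(container_id)

    def replica_count(self, container_id):
        return sum(1 for cids in self.node_to_containers.values()
                   if container_id in cids)

    def on_dead_node(self, node):
        # DEAD_NODE event: drop the node's replicas, then queue any
        # container that fell below the replication factor.
        lost = self.node_to_containers.pop(node, set())
        for cid in lost:
            if self.replica_count(cid) < REPLICATION_FACTOR:
                self.replication_queue.append(cid)

csm = ToyContainerStateManager()
for node in ("dn1", "dn2", "dn3"):
    csm.add_replica(node, container_id=42)
csm.on_dead_node("dn3")
print(csm.replication_queue)
```

Note that the unit being re-replicated here is a whole container, not an individual block — which is what lets SCM track far fewer objects than an HDFS NameNode would.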

- If you are planning new HDFS deployments.

What are Ozone's Microservices?
- Ozone Manager: the "NameNode" that deals with file names.
- SCM, the block server, which deals with block allocation and physical servers.
- Recon Server: the control plane.
- S3 Gateway.
- Datanodes.

Ozone's Scalability
- Ozone is designed for scale; the first release of Ozone will officially support 10 billion keys.
- Ozone achieves this through a combination of factors:
- Partial namespace in memory: file system metadata is loaded on demand.
- Off-heap memory usage: to avoid excessive GC, Ozone relies on off-heap native memory.
- Multiple Ozone Managers and block services: users can scale OM or SCM independently, and end users will not even notice, since the Ozone protocol handles this scaling automatically.
- Metadata is grouped into large aggregations called storage containers.
- Metadata is distributed more evenly across the cluster, including the Datanodes.
- Multiple OMs, with the ability to read from secondaries.

About Data Locality
- Definitions:
- Process local: data local to the process performing the computation.
- Node local: data local to the computation node.
- Rack local: data in the same rack as the computation node.
- Region/DC local: data in the same region/data center, but on different racks/zones from the computation node.
- Why locality is important: low latency; high throughput; less network traffic; fast job execution; better cluster utilization; reliability — e.g., "store my data with 2 replicas in the EU, 2 replicas in the U.S., and 1 in Asia."

Ozone's Data Locality
- Hierarchical topology; topology-aware reads; topology-aware container replication.

Ozone's Security
- HDFS security is based on Kerberos, but Kerberos cannot sustain the scale of applications running in a Hadoop cluster, so HDFS relies on delegation tokens and block tokens.
- Ozone uses the same mechanism, so applications need no changes.
- SCM comes with its own certificate authority; end users do not need to know about it.
- Security is on by default, not an afterthought.
- HDDS-4 has been merged into trunk; the next release will have security.
- First-class integration with Apache Ranger, TDE (transparent data encryption) support, and Ranger service definitions for Ozone security.

Ozone's HA
- Like HDFS, Ozone will have HA; unlike HDFS, HA is a built-in feature of Ozone.
- Users only need to deploy three instances of OM/SCM — that is it.
- HA is automatic even when you run a single node: OM assumes it is in a single-node HA configuration.

Ozone's Testing
- Ozone uses K8s-based clusters for testing. Both long running
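The locality levels defined earlier (node local, rack local, region/DC local) translate into a simple "pick the closest replica" rule for topology-aware reads. A minimal sketch, assuming replica locations are encoded as "/dc/rack/node" strings; the distance metric and names are invented for illustration and are not Ozone's actual NetworkTopology API.

```python
# Toy sketch of a topology-aware replica choice, using the locality
# levels defined earlier. Illustrative only.
def distance(reader, replica):
    # 0 = node local, 1 = rack local, 2 = DC local, 3 = remote DC.
    r = reader.strip("/").split("/")
    p = replica.strip("/").split("/")
    if r == p:
        return 0
    if r[:2] == p[:2]:      # same DC and rack
        return 1
    if r[0] == p[0]:        # same DC, different rack
        return 2
    return 3

def closest_replica(reader, replicas):
    # Topology-aware read: prefer the replica nearest in the hierarchy.
    return min(replicas, key=lambda loc: distance(reader, loc))

replicas = ["/eu/rack1/dn4", "/us/rack7/dn2", "/us/rack9/dn8"]
print(closest_replica("/us/rack7/dn5", replicas))
```

The same distance function can drive topology-aware container replication in reverse: spread replicas across racks and regions to maximize the minimum pairwise distance.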
