Big Data: Overview of Storage and Processing

Big Data
- Big data is not only about the size of the data.
- Complexity: missing information, dummy data, organization.
- Processing: software, processing power, parallel and distributed computing.
- Data transfer: limitations of current systems; CPU intensive.
- Storage: data sets beyond relational databases; clusters, data centers, distributed data.
- User interaction: non-programmers need to perform complex information retrieval; real-time GUI interfaces; visualization of data.

Where the Field Is
- Primary sources of big data: meteorology, complex physics simulations, biology, business, web search, social networking, telecommunications.
- Many programs exist for storage and processing; the most popular are HDFS, GFS, Hadoop, and MapReduce.
- There is no standard for processing or storing data and no common "off the shelf" software, which increases the difficulty of mining data within a field or industry.

Difficulties
- Storage: developing a system in which very large amounts of data can be stored securely and accessed quickly.
- Transfer: moving data from the storage site to the processing site; moving large amounts of data over TCP is costly.
- Processing: how powerful a system is needed? "There is a lot of data but no information": the challenge is processing the data efficiently and extracting the correct information.

The Direction of Data Storage
- NoSQL allows storage of massive data sets without the need for overwhelming tables and indexing.
- Each cluster stores part of the data and replicates it on other clusters.
- Master/slave architecture: HDFS (Hadoop Distributed File System).
- P2P architecture: Cassandra, with its ColumnFamily data model.
- Increased difficulty for data mining: no join operations, so queries pull in more data than needed, increasing transfer times and processing demands.

More About NoSQL (ACID)
- The key advantage of schema-free design is that it lets applications quickly upgrade the structure of their data without table rewrites. Data validity and integrity are enforced at the data management layer.
- NoSQL systems typically do not maintain complete consistency across distributed servers because of the burden this places on databases, particularly distributed ones.
- The Consistency, Availability, Partition tolerance (CAP) theorem states that of consistency, availability, and partition tolerance, only two can be fully guaranteed at any time.
- Traditional relational databases enforce strict transactional semantics to preserve consistency; many NoSQL databases instead use more scalable architectures that relax the consistency requirement.
- Some NoSQL databases put objects into a conflict state when concurrent updates collide, and it is ultimately the application's responsibility to resolve these conflicts (a toy sketch of this appears after the Google File System slide below).

Important Papers by Google
- Google File System
- MapReduce
- BigTable

Google File System
- Google reexamined traditional choices and assumptions and explored radically different points in the design space.
- First, component failures are the norm rather than the exception. The system is built from many inexpensive commodity components that often fail; it must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
- Second, files are huge by traditional standards; multi-GB files are common.
- Third, most files are mutated by appending new data rather than overwriting existing data.
- Fourth, co-designing the applications and the file system API benefits the overall system by increasing flexibility.
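Aside: a toy illustration of the CAP trade-off described on the "More About NoSQL (ACID)" slide. The sketch below, in Python, lets two replicas accept writes independently (availability during a partition) and reconcile later with a last-write-wins merge. The Replica class, the explicit timestamps, and the merge rule are illustrative assumptions, not the protocol of any real NoSQL database; systems such as Cassandra use far richer machinery (vector clocks, read repair, tombstones).

    class Replica:
        # One node of a toy eventually consistent key-value store.
        # Each value carries a timestamp; merging keeps the newer write.
        def __init__(self):
            self.data = {}  # key -> (timestamp, value)

        def put(self, key, value, ts):
            # Writes succeed locally even when other replicas are
            # unreachable: availability is kept, consistency deferred.
            self.data[key] = (ts, value)

        def get(self, key):
            entry = self.data.get(key)
            return entry[1] if entry else None

        def merge(self, other):
            # Anti-entropy: reconcile with another replica using
            # last-write-wins. As the slide notes, real applications
            # may still have to resolve some conflicts themselves.
            for key, (ts, value) in other.data.items():
                if key not in self.data or self.data[key][0] < ts:
                    self.data[key] = (ts, value)

    # During a partition, both replicas accept conflicting writes...
    a, b = Replica(), Replica()
    a.put("user:42:city", "London", ts=1)
    b.put("user:42:city", "Paris", ts=2)

    # ...and converge only after the partition heals and they merge.
    a.merge(b); b.merge(a)
    assert a.get("user:42:city") == b.get("user:42:city") == "Paris"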
Consistency Model
- Random writes within a file are practically non-existent. Once written, files are only read, and often only sequentially; a wide variety of data share these characteristics.
- Appending therefore becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
- Google introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them.
- Snapshot: creates a copy of a file or a directory tree at low cost.
- Record append: allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append, without additional locking (a single-process sketch appears at the end of this deck).

GFS Architecture
- Master servers keep metadata on the various data files.
- Chunk servers store the actual data on disk; each chunk is replicated across three different chunk servers for redundancy in case of server crashes.
- Once directed by a master server, a client application retrieves files directly from the chunk servers.

MapReduce Operation
- MapReduce is a programming model and an associated implementation for processing and generating large data sets.
- Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
- The MapReduce system has three different types of servers:
  - The master server assigns user tasks to map and reduce servers and tracks the state of those tasks.
  - The map servers accept user input and perform map operations on it; the results are written to intermediate files.
  - The reduce servers accept the intermediate files produced by the map servers and perform reduce operations on them.
- The steps are: GFS -> Map -> Shuffle -> Reduce -> store results back into GFS.
- In MapReduce, a map transforms one view of the data into another, producing key/value pairs.
- Data transferred between map and reduce servers is compressed: because the servers are not CPU bound, it makes sense to spend CPU time on compression and decompression in order to save bandwidth and I/O.

Map and Reduce (Contd.)
- The canonical word-count example from the MapReduce paper (a runnable Python version appears at the end of this deck):

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

Big Table
- BigTable is a large-scale, fault-tolerant, self-managing system that spans terabytes of memory and petabytes of storage, and it can handle millions of reads and writes per second.
- BigTable is a distributed hash mechanism built on top of GFS. It is not a relational database: it supports neither joins nor SQL-style queries, but it provides a lookup mechanism for accessing structured data by key. GFS stores opaque data, and many applications need data with structure.
- Machines can be added and removed while the system is running, and the whole system just works.
- Each data item is stored in a cell that is addressed by a row key, a column key, and a timestamp (a toy sketch of this cell model appears at the end of this deck).
- BigTable has three different types of servers: master, tablet, and lock servers.

Hardware Strategy
- Use ultra-cheap commodity hardware and build software on top that handles its failures.
- A 1,000-fold increase in computing power can be had at roughly 33 times lower cost if you use a failure-prone infrastructure rather than one built on highly reliable components.
- You must build reliability on top of unreliable components.
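Aside: the atomic record append from the "Consistency Model" slide, reduced to a single-process stand-in in Python. The point is that the storage side, not the appending clients, chooses the offset, so concurrent appenders need no synchronization among themselves. The ChunkLog class and its internal lock are illustrative assumptions; real GFS coordinates appends across chunk replicas, which a local lock cannot capture.

    import threading

    class ChunkLog:
        # Stand-in for a chunk server. It, not the client, assigns
        # the offset at which each appended record lands.
        def __init__(self):
            self.records = []
            self._lock = threading.Lock()

        def record_append(self, data):
            # Atomic with respect to other appenders.
            with self._lock:
                offset = len(self.records)
                self.records.append(data)
                return offset

    log = ChunkLog()
    threads = [threading.Thread(target=log.record_append, args=(f"rec-{i}",))
               for i in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(len(log.records))  # 8: no appends lost, no client-side locking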
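Aside: a runnable Python companion to the word-count pseudocode on the "Map and Reduce (Contd.)" slide. The in-memory shuffle dictionary stands in for the distributed shuffle phase between the map and reduce servers; the function names are invented for illustration and do not belong to Hadoop or any other framework.

    from collections import defaultdict

    def map_fn(doc_name, contents):
        # key: document name, value: document contents
        for word in contents.split():
            yield word, 1

    def reduce_fn(word, counts):
        # key: a word, values: a list of counts
        return word, sum(counts)

    def mapreduce(documents):
        # Shuffle: group intermediate values by key. On a cluster this
        # grouping happens across machines, and the grouped files are
        # what the reduce servers consume.
        shuffle = defaultdict(list)
        for doc_name, contents in documents.items():
            for key, value in map_fn(doc_name, contents):
                shuffle[key].append(value)
        return dict(reduce_fn(k, v) for k, v in shuffle.items())

    docs = {"a.txt": "big data big ideas", "b.txt": "big clusters"}
    print(mapreduce(docs))
    # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}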
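Aside: the (row key, column key, timestamp) cell addressing from the "Big Table" slide, as a minimal in-memory sketch in Python. The ToyTable class is an invented illustration; tablets, the lock service, replication, and persistence are all omitted.

    class ToyTable:
        # Toy BigTable-style map: (row, column, timestamp) -> value.
        # Each cell keeps its versions sorted by timestamp.
        def __init__(self):
            self.cells = {}  # (row, column) -> list of (timestamp, value)

        def put(self, row, column, timestamp, value):
            versions = self.cells.setdefault((row, column), [])
            versions.append((timestamp, value))
            versions.sort()

        def get(self, row, column, timestamp=float("inf")):
            # Newest value written at or before `timestamp`.
            for ts, value in reversed(self.cells.get((row, column), [])):
                if ts <= timestamp:
                    return value
            return None

    t = ToyTable()
    t.put("com.cnn.www", "contents:html", 1, "<html>v1</html>")
    t.put("com.cnn.www", "contents:html", 2, "<html>v2</html>")
    print(t.get("com.cnn.www", "contents:html"))               # v2 (newest)
    print(t.get("com.cnn.www", "contents:html", timestamp=1))  # v1 (as of t=1)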
