



免费预览已结束,剩余1页可下载查看
下载本文档
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
big data ecosystemreference architectureby orit levin, microsoft corporation.july 1, 2013introductionin recent years, the ongoing innovation in distributed computing and cloud infrastructure has become one of the main factors for data collection growth, development of new database technologies and analytics, and their broadening usage. from the enterprise data warehouses to the hundreds of companies that comprise the advertising industry, advancements in a big data ecosystem are quickly incorporated in best practices and newest technologies, becoming more cost-effective and fuelling innovation. policy makers worldwide, traditionally focused on data collection, are broadening their scope of concern to include new data distribution and data usage practices. the big data ecosystem can be best described using a data-centric reference architecture (ra) showing end-to-end data collection, transformation, distribution, and usage. many players in the big data ecosystem would benefit from this analysis, particularly managers and policy makers dealing with the rapid changes in the way data is collected and transformed. the purpose of this ra is to assist the nist big data standardization effort through (1) making the life cycle of big data better understood by industry, policy makers, and users; (2) identifying relevant components and functions in order to define their boundaries, interoperability surfaces, and security implications.overviewthe big data ecosystem reference architecture is a high level data-centric diagram that depicts the big data flow and possible data transformations from collection to usage. the big data ecosystem is comprised of four main components: sources, transformation, infrastructure and usage, as shown on figure 1. security and management are shown as examples of additional supporting cross-cutting sub-systems that provide backdrop services and functionality to the rest of the big data ecosystem. figure 1: big data ecosystem reference architecturedata sourcestypically, the data behind “big data” is collected for a specific purpose, creating the data objects in a form that supports the known use at the data collection time. once data is collected, it can be reused for a variety of purposes, some potentially unknown at the collection time.data sources can be classified by three characteristics that define big data and are independent of the data content or context: volume, velocity, and variety gartner press release, “gartner says solving big data challenge involves more than just managing volumes of data”, june 27, 2011.:volume: caused by increased transaction numbers, as well as by new types of data. it can be both a storage issue and a massive analysis issue.variety: it is characterized by different formats of information including tabular data (databases), hierarchical data, documents, e-mail, metering data, video, still images, audio, stock ticker data, and financial transactions.velocity: this involves streams of data, structured record creation, and availability for access and delivery. velocity means both how fast data is being produced and how fast the data must be processed to meet demand.typically, data that can be characterized by an extreme of at least two out of these characteristics is considered “big data” that would be very difficult to process, use, and manage using traditional databases and data-processing applications.for each use case, sources of big data can be further classified by various application-specific criteria, such as the presence of information about people (that may have downstream privacy implications), the format, or specific data sources.data transformationas data propagates through the ecosystem, it is being processed and transformed in different ways in order to extract the value from the information. for the purpose of defining interoperability surfaces, it is important to identify common transformations that are implemented by independent modules, systems, or deployed as stand-alone services.editors note: the transformation functional blocks shown in figure 1 can be performed by separate systems or organizations, with data moving between those entities. (see use case i: advertising.) similar and additional transformational blocks are being used in enterprise data warehouses, but typically they are closely integrated and rely on a common data base to exchange the information. (see use case ii: enterprise data warehouse.)each transformation function may have its specific pre-processing stage including registration and metadata creation, may use different specialized data infrastructure best fitted for its requirements, and may have its own privacy and other policy considerations. common known data transformation include at least:collection: data can be collected in different types and forms. at the initial collection stage, sets of data (e.g., data records) from similar sources and of similar structure are collected (and combined) resulting in uniform security considerations, policies, etc. initial metadata is created (e.g., subjects with keys are identified) to facilitate subsequent aggregation or lookup method(s).aggregation: sets of existing data collections with easily correlated metadata (e.g., identical keys) are aggregated into a larger collection. as a result, the information about each object is enriched or the number of objects in the collection grows. security considerations and policies concerning the resultant collection are typically similar to the original collections.matching: sets of existing data collections with dissimilar metadata (e.g., keys) are aggregated into a larger collection. (for example, in advertising industry matching services correlate http cookies values with persons real name.) as a result, the information about each object is enriched. the security considerations and policies concerning the resultant collection are subject to data exchange interfaces design.editors note: with the development of use cases, more common transformational blocks to be added.data mining: according to dbta database trends and applications, /articles/editorial/trends-and-applications/what-is-data-analysis-and-data-mining-73503.aspx, jan 7, 2011 , “data mining can be defined as the process of extracting data, analyzing it from many dimensions or perspectives, then producing a summary of the information in a useful form that identifies relationships within the data. there are two types of data mining: descriptive, which gives information about existing data; and predictive, which makes forecasts based on the data.”data infrastructurebig data infrastructure is a bundle of data storage or database software, servers, storage, and networking used in support of the data transformation functions and for storage of data as needed. editors note: in figure 1, data infrastructure is to the right of the data transformation, to emphasize the natural role of data infrastructure in support of data transformations. note that the horizontal data retrieval and storage paths exists between the two, which are different from the vertical data paths between them and data sources and data usage.in order to achieve high efficiencies, data of different volume, variety and velocity would typically be stored and processed using computing and storage technologies tailored to those characteristics. the choice of processing and storage technology is also dependent on the transformation itself. as a result, often the same data can be transformed (either sequentially or in parallel) multiple times using independent data infrastructure. under development conditioning: examples include de-identification, sampling, and fuzzing.under development storage and retrieval. examples include nosql and sql databases with various specialized types of data load
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 泸州市重点中学2026届高三化学第一学期期末达标检测试题含解析
- 情景交际公开课课件
- 人教版 2024 版历史八年级上册第二单元《早期现代化的初步探索和民族危机加剧》测试卷(附答案)
- 学校常态化疫情防控方案
- 恒丰银行反洗钱培训课件
- 小学语文第一单元的复习方案
- 2026届安徽省滁州西城区中学高一化学第一学期期末经典试题含解析
- 宣化叉车实操考试试题及答案
- 新安化工考试试题及答案
- 无领导面试题及答案
- 2025年3月医务工作者个人自传范文
- 2025年乡村全科助理医师考试题库及答案
- 钢管柱混凝土施工方案
- 2025小学《义务教育语文课程标准》(2022 年版)测试题库及答案【共3 套】
- 小学植物百科知识
- 循环水地下管道安装施工方案
- 预制板粘贴碳纤维加固计算表格
- 检验科生物安全风险评估报告
- 2024年08月北京中信银行北京分行社会招考(826)笔试历年参考题库附带答案详解
- 肾囊肿-护理查房
- 裁床岗位职责
评论
0/150
提交评论