




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、July 22, 2020,Data Mining: Concepts and Techniques,1,Data Mining: Concepts and Techniques Slides for Textbook Chapter 2 ,Jiawei Han and Micheline Kamber Department of Computer Science University of Illinois at Urbana-Champaign /hanj,July 22, 2020,Data Mining: Concepts and Techniques,2
2、,Chapter 2: Data Warehousing and OLAP Technology for Data Mining,What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data cube technology From data warehousing to data mining,July 22, 2020,Data Mining: Concepts and
3、 Techniques,3,What is Data Warehouse?,Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organizations operational database Support information processing by providing a solid platform of consolidated, historical data for analysis.
4、“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of managements decision-making process.”W. H. Inmon Data warehousing: The process of constructing and using data warehouses,July 22, 2020,Data Mining: Concepts and Techniques,4,Data Wareh
5、ouseSubject-Oriented,Organized around major subjects, such as customer, product, sales. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Provide a simple and concise view around particular subject issues by excluding data that are
6、not useful in the decision support process.,July 22, 2020,Data Mining: Concepts and Techniques,5,Data WarehouseIntegrated,Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, on-line transaction records Data cleaning and data integration techniques are ap
7、plied. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources E.g., Hotel price: currency, tax, breakfast covered, etc. When data is moved to the warehouse, it is converted.,July 22, 2020,Data Mining: Concepts and Techniques,6,Data Wareho
8、useTime Variant,The time horizon for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contain
9、s an element of time, explicitly or implicitly But the key of operational data may or may not contain “time element”.,July 22, 2020,Data Mining: Concepts and Techniques,7,Data WarehouseNon-Volatile,A physically separate store of data transformed from the operational environment. Operational update o
10、f data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data.,July 22, 2020,Data Mining: Concepts and Techniques,8,Data Warehou
11、se vs. Heterogeneous DBMS,Traditional heterogeneous DB integration: Build wrappers/mediators on top of heterogeneous databases Query driven approach When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites in
12、volved, and the results are integrated into a global answer set Complex information filtering, compete for resources Data warehouse: update-driven, high performance Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis,July 22, 2020,Da
13、ta Mining: Concepts and Techniques,9,Data Warehouse vs. Operational DBMS,OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing)
14、Major task of data warehouse system Data analysis and decision making Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolut
15、ionary, integrated Access patterns: update vs. read-only but complex queries,July 22, 2020,Data Mining: Concepts and Techniques,10,OLTP vs. OLAP,July 22, 2020,Data Mining: Concepts and Techniques,11,Why Separate Data Warehouse?,High performance for both systems DBMS tuned for OLTP: access methods, i
16、ndexing, concurrency control, recovery Warehousetuned for OLAP: complex OLAP queries, multidimensional view, consolidation. Different functions and different data: missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: DS requires
17、consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled,July 22, 2020,Data Mining: Concepts and Techniques,12,Chapter 2: Data Warehousing and OLAP Tech
18、nology for Data Mining,What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data cube technology From data warehousing to data mining,July 22, 2020,Data Mining: Concepts and Techniques,13,From Tables and Spreadsheet
19、s to Data Cubes,A data warehouse is based on a multidimensional data model which views data in the form of a data cube A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, y
20、ear) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboi
21、ds forms a data cube.,July 22, 2020,Data Mining: Concepts and Techniques,14,Cube: A Lattice of Cuboids,all,time,item,location,supplier,time,item,time,location,time,supplier,item,location,item,supplier,location,supplier,time,item,location,time,item,supplier,time,location,supplier,item,location,suppli
22、er,time, item, location, supplier,0-D(apex) cuboid,1-D cuboids,2-D cuboids,3-D cuboids,4-D(base) cuboid,July 22, 2020,Data Mining: Concepts and Techniques,15,Conceptual Modeling of Data Warehouses,Modeling data warehouses: dimensions week year Set_grouping hierarchy 1.10 inexpensive,July 22, 2020,Da
23、ta Mining: Concepts and Techniques,26,Multidimensional Data,Sales volume as a function of product, month, and region,Product,Region,Month,Dimensions: Product, Location, Time Hierarchical summarization paths,Industry Region Year Category Country Quarter Product City Month Week Office Day,July 22, 202
24、0,Data Mining: Concepts and Techniques,27,A Sample Data Cube,Total annual sales of TV in U.S.A.,July 22, 2020,Data Mining: Concepts and Techniques,28,Cuboids Corresponding to the Cube,all,product,date,country,product,date,product,country,date, country,product, date, country,0-D(apex) cuboid,1-D cubo
25、ids,2-D cuboids,3-D(base) cuboid,July 22, 2020,Data Mining: Concepts and Techniques,29,Browsing a Data Cube,Visualization OLAP capabilities Interactive manipulation,July 22, 2020,Data Mining: Concepts and Techniques,30,Typical OLAP Operations,Roll up (drill-up): summarize data by climbing up hierarc
26、hy or by dimension reduction Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: project and select Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes. Other operations dr
27、ill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-end relational tables (using SQL),July 22, 2020,Data Mining: Concepts and Techniques,31,A Star-Net Query Model,Shipping Method,AIR-EXPRESS,TRUCK,ORDER,Customer Orders,CONTRACTS,Cus
28、tomer,Product,PRODUCT GROUP,PRODUCT LINE,PRODUCT ITEM,SALES PERSON,DISTRICT,DIVISION,Organization,Promotion,CITY,COUNTRY,REGION,Location,DAILY,QTRLY,ANNUALY,Time,Each circle is called a footprint,July 22, 2020,Data Mining: Concepts and Techniques,32,Chapter 2: Data Warehousing and OLAP Technology fo
29、r Data Mining,What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data cube technology From data warehousing to data mining,July 22, 2020,Data Mining: Concepts and Techniques,33,Design of a Data Warehouse: A Busine
30、ss Analysis Framework,Four views regarding the design of a data warehouse Top-down view allows selection of the relevant information necessary for the data warehouse Data source view exposes the information being captured, stored, and managed by operational systems Data warehouse view consists of fa
31、ct tables and dimension tables Business query view sees the perspectives of data in the warehouse from the view of end-user,July 22, 2020,Data Mining: Concepts and Techniques,34,Data Warehouse Design Process,Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design
32、 and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick
33、 turn around Typical data warehouse design process Choose a business process to model, e.g., orders, invoices, etc. Choose the grain (atomic level of data) of the business process Choose the dimensions that will apply to each fact table record Choose the measure that will populate each fact table re
34、cord,July 22, 2020,Data Mining: Concepts and Techniques,35,Multi-Tiered Architecture,Data Warehouse,OLAP Engine,Analysis Query Reports Data mining,Monitor or otherwise, sumk(c) v Others: conjunctions of multiple conditions,July 22, 2020,Data Mining: Concepts and Techniques,72,Discussion: Other Issue
35、s,Computing iceberg cubes with more complex measures? No general answer for holistic measures, e.g., median, mode, rank A research theme even for complex algebraic functions, e.g., standard_dev, variance Dynamic vs . static computation of iceberg cubes v and k are only available at query time Settin
36、g reasonably low parameters for most nontrivial cases Memory-hog? what if the cubing is too big to fit in memory?projection and then cubing,July 22, 2020,Data Mining: Concepts and Techniques,73,Condensed Cube,W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data C
37、ube Size. ICDE02. Icerberg cube cannot solve all the problems Suppose 100 dimensions, only 1 base cell with count = 10. How many aggregate (non-base) cells if count = 10? Condensed cube Only need to store one cell (a1, a2, , a100, 10), which represents all the corresponding aggregate cells Adv. Full
38、y precomputed cube without compression Efficient computation of the minimal condensed cube,July 22, 2020,Data Mining: Concepts and Techniques,74,Chapter 2: Data Warehousing and OLAP Technology for Data Mining,What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data w
39、arehouse implementation Further development of data cube technology From data warehousing to data mining,July 22, 2020,Data Mining: Concepts and Techniques,75,Data Warehouse Usage,Three kinds of data warehouse applications Information processing supports querying, basic statistical analysis, and rep
40、orting using crosstabs, tables, charts and graphs Analytical processing multidimensional analysis of data warehouse data supports basic OLAP operations, slice-dice, drilling, pivoting Data mining knowledge discovery from hidden patterns supports associations, constructing analytical models, performi
41、ng classification and prediction, and presenting the mining results using visualization tools. Differences among the three tasks,July 22, 2020,Data Mining: Concepts and Techniques,76,From On-Line Analytical Processing to On Line Analytical Mining (OLAM),Why online analytical mining? High quality of
42、data in data warehouses DW contains integrated, consistent, cleaned data Available information processing structure surrounding data warehouses ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools OLAP-based exploratory data analysis mining with drilling, dicing, pivoting, etc. O
43、n-line selection of data mining functions integration and swapping of multiple mining functions, algorithms, and tasks. Architecture of OLAM,July 22, 2020,Data Mining: Concepts and Techniques,77,An OLAM Architecture,Data Warehouse,Meta Data,MDDB,OLAM Engine,OLAP Engine,User GUI API,Data Cube API,Dat
44、abase API,Data cleaning,Data integration,Layer3 OLAP/OLAM,Layer2 MDDB,Layer1 Data Repository,Layer4 User Interface,Filtering&Integration,Filtering,Databases,Mining query,Mining result,July 22, 2020,Data Mining: Concepts and Techniques,78,Discovery-Driven Exploration of Data Cubes,Hypothesis-driven e
45、xploration by user, huge search space Discovery-driven (Sarawagi, et al.98) Effective navigation of large OLAP data cubes pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation Exception: significantly different from the value anticipated, based on
46、a statistical model Visual cues such as background color are used to reflect the degree of exception of each cell,July 22, 2020,Data Mining: Concepts and Techniques,79,Kinds of Exceptions and their Computation,Parameters SelfExp: surprise of cell relative to other cells at same level of aggregation
47、InExp: surprise beneath the cell PathExp: surprise beneath cell for each drill-down path Computation of exception indicator (modeling fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction Exception themselves can be stored, indexed and retrieved like prec
48、omputed aggregates,July 22, 2020,Data Mining: Concepts and Techniques,80,Examples: Discovery-Driven Data Cubes,July 22, 2020,Data Mining: Concepts and Techniques,81,Complex Aggregation at Multiple Granularities: Multi-Feature Cubes,Multi-feature cubes (Ross, et al. 1998): Compute complex queries inv
49、olving multiple dependent aggregates at multiple granularities Ex. Grouping by all subsets of item, region, month, find the maximum price in 1997 for each group, and the total sales among all maximum price tuples select item, region, month, max(price), sum(R.sales) from purchases where year = 1997 c
50、ube by item, region, month: R such that R.price = max(price) Continuing the last example, among the max price tuples, find the min and max shelf live, and find the fraction of the total sales due to tuple that have min shelf life within the set of all max price tuples,July 22, 2020,Data Mining: Conc
51、epts and Techniques,82,Cube-Gradient (Cubegrade),Analysis of changes of sophisticated measures in multi-dimensional spaces Query: changes of average house price in Vancouver in 00 comparing against 99 Answer: Apts in West went down 20%, houses in Metrotown went up 10% Cubegrade problem by Imielinski
52、 et al. Changes in dimensions changes in measures Drill-down, roll-up, and mutation,July 22, 2020,Data Mining: Concepts and Techniques,83,From Cubegrade to Multi-dimensional Constrained Gradients in Data Cubes,Significantly more expressive than association rules Capture trends in user-specified meas
53、ures Serious challenges Many trivial cells in a cube “significance constraint” to prune trivial cells Numerate pairs of cells “probe constraint” to select a subset of cells to examine Only interesting changes wanted “gradient constraint” to capture significant changes,July 22, 2020,Data Mining: Conc
54、epts and Techniques,84,MD Constrained Gradient Mining,Significance constraint Csig: (cnt100) Probe constraint Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”) Gradient constraint Cgrad(cg, cp): (avg_price(cg)/avg_price(cp)1.3),Base cell,Aggregated cell,Siblings,Ancestor,Probe cell: satisfied Cprb,(
55、c4, c2) satisfies Cgrad!,July 22, 2020,Data Mining: Concepts and Techniques,85,A LiveSet-Driven Algorithm,Compute probe cells using Csig and Cprb The set of probe cells P is often very small Use probe P and constraints to find gradients Pushing selection deeply Set-oriented processing for probe cell
56、s Iceberg growing from low to high dimensionalities Dynamic pruning probe cells during growth Incorporating efficient iceberg cubing method,July 22, 2020,Data Mining: Concepts and Techniques,86,Summary,Data warehouse A multi-dimensional model of a data warehouse Star schema, snowflake schema, fact c
57、onstellations A data cube consists of dimensions & measures OLAP operations: drilling, rolling, slicing, dicing and pivoting OLAP servers: ROLAP, MOLAP, HOLAP Efficient computation of data cubes Partial vs. full vs. no materialization Multiway array aggregation Bitmap index and join index implementa
58、tions Further development of data cube technology Discovery-drive and multi-feature cubes From OLAP to OLAM (on-line analytical mining),July 22, 2020,Data Mining: Concepts and Techniques,87,References (I),S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB96 D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD97. R. Agrawal, A. Gupta, and S. Sa
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 营养缺乏与甲状腺功能
- 企业培训年度总结课件
- 英语九年级全一册作文范文(21篇)
- 餐饮行业品牌营销与技术合作合同
- 车辆借用与代驾服务协议合同
- 车辆抵押贷款合同争议解决条款
- 餐饮废弃物处理与城市景观美化项目合同
- 草莓种子种植技术专利授权与销售合同
- 车库租赁与车位租赁合同模板
- 金融理财产品销售代理合同参考文本
- 儿童版心肺复苏课件
- 桌游店创业初期计划书
- 高中拔尖创新人才培养模式的探索与实践研究
- 建筑工程危险源辨识与风险评价表2024版
- 道路维修施工安全措施
- 2025年中国操作系统市场调查研究报告
- 2025-2030中国高流量呼吸湿化治疗仪行业市场现状分析及竞争格局与投资发展研究报告
- 代谢性疾病的风险评估与健康管理
- 2025年氢溴酸行业市场需求分析报告及未来五至十年行业预测报告
- 药学技师考试题及答案
- 2025春季学期国开电大专科《管理学基础》期末纸质考试总题库
评论
0/150
提交评论