




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、2021年10月16日星期六Data Mining: Concepts and Techniques1Data Mining: Concepts and Techniques Chapter 3 2021年10月16日星期六Data Mining: Concepts and Techniques2Chapter 3: Data Warehousing and OLAP Technology: An OverviewnWhat is a data warehouse? nA multi-dimensional data modelnData warehouse architecturenData
2、 warehouse implementationnFrom data warehousing to data mining2021年10月16日星期六Data Mining: Concepts and Techniques3What is Data Warehouse?n“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of managements decision-making process.”W. H. Inmo
3、nnData warehousing:nThe process of constructing and using data warehouses2021年10月16日星期六Data Mining: Concepts and Techniques4Data WarehouseSubject-OrientednOrganized around major subjects, such as customer, product, salesnProvide a simple and concise view around particular subject issues by excluding
4、 data that are not useful in the decision support process2021年10月16日星期六Data Mining: Concepts and Techniques5Data WarehouseIntegratednConstructed by integrating multiple, heterogeneous data sourcesnData cleaning and data integration techniques are applied.nEnsure consistency in naming conventions, en
5、coding structures, attribute measures, etc. among different data sources. E.g., Hotel price: currency, tax, breakfast covered, etc.2021年10月16日星期六Data Mining: Concepts and Techniques6Data WarehouseTime VariantnThe time horizon for the data warehouse is significantly longer than that of operational sy
6、stemsnData warehouse data: provide information from a historical perspective (e.g., past 5-10 years)nEvery key structure in the data warehouse Contains an element of time, explicitly or implicitly2021年10月16日星期六Data Mining: Concepts and Techniques7Data WarehouseNonvolatilenOperational update of data
7、does not occur in the data warehouse environmentnDoes not require transaction processing, recovery, and concurrency control mechanismsnRequires only two operations in data accessing: initial loading of data and access of data2021年10月16日星期六Data Mining: Concepts and Techniques8Data Warehouse vs. Heter
8、ogeneous DBMSnTraditional heterogeneous DB integration: A query driven approachnBuild wrappers/mediators on top of heterogeneous databases nComplex information filtering, compete for resourcesnData warehouse: update-driven, high performancenInformation is integrated in advance and stored in warehous
9、es for direct query and analysis2021年10月16日星期六Data Mining: Concepts and Techniques9Data Warehouse vs. Operational DBMSnOLTP (on-line transaction processing)nMajor task of traditional relational DBMSnDay-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounti
10、ng, etc.nOLAP (on-line analytical processing)nMajor task of data warehouse systemnData analysis and decision making2021年10月16日星期六Data Mining: Concepts and Techniques10 OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-orient
11、ed subject-oriented data current, up-to-date detailed, flat relational isolated historical, summarized, multidimensional integrated, consolidated usage repetitive ad-hoc access read/write index/hash on prim. key lots of scans unit of work short, simple transaction complex query # records accessed te
12、ns millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response 2021年10月16日星期六Data Mining: Concepts and Techniques11Why Separate Data Warehouse?nHigh performance for both systemsnDBMS tuned for OLTPnWarehousetuned for OLAPnDifferent functions
13、and different data:nmissing data: Decision support requires historical data which operational DBs do not typically maintainndata consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sourcesndata quality: different sources typically use inconsistent data re
14、presentations, codes and formats which have to be reconciled2021年10月16日星期六Data Mining: Concepts and Techniques12Chapter 3: Data Warehousing and OLAP Technology: An OverviewnWhat is a data warehouse? nA multi-dimensional data modelnData warehouse architecturenData warehouse implementationnFrom data w
15、arehousing to data mining2021年10月16日星期六Data Mining: Concepts and Techniques13From Tables and Spreadsheets to Data CubesnA data warehouse is based on a multidimensional data model which views data in the form of a data cubenA data cube, such as sales, allows data to be modeled and viewed in multiple
16、dimensionsnDimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) nFact table contains measures (such as dollars_sold) and keys to each of the related dimension tables2021年10月16日星期六Data Mining: Concepts and Techniques14Conceptual Modeling of Data Warehouses
17、nModeling data warehouses: dimensions & measuresnStar schema: A fact table in the middle connected to a set of dimension tables nSnowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tablesnFact constellations: Multiple fact tabl
18、es share dimension tables2021年10月16日星期六Data Mining: Concepts and Techniques15Example of Star Schema time_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcitystate_or_provincecountrylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasures
19、item_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranch2021年10月16日星期六Data Mining: Concepts and Techniques16Example of Snowflake Schematime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcity_keylocationSales Fact Table time_key item_key branch_key location_key u
20、nits_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_keyitembranch_keybranch_namebranch_typebranchsupplier_keysupplier_typesuppliercity_keycitystate_or_provincecountrycity2021年10月16日星期六Data Mining: Concepts and Techniques17Example of Fact Constellationtime_keydayday_of_the_week
21、monthquarteryeartimelocation_keystreetcityprovince_or_statecountrylocationSales Fact Tabletime_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranchShipping Fact Tabletime_key item_key shipper_k
22、ey from_location to_location dollars_cost units_shippedshipper_keyshipper_namelocation_keyshipper_typeshipper2021年10月16日星期六Data Mining: Concepts and Techniques18Measures of Data Cube: Three CategoriesnDistributive: if the result derived by applying the function to n aggregate values is the same as t
23、hat derived by applying the function on all the data without partitioningnE.g., count(), sum(), min(), max()nAlgebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate functionnE.g., avg(),
24、 min_N(), standard_deviation()nHolistic: if there is no constant bound on the storage size needed to describe a subaggregate. nE.g., median(), mode(), rank()2021年10月16日星期六Data Mining: Concepts and Techniques19A Concept Hierarchy: Dimension (location)allEuropeNorth_AmericaMexicoCanadaSpainGermanyVanc
25、ouverM. WindL. Chan.allregionofficecountryTorontoFrankfurtcity2021年10月16日星期六Data Mining: Concepts and Techniques20Multidimensional DatanSales volume as a function of product, month, and regionProductRegionMonthDimensions: Product, Location, TimeHierarchical summarization pathsIndustry Region YearCat
26、egory Country QuarterProduct City Month Week Office Day2021年10月16日星期六Data Mining: Concepts and Techniques21A Sample Data CubeTotal annual salesof TV in U.S.A.DateProductCountrysumsum TVVCRPC1Qtr2Qtr3Qtr4QtrU.S.ACanadaMexicosum2021年10月16日星期六Data Mining: Concepts and Techniques22Cuboids Corresponding
27、to the Cubeallproductdatecountryproduct,dateproduct,countrydate, countryproduct, date, country0-D(apex) cuboid1-D cuboids2-D cuboids3-D(base) cuboid2021年10月16日星期六Data Mining: Concepts and Techniques23Browsing a Data Cube2021年10月16日星期六Data Mining: Concepts and Techniques24Typical OLAP OperationsnRoll
28、 up (drill-up): summarize datanby climbing up hierarchy or by dimension reductionnDrill down (roll down): reverse of roll-upnfrom higher level summary to lower level summary or detailed data, or introducing new dimensionsnSlice and dice: project and select 2021年10月16日星期六Data Mining: Concepts and Tec
29、hniques25Typical OLAP OperationsnPivot (rotate): nreorient the cube, visualization, 3D to series of 2D planesnOther operationsndrill across: involving (across) more than one fact tablendrill through: through the bottom level of the cube to its back-end relational tables (using SQL)2021年10月16日星期六Data
30、 Mining: Concepts and Techniques26A Star-Net Query Model Shipping MethodAIR-EXPRESSTRUCKORDERCustomer OrdersCONTRACTSCustomerProductPRODUCT GROUPPRODUCT LINEPRODUCT ITEMSALES PERSONDISTRICTDIVISIONOrganizationPromotionCITYCOUNTRYREGIONLocationDAILYQTRLYANNUALYTimeEach circle is called a footprint202
31、1年10月16日星期六Data Mining: Concepts and Techniques27Chapter 3: Data Warehousing and OLAP Technology: An OverviewnWhat is a data warehouse? nA multi-dimensional data modelnData warehouse architecturenData warehouse implementationnFrom data warehousing to data mining2021年10月16日星期六Data Mining: Concepts an
32、d Techniques28Design of Data Warehouse: A Business Analysis FrameworknFour views regarding the design of a data warehouse nTop-down viewnData source viewnData warehouse viewnconsists of fact tables and dimension tablesnBusiness query view nsees the perspectives of data in the warehouse from the view
33、 of end-user2021年10月16日星期六Data Mining: Concepts and Techniques29Data Warehouse Design Process nFrom software engineering point of viewnWaterfall: structured and systematic analysis at each step before proceeding to the nextnSpiral: rapid generation of increasingly functional systems, short turn arou
34、nd time, quick turn around2021年10月16日星期六Data Mining: Concepts and Techniques30DataWarehouseExtractTransformLoadRefreshOLAP EngineAnalysisQueryReportsData miningMonitor&IntegratorMetadataData SourcesFront-End ToolsServeData MartsOperational DBsOthersourcesData StorageOLAP Server2021年10月16日星期六Data Min
35、ing: Concepts and Techniques31OLAP Server ArchitecturesnRelational OLAP (ROLAP) nUse relational or extended-relational DBMS to store and manage warehouse data and OLAP middle warenInclude optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and servicesn
36、Greater scalability2021年10月16日星期六Data Mining: Concepts and Techniques32OLAP Server ArchitecturesnMultidimensional OLAP (MOLAP) nSparse array-based multidimensional storage engine nFast indexing to pre-computed summarized datanHybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)nFlexibility, e.g., low lev
37、el: relational, high-level: arraynSpecialized SQL servers (e.g., Redbricks) nSpecialized support for SQL queries over star/snowflake schemas2021年10月16日星期六Data Mining: Concepts and Techniques33Chapter 3: Data Warehousing and OLAP Technology: An OverviewnWhat is a data warehouse? nA multi-dimensional
38、data modelnData warehouse architecturenData warehouse implementationnFrom data warehousing to data mining2021年10月16日星期六Data Mining: Concepts and Techniques34Efficient Data Cube ComputationnData cube can be viewed as a lattice of cuboids nThe bottom-most cuboid is the base cuboidnThe top-most cuboid
39、(apex) contains only one cellnHow many cuboids in an n-dimensional cube with L levels?)11(niiLT2021年10月16日星期六Data Mining: Concepts and Techniques35Efficient Data Cube ComputationnMaterialization of data cubenMaterialize every (cuboid) (full materialization), none (no materialization), or some (parti
40、al materialization)nSelection of which cuboids to materializenBased on size, sharing, access frequency, etc.2021年10月16日星期六Data Mining: Concepts and Techniques36Cube: A Lattice of Cuboidstime,itemtime,item,locationtime, item, location, supplieralltimeitemlocationsuppliertime,locationtime,supplieritem
41、,locationitem,supplierlocation,suppliertime,item,suppliertime,location,supplieritem,location,supplier0-D(apex) cuboid1-D cuboids2-D cuboids3-D cuboids4-D(base) cuboid2021年10月16日星期六Data Mining: Concepts and Techniques37Iceberg CubenComputing only the cuboid cells whose count or other aggregates satis
42、fying the condition like HAVING COUNT(*) = minsupnMotivation: Avoid explosive growth of the cubenSuppose 100 dimensions, only 1 base cell. How many aggregate cells if count = 1? What about count = 2?2021年10月16日星期六Data Mining: Concepts and Techniques38Indexing OLAP Data: Bitmap IndexnEach value in th
43、e column has a bit vector: bit-op is fastnThe length of the bit vector: # of records in the base tablennot suitable for high cardinality domainsCust Region TypeC1AsiaRetailC2EuropeDealerC3AsiaDealerC4America RetailC5EuropeDealerRecID Retail Dealer110201301410501RecIDAsia Europe America11002010310040
44、015010Base tableIndex on RegionIndex on Type2021年10月16日星期六Data Mining: Concepts and Techniques39Indexing OLAP Data: Join IndicesnJoin index: JI(R-id, S-id) where R (R-id, ) S (S-id, )nTraditional indices map the values to a list of record idsnIt materializes relational join in JI file and speeds up
45、relational join 2021年10月16日星期六Data Mining: Concepts and Techniques40Efficient Processing OLAP QueriesnDetermine which operations should be performed on the cuboidsnTransform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projectionnDetermine which materiali
46、zed cuboid(s) should be selected for OLAP op.nLet the query to be processed be on brand, province_or_state with the condition “year = 2004”, and there are 4 materialized cuboids available:1) year, item_name, city 2) year, brand, country3) year, brand, province_or_state4) item_name, province_or_state
47、 where year = 20042021年10月16日星期六Data Mining: Concepts and Techniques41Chapter 3: Data Warehousing and OLAP Technology: An OverviewnWhat is a data warehouse? nA multi-dimensional data modelnData warehouse architecturenData warehouse implementationnFrom data warehousing to data mining2021年10月16日星期六Dat
48、a Mining: Concepts and Techniques42Data Warehouse UsagenThree kinds of data warehouse applicationsnInformation processingnsupports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphsnAnalytical processingnmultidimensional analysis of data warehouse datansup
49、ports basic OLAP operations, slice-dice, drilling, pivotingnData miningnknowledge discovery from hidden patterns nsupports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools2021年10月16日星期六Data Mining: Con
50、cepts and Techniques43From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)nWhy online analytical mining?nHigh quality of data in data warehousesnDW contains integrated, consistent, cleaned datanAvailable information processing structure surrounding data warehousesnODBC, OLED
51、B, Web accessing, service facilities, reporting and OLAP toolsnOLAP-based exploratory data analysisnMining with drilling, dicing, pivoting, etc.nOn-line selection of data mining functionsnIntegration and swapping of multiple mining functions, algorithms, and tasks2021年10月16日星期六Data Mining: Concepts
52、and Techniques44An OLAM System ArchitectureData WarehouseMeta DataMDDBOLAMEngineOLAPEngineUser GUI APIData Cube APIDatabase APIData cleaningData integrationLayer3OLAP/OLAMLayer2MDDBLayer1Data RepositoryLayer4User InterfaceFiltering&IntegrationFilteringDatabasesMining queryMining result2021年10月16日星期六
53、Data Mining: Concepts and Techniques45Chapter 3: Data Warehousing and OLAP Technology: An OverviewnWhat is a data warehouse? nA multi-dimensional data modelnData warehouse architecturenData warehouse implementationnFrom data warehousing to data miningnSummary2021年10月16日星期六Data Mining: Concepts and T
54、echniques46Summary: Data Warehouse and OLAP TechnologynWhy data warehousing?nA multi-dimensional model of a data warehousenStar schema, snowflake schema, fact constellationsnA data cube consists of dimensions & measuresnOLAP operations: drilling, rolling, slicing, dicing and pivotingnData warehouse
55、architecturenOLAP servers: ROLAP, MOLAP, HOLAPnEfficient computation of data cubesnPartial vs. full vs. no materializationnIndexing OALP data: Bitmap index and join indexnOLAP query processing nFrom OLAP to OLAM (on-line analytical mining)2021年10月16日星期六Data Mining: Concepts and Techniques47References (I)nS. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB96nD. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD97n
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 门窗安装合伙协议书
- 酒店早茶承包协议书
- 邓州房屋认定协议书
- 分公司私下入股协议书
- 超市索赔和解协议书
- 转让手工工厂协议书
- 退租装修恢复协议书
- 高校帮扶县区协议书
- 金融公司代理协议书
- 餐饮经营占股协议书
- 2025至2030年中国全身螺旋CT扫描系统行业投资前景及策略咨询研究报告
- 1、人教部编版二年级下册语文看拼音写词语(一类生字和书后词语)
- 增材制造在虚拟现实辅助机械制造中的应用-洞察阐释
- 重庆金太阳2025届高三5月联考英语及答案
- 医院新建门急诊医技综合楼工程施工组织设计
- 外籍人员雇佣合同(中英文对照)6篇
- 装饰装修三级安全教育培训考试
- 尔雅《尊重学术道德遵守学术规范》期末考试答案0001
- 关联交易模板详解
- 政治经济学计算题附答案
- 人教版数学四年级下册7、8、9单元综合测试卷
评论
0/150
提交评论