数据仓库技术讲座_第1页
数据仓库技术讲座_第2页
数据仓库技术讲座_第3页
数据仓库技术讲座_第4页
数据仓库技术讲座_第5页
已阅读5页,还剩62页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

2020年5月16日星期六,DataWarehousingandOLAPTechnology,1,数据仓库和OLAP技术,什么是数据仓库(Whatisadatawarehouse)?多维数据模型(Amulti-dimensionaldatamodel)数据仓库体系结构(Datawarehousearchitecture)数据仓库实现(Datawarehouseimplementation)FurtherdevelopmentofdatacubetechnologyFromdatawarehousingtodatamining,2020年5月16日星期六,DataWarehousingandOLAPTechnology,2,数据库的定义,传统的数据库技术是以单一的数据资源为中心,同时进行从事务处理,批处理到决策分析的各类处理;数据库主要是为自动化,精简工作任务和高速数据采集服务的。它的运行是事务驱动,面向应用的,数据库的根本任务是完成数据操作,即及时安全地将当前事务所产生的记录保存下来。,2020年5月16日星期六,DataWarehousingandOLAPTechnology,3,两种不同的数据处理需求,计算机系统中存在着两类不同的数据处理需求,即:操作型处理(事务处理):主要是对一个或一组记录的查询和修改,这时候人们关心的是响应时间、数据的安全性和完整性;分析型处理(信息型处理):用于管理人员的决策分析,如DDS(decisionsupportsystem)、多维分析等。,2020年5月16日星期六,DataWarehousingandOLAPTechnology,4,为什么要建立数据仓库?,数据DATA,知识KNOWLEDGE,决定DECISIONS,PatternsTrendsFactsRelationsModelsAssociationsSequences,TargetMarketsFundsallocationTradingoptionsWheretoadvertiseCatalogmailinglistSalesgeography,财经的Financial经济的Economic政府Government销售分数Point-of-Sale人口统计学Demographic生活方式Lifestyle,痛苦:太多数据,无法作出正确判断!,2020年5月16日星期六,DataWarehousingandOLAPTechnology,5,WhatisDataWarehouse?,数据仓库是在企业管理和决策中面向主题的,集成的,与时间相关的和不可修改的数据集合“Adatawarehouseisasubject-oriented,integrated,time-variant,andnonvolatilecollectionofdatainsupportofmanagementsdecision-makingprocess.”W.H.InmonDatawarehousing:Theprocessofconstructingandusingdatawarehouses,2020年5月16日星期六,DataWarehousingandOLAPTechnology,6,DataWarehouseSubject-Oriented,Organizedaroundmajorsubjects,suchascustomer,product,sales.Focusingonthemodelingandanalysisofdatafordecisionmakers,notondailyoperationsortransactionprocessing.Provideasimpleandconciseviewaroundparticularsubjectissuesbyexcludingdatathatarenotusefulinthedecisionsupportprocess.,2020年5月16日星期六,DataWarehousingandOLAPTechnology,7,面向应用举例采购子系统:订单(订单号,供应商号,总金额,日期)订单细则(订单号,商品号,类别,单价,数量)供应商(供应商号,供应商名,地址,电话)销售子系统:顾客(顾客号,姓名,性别,年龄,地址,电话)销售(员工号,顾客号,商品号,数量,单价日期)库存管理子系统:领料单(领料单号,领料人,商品号,数量,日期)进料单(进料单号,订单号,进料人,收料人,日期)库存(商品号,库房号,库存量,日期)库房(库房号,仓库保管员,地点,库存商品描述)人事管理子系统:员工(员工号,姓名,性别,年龄,部门号)部门(部门号,部门名称,部门主管,电话),面向主题举例:商品:商品固有信息:商品号,商品名,类别,颜色等商品采购信息:商品号,供应商号,供应价,供应日期,供应量等商品销售信息:商品号,顾客号,售价,销售日期,销售量等商品库存信息:商品号,库房号,日期,库存量等供应商:供应商固有信息:供应商号,供应商名,地址,电话等供应商品信息:供应商号,商品号,供应价,供应日期,供应量等顾客:顾客固有信息:顾客号,顾客名,性别,年龄,住址,电话等顾客购物信息:顾客号,商品号,售价,购买日期,购买量等,2020年5月16日星期六,DataWarehousingandOLAPTechnology,8,DataWarehouseIntegrated,Constructedbyintegratingmultiple,heterogeneousdatasourcesrelationaldatabases,flatfiles,on-linetransactionrecordsDatacleaninganddataintegrationtechniquesareapplied.Ensureconsistencyinnamingconventions,encodingstructures,attributemeasures,etc.amongdifferentdatasourcesE.g.,Hotelprice:currency,tax,breakfastcovered,etc.Whendataismovedtothewarehouse,itisconverted.,2020年5月16日星期六,DataWarehousingandOLAPTechnology,9,DataWarehouseTimeVariant,Thetimehorizonforthedatawarehouseissignificantlylongerthanthatofoperationalsystems.Operationaldatabase:currentvaluedata.Datawarehousedata:provideinformationfromahistoricalperspective(e.g.,past5-10years)EverykeystructureinthedatawarehouseContainsanelementoftime,explicitlyorimplicitlyButthekeyofoperationaldatamayormaynotcontain“timeelement”.,2020年5月16日星期六,DataWarehousingandOLAPTechnology,10,DataWarehouseNon-Volatile,Aphysicallyseparatestoreofdatatransformedfromtheoperationalenvironment.Operationalupdateofdatadoesnotoccurinthedatawarehouseenvironment.Doesnotrequiretransactionprocessing,recovery,andconcurrencycontrolmechanismsRequiresonlytwooperationsindataaccessing:initialloadingofdataandaccessofdata.,2020年5月16日星期六,DataWarehousingandOLAPTechnology,11,DataWarehousevs.HeterogeneousDBMS,TraditionalheterogeneousDBintegration:Buildwrappers/mediatorsontopofheterogeneousdatabasesQuerydrivenapproachWhenaqueryisposedtoaclientsite,ameta-dictionaryisusedtotranslatethequeryintoqueriesappropriateforindividualheterogeneoussitesinvolved,andtheresultsareintegratedintoaglobalanswersetComplexinformationfiltering,competeforresourcesDatawarehouse:update-driven,highperformanceInformationfromheterogeneoussourcesisintegratedinadvanceandstoredinwarehousesfordirectqueryandanalysis,2020年5月16日星期六,DataWarehousingandOLAPTechnology,12,DataWarehousevs.OperationalDBMS,OLTP(on-linetransactionprocessing)MajortaskoftraditionalrelationalDBMSDay-to-dayoperations:purchasing,inventory,banking,manufacturing,payroll,registration,accounting,etc.OLAP(on-lineanalyticalprocessing)MajortaskofdatawarehousesystemDataanalysisanddecisionmakingDistinctfeatures(OLTPvs.OLAP):Userandsystemorientation:customervs.marketDatacontents:current,detailedvs.historical,consolidatedDatabasedesign:ER+applicationvs.star+subjectView:current,localvs.evolutionary,integratedAccesspatterns:updatevs.read-onlybutcomplexqueries,2020年5月16日星期六,DataWarehousingandOLAPTechnology,13,OLTPvs.OLAP,2020年5月16日星期六,DataWarehousingandOLAPTechnology,14,WhySeparateDataWarehouse?,HighperformanceforbothsystemsDBMStunedforOLTP:accessmethods,indexing,concurrencycontrol,recoveryWarehousetunedforOLAP:complexOLAPqueries,multidimensionalview,consolidation.Differentfunctionsanddifferentdata:missingdata:DecisionsupportrequireshistoricaldatawhichoperationalDBsdonottypicallymaintaindataconsolidation:DSrequiresconsolidation(aggregation,summarization)ofdatafromheterogeneoussourcesdataquality:differentsourcestypicallyuseinconsistentdatarepresentations,codesandformatswhichhavetobereconciled,2020年5月16日星期六,DataWarehousingandOLAPTechnology,15,DataWarehousingandOLAPTechnology,Whatisadatawarehouse?Amulti-dimensionaldatamodelDatawarehousearchitectureDatawarehouseimplementationFurtherdevelopmentofdatacubetechnologyFromdatawarehousingtodatamining,2020年5月16日星期六,DataWarehousingandOLAPTechnology,16,FromTablesandSpreadsheetstoDataCubes,AdatawarehouseisbasedonamultidimensionaldatamodelwhichviewsdataintheformofadatacubeAdatacube,suchassales,allowsdatatobemodeledandviewedinmultipledimensionsDimensiontables,suchasitem(item_name,brand,type),ortime(day,week,month,quarter,year)Facttablecontainsmeasures(suchasdollars_sold)andkeystoeachoftherelateddimensiontablesIndatawarehousingliterature,ann-Dbasecubeiscalledabasecuboid.Thetopmost0-Dcuboid,whichholdsthehighest-levelofsummarization,iscalledtheapexcuboid.Thelatticeofcuboidsformsadatacube.,2020年5月16日星期六,DataWarehousingandOLAPTechnology,17,Cube:ALatticeofCuboids,all,time,item,location,supplier,time,item,time,location,time,supplier,item,location,item,supplier,location,supplier,time,item,location,time,item,supplier,time,location,supplier,item,location,supplier,time,item,location,supplier,0-D(apex)cuboid,1-Dcuboids,2-Dcuboids,3-Dcuboids,4-D(base)cuboid,2020年5月16日星期六,DataWarehousingandOLAPTechnology,18,ConceptualModelingofDataWarehouses,Modelingdatawarehouses:dimensionsweekyearSet_groupinghierarchy1.10inexpensive,2020年5月16日星期六,DataWarehousingandOLAPTechnology,29,MultidimensionalData,Salesvolumeasafunctionofproduct,month,andregion,Product,Region,Month,Dimensions:Product,Location,TimeHierarchicalsummarizationpaths,IndustryRegionYearCategoryCountryQuarterProductCityMonthWeekOfficeDay,2020年5月16日星期六,DataWarehousingandOLAPTechnology,30,ASampleDataCube,TotalannualsalesofTVinU.S.A.,2020年5月16日星期六,DataWarehousingandOLAPTechnology,31,CuboidsCorrespondingtotheCube,all,product,date,country,product,date,product,country,date,country,product,date,country,0-D(apex)cuboid,1-Dcuboids,2-Dcuboids,3-D(base)cuboid,2020年5月16日星期六,DataWarehousingandOLAPTechnology,32,BrowsingaDataCube,VisualizationOLAPcapabilitiesInteractivemanipulation,2020年5月16日星期六,DataWarehousingandOLAPTechnology,33,TypicalOLAPOperations,Rollup(drill-up):summarizedatabyclimbinguphierarchyorbydimensionreductionDrilldown(rolldown):reverseofroll-upfromhigherlevelsummarytolowerlevelsummaryordetaileddata,orintroducingnewdimensionsSliceanddice:projectandselectPivot(rotate):reorientthecube,visualization,3Dtoseriesof2Dplanes.Otheroperationsdrillacross:involving(across)morethanonefacttabledrillthrough:throughthebottomlevelofthecubetoitsback-endrelationaltables(usingSQL),2020年5月16日星期六,DataWarehousingandOLAPTechnology,34,AStar-NetQueryModel,ShippingMethod,AIR-EXPRESS,TRUCK,ORDER,CustomerOrders,CONTRACTS,Customer,Product,PRODUCTGROUP,PRODUCTLINE,PRODUCTITEM,SALESPERSON,DISTRICT,DIVISION,Organization,Promotion,CITY,COUNTRY,REGION,Location,DAILY,QTRLY,ANNUALY,Time,Eachcircleiscalledafootprint,2020年5月16日星期六,DataWarehousingandOLAPTechnology,35,DataWarehousingandOLAPTechnologyforDataMining,Whatisadatawarehouse?Amulti-dimensionaldatamodelDatawarehousearchitectureDatawarehouseimplementationFurtherdevelopmentofdatacubetechnologyFromdatawarehousingtodatamining,2020年5月16日星期六,DataWarehousingandOLAPTechnology,36,DesignofaDataWarehouse:ABusinessAnalysisFramework,FourviewsregardingthedesignofadatawarehouseTop-downviewallowsselectionoftherelevantinformationnecessaryforthedatawarehouseDatasourceviewexposestheinformationbeingcaptured,stored,andmanagedbyoperationalsystemsDatawarehouseviewconsistsoffacttablesanddimensiontablesBusinessqueryviewseestheperspectivesofdatainthewarehousefromtheviewofend-user,2020年5月16日星期六,DataWarehousingandOLAPTechnology,37,DataWarehouseDesignProcess,Top-down,bottom-upapproachesoracombinationofbothTop-down:Startswithoveralldesignandplanning(mature)Bottom-up:Startswithexperimentsandprototypes(rapid)FromsoftwareengineeringpointofviewWaterfall:structuredandsystematicanalysisateachstepbeforeproceedingtothenextSpiral:rapidgenerationofincreasinglyfunctionalsystems,shortturnaroundtime,quickturnaroundTypicaldatawarehousedesignprocessChooseabusinessprocesstomodel,e.g.,orders,invoices,etc.Choosethegrain(atomiclevelofdata)ofthebusinessprocessChoosethedimensionsthatwillapplytoeachfacttablerecordChoosethemeasurethatwillpopulateeachfacttablerecord,2020年5月16日星期六,DataWarehousingandOLAPTechnology,38,Multi-TieredArchitecture,DataWarehouse,OLAPEngine,AnalysisQueryReportsDatamining,Monitor&Integrator,Metadata,DataSources,Front-EndTools,Serve,DataMarts,DataStorage,OLAPServer,2020年5月16日星期六,DataWarehousingandOLAPTechnology,39,体系结构Pieter,1998,数据仓库的焦点问题-数据的获得、存储和使用,Relational,Package,Legacy,Externalsource,DataCleanTool,DataStaging,EnterpriseDataWarehouse,Datamart,Datamart,RDBMSROLAP,RDBMS,数据仓库和集市的加载能力至关重要数据仓库和集市的查询输出能力至关重要,ETL工具,去掉操作型数据库中的不需要的数据统一转换数据的名称和定义计算汇总数据和派生数据估计遗失数据的缺省值调节源数据的定义变化,2020年5月16日星期六,DataWarehousingandOLAPTechnology,42,ThreeDataWarehouseModels,EnterprisewarehousecollectsalloftheinformationaboutsubjectsspanningtheentireorganizationDataMartasubsetofcorporate-widedatathatisofvaluetoaspecificgroupsofusers.Itsscopeisconfinedtospecific,selectedgroups,suchasmarketingdatamartIndependentvs.dependent(directlyfromwarehouse)datamartVirtualwarehouseAsetofviewsoveroperationaldatabasesOnlysomeofthepossiblesummaryviewsmaybematerialized,2020年5月16日星期六,DataWarehousingandOLAPTechnology,43,DataWarehouseDevelopment:ARecommendedApproach,Defineahigh-levelcorporatedatamodel,DataMart,DataMart,DistributedDataMarts,Multi-TierDataWarehouse,EnterpriseDataWarehouse,Modelrefinement,Modelrefinement,2020年5月16日星期六,DataWarehousingandOLAPTechnology,44,OLAPServerArchitectures,RelationalOLAP(ROLAP)Userelationalorextended-relationalDBMStostoreandmanagewarehousedataandOLAPmiddlewaretosupportmissingpiecesIncludeoptimizationofDBMSbackend,implementationofaggregationnavigationlogic,andadditionaltoolsandservicesgreaterscalabilityMultidimensionalOLAP(MOLAP)Array-basedmultidimensionalstorageengine(sparsematrixtechniques)fastindexingtopre-computedsummarizeddataHybridOLAP(HOLAP)Userflexibility,e.g.,lowlevel:relational,high-level:arraySpecializedSQLserversspecializedsupportforSQLqueriesoverstar/snowflakeschemas,2020年5月16日星期六,DataWarehousingandOLAPTechnology,45,DataWarehousingandOLAPTechnologyforDataMining,Whatisadatawarehouse?Amulti-dimensionaldatamodelDatawarehousearchitectureDatawarehouseimplementationFurtherdevelopmentofdatacubetechnologyFromdatawarehousingtodatamining,2020年5月16日星期六,DataWarehousingandOLAPTechnology,46,EfficientDataCubeComputation,DatacubecanbeviewedasalatticeofcuboidsThebottom-mostcuboidisthebasecuboidThetop-mostcuboid(apex)containsonlyonecellHowmanycuboidsinann-dimensionalcubewithLlevels?MaterializationofdatacubeMaterializeevery(cuboid)(fullmaterialization),none(nomaterialization),orsome(partialmaterialization)SelectionofwhichcuboidstomaterializeBasedonsize,sharing,accessfrequency,etc.,2020年5月16日星期六,DataWarehousingandOLAPTechnology,47,CubeOperation,CubedefinitionandcomputationinDMQLdefinecubesalesitem,city,year:sum(sales_in_dollars)computecubesalesTransformitintoaSQL-likelanguage(withanewoperatorcubeby,introducedbyGrayetal.96)SELECTitem,city,year,SUM(amount)FROMSALESCUBEBYitem,city,yearNeedcomputethefollowingGroup-Bys(date,product,customer),(date,product),(date,customer),(product,customer),(date),(product),(customer)(),(item),(city),(),(year),(city,item),(city,year),(item,year),(city,item,year),2020年5月16日星期六,DataWarehousingandOLAPTechnology,48,CubeComputation:ROLAP-BasedMethod,EfficientcubecomputationmethodsROLAP-basedcubingalgorithms(Agarwaletal96)Array-basedcubingalgorithm(Zhaoetal97)Bottom-upcomputationmethod(Bayer&Ramarkrishnan99)ROLAP-basedcubingalgorithmsSorting,hashing,andgroupingoperationsareappliedtothedimensionattributesinordertoreorderandclusterrelatedtuplesGroupingisperformedonsomesubaggregatesasa“partialgroupingstep”Aggregatesmaybecomputedfrompreviouslycomputedaggregates,ratherthanfromthebasefacttable,2020年5月16日星期六,DataWarehousingandOLAPTechnology,49,CubeComputation:ROLAP-BasedMethod(2),ThisisnotinthetextbookbutinaresearchpaperHash/sortbasedmethods(Agarwalet.al.VLDB96)Smallest-parent:computingacuboidfromthesmallestcubodpreviouslycomputedcuboid.Cache-results:cachingresultsofacuboidfromwhichothercuboidsarecomputedtoreducediskI/OsAmortize-scans:computingasmanyaspossiblecuboidsatthesametimetoamortizediskreadsShare-sorts:sharingsortingcostscrossmultiplecuboidswhensort-basedmethodisusedShare-partitions:sharingthepartitioningcostcrossmultiplecuboidswhenhash-basedalgorithmsareused,2020年5月16日星期六,DataWarehousingandOLAPTechnology,50,Multi-wayArrayAggregationforCubeComputation,Partitionarraysintochunks(asmallsubcubewhichfitsinmemory).Compressedsparsearrayaddressing:(chunk_id,offset)Computeaggregatesin“multiway”byvisitingcubecellsintheorderwhichminimizesthe#oftimestovisiteachcell,andreducesmemoryaccessandstoragecost.,Whatisthebesttraversingordertodomulti-wayaggregation?,2020年5月16日星期六,DataWarehousingandOLAPTechnology,51,Multi-wayArrayAggregationforCubeComputation,B,2020年5月16日星期六,DataWarehousingandOLAPTechnology,52,Multi-wayArrayAggregationforCubeComputation,A,B,29,30,31,32,1,2,3,4,5,9,13,14,15,16,64,63,62,61,48,47,46,45,a1,a0,c3,c2,c1,c0,b3,b2,b1,b0,a2,a3,C,44,28,56,40,24,52,36,20,60,B,2020年5月16日星期六,DataWarehousingandOLAPTechnology,53,Multi-WayArrayAggregationforCubeComputation(Cont.),Method:theplanesshouldbesortedandcomputedaccordingtotheirsizeinascendingorder.SeethedetailsofExample2.12(pp.75-78)Idea:keepthesmallestplaneinthemainmemory,fetchandcomputeonlyonechunkatatimeforthelargestplaneLimitationofthemethod:computingwellonlyforasmallnumberofdimensionsIftherearealargenumberofdimensions,“bottom-upcomputation”andicebergcubecomputationmethodscanbeexplored,2020年5月16日星期六,DataWarehousingandOLAPTechnology,54,IndexingOLAPData:BitmapIndex,IndexonaparticularcolumnEachvalueinthecolumnhasabitvector:bit-opisfastThelengthofthebitvector:#ofrecordsinthebasetableThei-thbitissetifthei-throwofthebasetablehasthevaluefortheindexedcolumnnotsuitableforhighcardinalitydomains,Basetable,IndexonRegion,IndexonType,2020年5月16日星期六,DataWarehousingandOLAPTechnology,55,IndexingOLAPData:JoinIndices,Joinindex:JI(R-id,S-id)whereR(R-id,)S(S-id,)TraditionalindicesmapthevaluestoalistofrecordidsItmaterializesrelationaljoininJIfileandspeedsuprelationaljoinarathercostlyoperationIndatawarehouses,joinindexrelatesthevaluesofthedimensionsofastartschematorowsinthefacttable.E.g.facttable:SalesandtwodimensionscityandproductAjoinindexoncitymaintainsforeachdistinctcityalistofR-IDsofthetuplesrecordingtheSalesinthecityJoinindicescanspanmultipledimensions,2020年5月16日星期六,DataWarehousingandOLAPTechnology,56,EfficientProcessingOLAPQueries,Determinewhichoperationsshouldbeperformedontheavailablecuboids:transformdrill,roll,ocorrespondingSQLand/orOLAPoperations,e.g,dice=selection+projectionDeterminetowhichmaterializedcuboid(s)therelevantoperationsshouldbeapplied.Exploringindexingstructuresandcompressedvs.densearraystructuresinMOLAP,2020年5月16日星期六,DataWarehousingandOLAPTechnology,57,MetadataRepository,Metadataisthedatadefiningwarehouseobjects.IthasthefollowingkindsDescriptionofthestructureofthewarehouseschema,view,dimensions,hierarchies,deriveddatadefn,datamartlocationsandcontentsOperationalmeta-datadatalineage(historyofmigrateddataandtransformationpath),currencyofdata(active,archived,orpurged),monitoringinformation(warehouseusagestatistics,errorreports,audittrails)ThealgorithmsusedforsummarizationThemappingfromoperationalenvironmenttothedatawarehouseDatarelatedtosystemperformancewarehouseschema,viewandderiveddatadefinitionsBusinessdatabusinesstermsanddefinitions,ownershipofdata,chargingpolicies,2020年5月16日星期六,DataWarehousingandOLAPTechnology,58,DataWarehouseBack-EndToolsandUtilities,Dataextraction:getdatafrommultiple,heterogeneous,andexternalsourcesDatacleaning:detecterrorsinthedataandrectifythemwhenpossibleDatatransformation:convertdatafromlegacyorhostformattowarehouseformatLoad:sort,summarize,consolidate,computeviews,checkintegrity,andbuildindiciesandpartitionsRefreshpropagatetheupdatesfromthedatasourcestothewarehouse,2020年5月16日星期六,DataWarehousingandOLAPTechnology,59,Discovery-DrivenExplorationofDataCubes,Hypothesis-driven:explorationbyuser,hugesearchspaceDiscovery-driven(Sarawagietal.98)pre-computemeasuresindicatingexceptions,guideuserinthedataanalysis,atalllevelsofaggregationException:significantlydifferentfromthevalueanticipated,basedonastatisticalmodelVis

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论