




已阅读5页,还剩81页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Data Mining: Principles and Algorithms Chapter 10.1 Mining Object, Spatial, and Multimedia Data Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign /hanj Date1Data Mining: Principles and Algorithms Date2Data Mining: Principles and Algorithms Mining Object, Spatial and Multi-Media Data nMining object data sets nMining spatial databases and data warehouses nSpatial DBMS nSpatial Data Warehousing nSpatial Data Mining nSpatiotemporal Data Mining nMining multimedia data nSummary Date3Data Mining: Principles and Algorithms Mining Complex Data Objects: Generalization of Structured Data nSet-valued attribute nGeneralization of each value in the set into its corresponding higher-level concepts nDerivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data nE.g., hobby = tennis, hockey, chess, violin, PC_games generalizes to sports, music, e_games nList-valued or a sequence-valued attribute nSame as set-valued attributes except that the order of the elements in the sequence should be observed in the generalization Date4Data Mining: Principles and Algorithms Generalizing Spatial and Multimedia Data nSpatial data: nGeneralize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage nRequire the merge of a set of geographic areas by spatial operations nImage data: nExtracted by aggregation and/or approximation nSize, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image nMusic data: nSummarize its melody: based on the approximate patterns that repeatedly occur in the segment nSummarized its style: based on its tone, tempo, or the major musical instruments played Date5Data Mining: Principles and Algorithms Generalizing Object Data nObject identifier ngeneralize to the lowest level of class in the class/subclass hierarchies nClass composition hierarchies ngeneralize only those closely related in semantics to the current one nConstruction and mining of object cubes nExtend the attribute-oriented induction method nApply a sequence of class-based generalization operators on different attributes nContinue until getting a small number of generalized objects that can be summarized as a concise in high-level terms nImplementation nExamine each attribute, generalize it to simple-valued data nConstruct a multidimensional data cube (object cube) nProblem: it is not always desirable to generalize a set of values to single-valued data Date6Data Mining: Principles and Algorithms Ex.: Plan Mining by Divide and Conquer nPlan: a sequence of actions nE.g., Travel (flight): nPlan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans) nE.g., Discover travel patterns in an air flight database, or nfind significant patterns from the sequences of actions in the repair of automobiles nMethod nAttribute-oriented induction on sequence data nA generalized travel plan: nDivide & conquer:Mine characteristics for each subsequence nE.g., big*: same airline, small-big: nearby region Date7Data Mining: Principles and Algorithms A Travel Database for Plan Mining nExample: Mining a travel planbase Travel plan table Airport info table Date8Data Mining: Principles and Algorithms Multidimensional Analysis nStrategy nGeneralize the planbase in different directions nLook for sequential patterns in the generalized plans nDerive high-level plans A multi-D model for the planbase Date9Data Mining: Principles and Algorithms Multidimensional Generalization Multi-Dimensional generalization of the planbase Merging consecutive, identical actions in plans Date10Data Mining: Principles and Algorithms Generalization-Based Sequence Mining nGeneralize planbase in multidimensional way using dimension tables nUse # of distinct values (cardinality) at each level to determine the right level of generalization (level- “planning”) nUse operators merge “+”, option “” to further generalize patterns nRetain patterns with significant support Date11Data Mining: Principles and Algorithms Generalized Sequence Patterns nAirportSize-sequence survives the min threshold (after applying merge operator): S-L+-S 35%, L+-S 30%, S-L+ 24.5%, L+ 9% nAfter applying option operator: S-L+-S 98.5% nMost of the time, people fly via large airports to get to final destination nOther plans: 1.5% of chances, there are other patterns: S-S, L-S-L Date12Data Mining: Principles and Algorithms Mining Object, Spatial and Multi-Media Data nMining object data sets nMining spatial databases and data warehouses nSpatial DBMS nSpatial Data Warehousing nSpatial Data Mining nSpatiotemporal Data Mining nMining multimedia data nSummary Date13Data Mining: Principles and Algorithms What Is a Spatial Database System? nGeometric, geographic or spatial data: space-related data nExample: Geographic space (2-D abstraction of earth surface), VLSI design, model of human brain, 3-D space representing the arrangement of chains of protein molecule. nSpatial database system vs. image database systems. nImage database system: handling digital raster image (e.g., satellite sensing, computer tomography), may also contain techniques for object analysis and extraction from images and some spatial database functionality. nSpatial (geometric, geographic) database system: handling objects in space that have identity and well-defined extents, locations, and relationships. Date14Data Mining: Principles and Algorithms GIS (Geographic Information System) nGIS (Geographic Information System) n Analysis and visualization of geographic data n Common analysis functions of GIS n Search (thematic search, search by region) n Location analysis (buffer, corridor, overlay) n Terrain analysis (slope/aspect, drainage network) n Flow analysis (connectivity, shortest path) n Distribution (nearest neighbor, proximity, change detection) n Spatial analysis/statistics (pattern, centrality, similarity, topology) n Measurements (distance, perimeter, shape, adjacency, direction) Date15Data Mining: Principles and Algorithms Spatial DBMS (SDBMS) n SDBMS is a software system that n supports spatial data models, spatial ADTs, and a query language supporting them n supports spatial indexing, spatial operations efficiently, and query optimization n can work with an underlying DBMS n Examples n Oracle Spatial Data Catridge n ESRI Spatial Data Engine Date16Data Mining: Principles and Algorithms Modeling Spatial Objects nWhat needs to be represented? nTwo important alternative views nSingle objects: distinct entities arranged in space each of which has its own geometric description nmodeling cities, forests, rivers nSpatially related collection of objects: describe space itself (about every point in space) nmodeling land use, partition of a country into districts Date17Data Mining: Principles and Algorithms Modeling Single Objects: Point, Line and Region nPoint: location only but not extent nLine (or a curve usually represented by a polyline, a sequence of line segment): nmoving through space, or connections in space (roads, rivers, cables, etc.) nRegion: nSomething having extent in 2D-space (country, lake, park). It may have a hole or consist of several disjoint pieces. Date18Data Mining: Principles and Algorithms Modeling Spatially Related Collection of Objects nModeling spatially related collection of objects: plane partitions and networks. nA partition: a set of region objects that are required to be disjoint (e.g., a thematic map). There exist often pairs of objects with a common boundary (adjacency relationship). nA network: a graph embedded into the plane, consisting of a set of point objects, forming its nodes, and a set of line objects describing the geometry of the edges, e.g., highways. rivers, power supply lines. nOther interested spatially related collection of objects: nested partitions, or a digital terrain (elevation) model. Date19Data Mining: Principles and Algorithms Spatial Data Types and Models n Field-based model: raster data n framework: partitioning of space n Object-based model: vector model n point, line, polygon, Objects, Attributes Date20Data Mining: Principles and Algorithms Spatial Query Language n Spatial query language n Spatial data types, e.g. point, line segment, polygon, n Spatial operations, e.g. overlap, distance, nearest neighbor, n Callable from a query language (e.g. SQL3) of underlying DBMS SELECT S.name FROMSenator S WHERE S.district.Area() 300 n Standards n SQL3 (a.k.a. SQL 1999) is a standard for query languages n OGIS is a standard for spatial data types and operators n Both standards enjoy wide support in industry Date21Data Mining: Principles and Algorithms Spatial Data Types by OGIS Date22Data Mining: Principles and Algorithms Query Processing n Efficient algorithms to answer spatial queries n Common Strategy: filter and refine n Filter: Query Region overlaps with MBRs (minimum bounding rectangles) of B, C, D n Refine: Query Region overlaps with B, C Date23Data Mining: Principles and Algorithms Join Query Processing n Determining Intersection Rectangle n Plane Sweep Algorithm n Place sweep filter identifies 5 intersections for refinement step Date24Data Mining: Principles and Algorithms File Organization and Indices n SDBMS: Dataset is in the secondary storage, e.g. disk n Space Filling Curves: An ordering on the locations in a multi-dimensional space n Linearize a multi-dimensional space n Helps search efficiently Date25Data Mining: Principles and Algorithms File Organization and Indices n Spatial Indexing n B-tree works on spatial data with space filling curve n R-tree: Heighted balanced extention of B+ tree n Objects are represented as MBR n provides better performance Date26Data Mining: Principles and Algorithms Spatial Query Optimization n A spatial operation can be processed using different strategies n Computation cost of each strategy depends on many parameters n Query optimization is the process of n ordering operations in a query and n selecting efficient strategy for each operation n based on the details of a given dataset Date27Data Mining: Principles and Algorithms Spatial Data Warehousing nSpatial data warehouse: Integrated, subject-oriented, time-variant, and nonvolatile spatial data repository nSpatial data integration: a big issue nStructure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.) nVendor-specific formats (ESRI, MapInfo, Integraph, IDRISI, etc.) nGeo-specific formats (geographic vs. equal area projection, etc.) nSpatial data cube: multidimensional spatial database nBoth dimensions and measures may contain spatial components Date28Data Mining: Principles and Algorithms Dimensions and Measures in Spatial Data Warehouse nDimensions nnon-spatial ne.g. “25-30 degrees” generalizes to“hot” (both are strings) nspatial-to-nonspatial ne.g. Seattle generalizes to description “Pacific Northwest” (as a string) nspatial-to-spatial ne.g. Seattle generalizes to Pacific Northwest (as a spatial region) nMeasures nnumerical (e.g. monthly revenue of a region) ndistributive (e.g. count, sum) nalgebraic (e.g. average) nholistic (e.g. median, rank) nspatial ncollection of spatial pointers (e.g. pointers to all regions with temperature of 25-30 degrees in July) Date29Data Mining: Principles and Algorithms Spatial-to-Spatial Generalization nGeneralize detailed geographic points into clustered regions, such as businesses, residential, industrial, or agricultural areas, according to land usage nRequires the merging of a set of geographic areas by spatial operations Dissolve Merge Clip Intersect Union Date30Data Mining: Principles and Algorithms Example: British Columbia Weather Pattern Analysis nInput nA map with about 3,000 weather probes scattered in B.C. nDaily data for temperature, precipitation, wind velocity, etc. nData warehouse using star schema nOutput nA map that reveals patterns: merged (similar) regions nGoals nInteractive analysis (drill-down, slice, dice, pivot, roll-up) nFast response time nMinimizing storage space used nChallenge nA merged region may contain hundreds of “primitive” regions (polygons) Date31Data Mining: Principles and Algorithms Star Schema of the BC Weather Warehouse nSpatial data warehouse nDimensions nregion_name ntime ntemperature nprecipitation nMeasurements nregion_map narea ncount Fact tableDimension table Date32Data Mining: Principles and Algorithms Dynamic Merging of Spatial Objects Materializing (precomputing) all?too much storage space On-line merge?slow, expensive Precompute rough approximations? accuracy trade off A better way: object-based, selective (partial) materialization Date33Data Mining: Principles and Algorithms Methods for Computing Spatial Data Cubes nOn-line aggregation: collect and store pointers to spatial objects in a spatial data cube nexpensive and slow, need efficient aggregation techniques nPrecompute and store all the possible combinations nhuge space overhead nPrecompute and store rough approximations in a spatial data cube naccuracy trade-off nSelective computation: only materialize those which will be accessed frequently na reasonable choice Date34Data Mining: Principles and Algorithms Spatial Association Analysis nSpatial association rule:A B s%, c% nA and B are sets of spatial or non-spatial predicates nTopological relations: intersects, overlaps, disjoint, etc. nSpatial orientations: left_of, west_of, under, etc. nDistance information: close_to, within_distance, etc. ns% is the support and c% is the confidence of the rule nExamples 1) is_a(x, large_town) intersect(x, highway) adjacent_to(x, water) 7%, 85% 2) What kinds of objects are typically located close to golf courses? Date35Data Mining: Principles and Algorithms Progressive Refinement Mining of Spatial Association Rules nHierarchy of spatial relationship: ng_close_to: near_by, touch, intersect, contain, etc. nFirst search for rough relationship and then refine it nTwo-step mining of spatial association: nStep 1: Rough spatial computation (as a filter) n Using MBR or R-tree for rough estimation nStep2: Detailed spatial algorithm (as refinement) n Apply only to those objects which have passed the rough spatial association test (no less than min_support) Date36Data Mining: Principles and Algorithms Mining Spatial Co-location Rules nCo-location rule is similar to association rule but explore more relying spatial auto-correlation nIt leads to efficient processing nIt can be integrated with progressive refinement to further improve its performance nSpatial co-location mining idea can be applied to clustering, classification, outlier analysis and other potential mining tasks Date37Data Mining: Principles and Algorithms Spatial Autocorrelation nSpatial data tends to be highly self-correlated nExample: Neighborhood, Temperature nItems in a traditional data are independent of each other, whereas properties of locations in a map are often “auto-correlated”. nFirst law of geography: “Everything is related to everything, but nearby things are more related than distant things.” Date38Data Mining: Principles and Algorithms Spatial Autocorrelation (contd) Date39Data Mining: Principles and Algorithms nMethods in classification nDecision-tree classification, Nave-Bayesian classifier + boosting, neural network, logistic regression, etc. nAssociation-based multi-dimensional classification - Example: classifying house value based on proximity to lakes, highways, mountains, etc. nAssuming learning samples are independent of each other nSpatial auto-correlation violates this assumption! nPopular spatial classification methods nSpatial auto-regression (SAR) nMarkov random field (MRF) Spatial Classification Date40Data Mining: Principles and Algorithms Spatial Auto-Regression nLinear Regression Y=X + nSpatial autoregressive regression (SAR) Y = WY + X + nW: neighborhood matrix. n models strength of spatial dependencies n error vector The estimates of and can be derived using maximum likelihood theory or Bayesian statistics Date41Data Mining: Principles and Algorithms Markov Random Field Based Bayesian Classifiers nBayesian classifiers nMRF nA set of random variables whose interdependency relationship is represented by an undirected graph (i.e., a symmetric neighborhood matrix) is called a Markov Random Field. nLi denotes set of labels in the neighborhood of si excluding labels at si nPr(Ci | Li) can be estimated from training data by examine the ratios of the frequencies of class labels to the total number of locations nPr(X|Ci, Li) can be estimated using kernel functions from the observed values in the training dataset Date42Data Mining: Principles and Algorithms SAR v.s. MRF Date43Data Mining: Principles and Algorithms nFunction nDetect changes and trends along a spatial dimension nStudy the trend of non-spatial or spatial data changing with space nApplication examples nObserve the trend of changes of the climate or vegetation with increasing distance from an ocean nCrime rate or unemployment rate change with regard to city geo-distribution Spatial Trend Analysis Date44Data Mining: Principles and Algorithms Spatial Cluster Analysis nMining clustersk-means, k-medoids, hierarchical, density-based, etc. nAnalysis of distinct features of the clusters Date45Data Mining: Principles and Algorithms Constraints-Based Clustering nConstraints on individual objects nSimple selection of relevant objects before clustering nClustering parameters as constraints nK-means, density-based: radius, min-# of points nConstraints specified on clusters using SQL aggregates nSum of the profits in each cluster $1 million nConstraints imposed by physical obstacles n Clustering with obstructed distanc
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
评论
0/150
提交评论