Data Mining Primitives and Mining Languages.ppt

Uploaded by: j***; IP location: Henan; Uploaded: 2020-10-24; Format: PPT; Pages: 25; Size: 212.50 KB


Document summary

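Mining primitives, languages, and system architectures

  Data mining primitives
  Data mining languages
  Data mining system architectures
  Summary

Categories of data mining primitives

  Task-relevant data
  Kinds of knowledge to be mined
  Background knowledge
  Pattern interestingness measures
  Presentation and visualization of results

Task-relevant data

  Database (or data warehouse) name. Example: AllElectronics_db
  Database tables (or data warehouse cubes). Example: the tables item, customer, purchase, items_sold
  Data selection conditions. Example: select the data on items purchased in Canada during the current year. A selection condition may be stated at a higher conceptual level than the data stored in the DB/DW, e.g. "type = home entertainment" when the DB/DW stores tv, cd player, vcr.
  Relevant attributes (or dimensions). Example: the name and price attributes of item; the income and age attributes of customer. The system should have a mechanism for selecting relevant attributes automatically, for instance by evaluating how relevant each attribute is to the requested operation.
  Data grouping criteria. Example: group by date

Kinds of knowledge to be mined

  Characterization
  Discrimination
  Association
  Classification/prediction
  Clustering

  Example: a user who wants to discover the buying habits of customers in the AllElectronics database might choose the following association rule template:
    P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
  X is the primary key of the customer table; P and Q are predicate variables (defined over the relevant data); W, Y, and Z are object variables. Possible mining results:
    age(X, "30-39") ∧ income(X, "40k-49k") ⇒ buys(X, "VCR") [2.2%, 60%]
    occupation(X, "student") ∧ age(X, "20-29") ⇒ buys(X, "computer") [1.4%, 70%]

Background knowledge: concept hierarchies

  Schema hierarchy
    Example: street < city < province_or_state < country
  Set-grouping hierarchy
    Example: {young, middle_aged, senior} ⊂ all(age); {20-39} = young, {40-59} = middle_aged
  Operation-derived hierarchy
    Covers information decoding, information extraction from complex data objects, data clustering, data distribution analysis algorithms, and similar operations.
    Example: email address: login-name < department < university < country
  Rule-based hierarchy
    Example: low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50

  A user's expectations about relationships in the data can serve as interestingness measures for evaluating mined patterns.

Pattern interestingness measures

  Simplicity. Examples: (association) rule length, (decision) tree size
  Certainty. Examples: confidence, P(A|B) = n(A and B) / n(B); classification reliability or accuracy (also known as rule reliability, rule strength, rule quality, certainty factor, discriminating weight)
  Utility. Examples: support (association), s(A ⇒ B) = n(A and B) / n(all); noise threshold (description)
  Novelty. Examples: not previously known, surprising (used to remove redundant rules, e.g. Canada vs. Vancouver rule implication support ratio)

Visualization of discovered patterns

  A mining system should be able to display discovered patterns in multiple forms, such as rules, tables, reports, charts, graphs, decision trees, and cubes.
  A mining system should support multiple operations on mining results, such as drill-down, roll-up, slicing, dicing, and rotation.

DMQL: a data mining query language

  Motivation
    Provide interactive data mining capability
    Offer an SQL-like language
    Become a standard mining language, as SQL did for databases
    Serve as a foundation for system development and evolution
    Promote information exchange, technology transfer, commercialization, and wide acceptance
  Design
    DMQL is designed on top of the mining primitives introduced above.

Syntax for task-relevant data

    use database ⟨database_name⟩, or use data warehouse ⟨data_warehouse_name⟩
    from ⟨relation(s)/cube(s)⟩ [where ⟨condition⟩]
    in relevance to ⟨att_or_dim_list⟩
    order by ⟨order_list⟩
    group by ⟨grouping_list⟩
    having ⟨condition⟩

Syntax for task-relevant data (continued)

  Example: to mine associations among items frequently purchased by Canadian customers of AllElectronics, with respect to customer income and age, and with the data grouped by purchase date, the task-relevant data can be written as:

    use database AllElectronics_db
    in relevance to I.name, I.price, C.income, C.age
    from customer C, item I, purchase P, items_sold S
    where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID and P.cust_ID = C.cust_ID
    group by P.date

Syntax for the kind of knowledge to be mined

    ⟨Mine_Knowledge_Specification⟩ := ⟨Mine_Char⟩ | ⟨Mine_Discr⟩ | ⟨Mine_Assoc⟩ | ⟨Mine_Class⟩ | ⟨Mine_Pred⟩

    ⟨Mine_Char⟩ := mine characteristics [as ⟨pattern_name⟩] analyze ⟨measure(s)⟩
  Example:
    mine characteristics as customerPurchasing analyze count%

    ⟨Mine_Discr⟩ := mine comparison [as ⟨pattern_name⟩] for ⟨target_class⟩ where ⟨target_condition⟩ {versus ⟨contrast_class_i⟩ where ⟨contrast_condition_i⟩} analyze ⟨measure(s)⟩
  Example:
    mine comparison as purchaseGroups
    for bigSpenders where avg(I.price) >= $100
    versus budgetSpenders where avg(I.price) < $100
    analyze count

Syntax for the kind of knowledge to be mined (continued)

    ⟨Mine_Assoc⟩ := mine associations [as ⟨pattern_name⟩] [matching ⟨metapattern⟩]
  Example:
    mine associations as buyingHabits
    matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)

    ⟨Mine_Class⟩ := mine classification [as ⟨pattern_name⟩] analyze ⟨classifying_attribute_or_dimension⟩
  Example:
    mine classification as classifyingCustomerCreditRating analyze credit_info

    ⟨Mine_Pred⟩ := mine prediction [as ⟨pattern_name⟩] analyze ⟨prediction_attribute_or_dimension⟩ {set {⟨attribute_or_dimension_i⟩ = ⟨value_i⟩}}
  Example:
    mine prediction as predictItemPrice analyze price
    set category = "TV" and brand = "SONY"

Syntax for concept hierarchies

  Syntax:
    use hierarchy ⟨hierarchy⟩ for ⟨attribute_or_dimension⟩
  Each kind of concept hierarchy is defined in its own way.

  Schema hierarchy
    define hierarchy time_hierarchy on date as [date, month quarter, year]

  Set-grouping hierarchy
    define hierarchy age_hierarchy for age on customer as
      level1: {young, middle_aged, senior} < level0: all
      level2: {20, ..., 39} < level1: young
      level2: {40, ..., 59} < level1: middle_aged
      level2: {60, ..., 89} < level1: senior

Syntax for concept hierarchies (continued)

  Operation-derived hierarchy
    define hierarchy age_hierarchy for age on customer as
      {age_category(1), ..., age_category(5)} := cluster(default, age, 5) ...

  Rule-based hierarchy
    ... > $50) and ((price - cost) <= $250)

Syntax for interestingness measures

  Syntax:
    with [⟨interest_measure_name⟩] threshold = threshold_value
  Example:
    with support threshold = 0.05
    with confidence threshold = 0.7

Syntax for presenting discovered knowledge

  The user specifies the display form:
    display as ⟨result_form⟩
  To view results at different concept levels:
    ⟨Multilevel_Manipulation⟩ := roll up on ⟨attribute_or_dimension⟩ | drill down on ⟨attribute_or_dimension⟩ | add ⟨attribute_or_dimension⟩ | drop ⟨attribute_or_dimension⟩

A complete DMQL statement

    use database AllElectronics_db
    use hierarchy location_hierarchy for B.address
    mine characteristics as customerPurchasing
    analyze count%
    in relevance to C.age, I.type, I.place_made
    from customer C, item I, purchases P, items_sold S, works_at W, branch B
    where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
      and P.cust_ID = C.cust_ID and P.method_paid = "AmEx"
      and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
      and B.address = "Canada" and I.price >= 100
    with noise threshold = 0.05
    display as table

Other data mining languages

  Association rule languages
    MSQL (Imielinski & Virmani 99)
    MineRule (Meo, Psaila and Ceri 96)
    Query flocks, based on Datalog syntax (Tsur et al 98)
  OLE DB for DM (Microsoft 2000)
    Together with OLE DB and OLE DB for OLAP, it aims to standardize DB, DW, and DM.
    As of March 2000 it covered predictive modeling (classification & prediction) and clustering; it did not yet include characterization, discrimination, or association modeling.
  CRISP-DM (CRoss-Industry Standard Process for Data Mining)
    An international project involving database companies, data warehouse companies, and user companies.
    Its goal is to provide a platform and process structure for effective data mining.
    It emphasizes applying data mining technology to solve business problems.

Data mining system architecture

  Degree of coupling between the data mining system and the DB/DW system
    No coupling: files serve as the data source and hold the results. Not recommended.
    Loose coupling: the DB/DW is the data source, and query results are written to a file or to the DB/DW, but the data structures and query optimization methods provided by the DB/DW are not used.
    Semi-tight coupling: improves mining performance. Some mining primitives are implemented in the DB/DW, such as sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some statistic functions such as count, sum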
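The support and confidence measures from the interestingness slide, together with the "with support threshold" and "with confidence threshold" clauses, can be illustrated with a minimal Python sketch. It is not part of the slides. It assumes that each transaction is a set of item or attribute labels, that confidence for a rule A ⇒ B is the conditional-probability formula taken with the antecedent A as the conditioning event, and that the toy transactions and all function names are hypothetical.

```python
from typing import FrozenSet, Sequence

Transaction = FrozenSet[str]

# Hypothetical toy data: each transaction mixes customer attributes and items bought.
TOY_TRANSACTIONS = [
    frozenset({"age=30-39", "income=40k-49k", "VCR"}),
    frozenset({"age=30-39", "income=40k-49k"}),
    frozenset({"age=20-29", "occupation=student", "computer"}),
    frozenset({"age=30-39", "income=40k-49k", "VCR"}),
    frozenset({"age=20-29", "computer"}),
]
RULE_A = frozenset({"age=30-39", "income=40k-49k"})  # antecedent A
RULE_B = frozenset({"VCR"})                          # consequent B


def count_containing(transactions: Sequence[Transaction], items: FrozenSet[str]) -> int:
    """Number of transactions that contain every label in `items`."""
    return sum(1 for t in transactions if items <= t)


def support(transactions: Sequence[Transaction], a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """s(A => B) = n(A and B) / n(all)."""
    if not transactions:
        return 0.0
    return count_containing(transactions, a | b) / len(transactions)


def confidence(transactions: Sequence[Transaction], a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """n(A and B) / n(A): the share of transactions matching A that also match B."""
    n_a = count_containing(transactions, a)
    if n_a == 0:
        return 0.0
    return count_containing(transactions, a | b) / n_a


def is_interesting(transactions, a, b, min_support=0.05, min_confidence=0.7) -> bool:
    """Keep a rule only if it meets both thresholds (defaults mirror the slide's example)."""
    return (support(transactions, a, b) >= min_support
            and confidence(transactions, a, b) >= min_confidence)
```

With the toy data, two of five transactions contain both A and B and three contain A, so the rule meets the support threshold but falls just short of a 0.7 confidence threshold.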
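The set-grouping hierarchy age_hierarchy and the roll-up operation can be sketched in Python as well. This is only an illustration under stated assumptions: the age ranges are the ones listed in the hierarchy definition, ages outside 20 to 89 are treated as not covered, and the function names are hypothetical rather than part of DMQL.

```python
from collections import Counter
from typing import Iterable, Optional

# level2 ranges and their level1 concepts, as listed in age_hierarchy.
AGE_GROUPS = (
    (20, 39, "young"),
    (40, 59, "middle_aged"),
    (60, 89, "senior"),
)


def age_to_level1(age: int) -> Optional[str]:
    """Generalize a raw age (level2) to its level1 concept, or None if uncovered."""
    for low, high, label in AGE_GROUPS:
        if low <= age <= high:
            return label
    return None


def roll_up_ages(ages: Iterable[int]) -> Counter:
    """Roll up from level2 to level1: count customers per age concept."""
    counts: Counter = Counter()
    for age in ages:
        label = age_to_level1(age)
        if label is not None:
            counts[label] += 1
    return counts


def roll_up_to_all(level1_counts: Counter) -> int:
    """Roll up from level1 to level0 ('all'): a single total count."""
    return sum(level1_counts.values())
```

Drilling down is the reverse direction: starting from the level1 counts, the system returns to the level2 values that were grouped, which is why a real implementation would keep the raw data or a finer-grained aggregate available.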
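Because the task-relevant data clause of DMQL is modeled on SQL, the AllElectronics example can be approximated with an ordinary SQL join. The sketch below uses Python's standard sqlite3 module with an in-memory database. The column types, the sample rows, and the helper names are assumptions made for illustration; only the table names, column names, and join conditions come from the example. The "group by P.date" step is done in Python, since in DMQL it partitions the relevant tuples rather than aggregating them.

```python
import sqlite3
from collections import defaultdict


def build_toy_db() -> sqlite3.Connection:
    """Create a tiny in-memory stand-in for AllElectronics_db (hypothetical rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (cust_ID INTEGER PRIMARY KEY, income INTEGER, age INTEGER);
        CREATE TABLE item (item_ID INTEGER PRIMARY KEY, name TEXT, price REAL);
        CREATE TABLE purchase (trans_ID INTEGER PRIMARY KEY, cust_ID INTEGER, date TEXT);
        CREATE TABLE items_sold (trans_ID INTEGER, item_ID INTEGER);
        INSERT INTO customer VALUES (1, 45000, 34), (2, 20000, 22);
        INSERT INTO item VALUES (10, 'VCR', 250.0), (11, 'computer', 900.0);
        INSERT INTO purchase VALUES (100, 1, '2020-01-05'),
                                    (101, 2, '2020-01-05'),
                                    (102, 1, '2020-01-06');
        INSERT INTO items_sold VALUES (100, 10), (101, 11), (102, 11);
    """)
    return conn


# SQL counterpart of the "in relevance to ... from ... where ..." clauses.
RELEVANT_DATA_SQL = """
    SELECT P.date, I.name, I.price, C.income, C.age
    FROM customer C, item I, purchase P, items_sold S
    WHERE I.item_ID = S.item_ID
      AND S.trans_ID = P.trans_ID
      AND P.cust_ID = C.cust_ID
    ORDER BY P.date
"""


def relevant_data_by_date(conn: sqlite3.Connection) -> dict:
    """Return the relevant tuples partitioned by purchase date."""
    groups = defaultdict(list)
    for date, name, price, income, age in conn.execute(RELEVANT_DATA_SQL):
        groups[date].append((name, price, income, age))
    return dict(groups)
```

A mining algorithm would then take each date's partition as its input. Pushing the join into the database while keeping the mining step outside corresponds to the loose coupling described in the architecture section.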
