Mathematical Modeling: Data Preprocessing
Chapter 2: Data Preprocessing

  • Why preprocess data?
  • Data cleaning
  • Data integration
  • Data reduction
  • Discretization and concept hierarchy generation
  • Summary

Why Preprocess Data?

Real-world data is dirty:
  • Incomplete: attribute values are missing, attributes of interest are absent, or only aggregate data is available. Example: occupation=""
  • Noisy: the data contains errors or outliers. Example: Salary="-10"
  • Inconsistent: codes or names disagree. Examples:
      - Age="42" but Birthday="03/07/2010"
      - ratings that used to be "1, 2, 3" are now "A, B, C"
      - discrepancies between duplicate records

Why Is Data Dirty?

  • Incomplete data arises from:
      - values that were not captured when the data was collected
      - different considerations at collection time and at analysis time
      - human, hardware, or software problems
  • Noisy data arises from errors in:
      - collection
      - entry
      - transformation
  • Inconsistent data arises from:
      - different data sources
      - violations of functional dependencies

Why Is Data Preprocessing Important?

  • Without high-quality data there are no high-quality mining results.
      - High-quality decisions must rest on high-quality data. For example, duplicate or missing data may produce incorrect or misleading statistics.
      - A data warehouse needs consistent integration of high-quality data.
  • Data extraction, cleaning, and transformation make up most of the work of building a data warehouse (Bill Inmon).

Data Quality: A Multidimensional View

A widely accepted multidimensional view:
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness (is the data updated in a timely way?)
  • Believability
  • Interpretability
  • Accessibility

Major Tasks in Data Preprocessing

  • Data cleaning: fill in missing values, identify or remove outliers, smooth noisy data, and correct inconsistencies.
  • Data integration: integrate multiple databases, data cubes, or files.
  • Data transformation: normalization and aggregation.
  • Data reduction: obtain a reduced representation that is much smaller yet produces the same or similar analytical results. This covers dimensionality reduction, numerosity reduction, and data compression.
  • Data discretization and concept hierarchy generation.

Forms of Data Preprocessing

[Figure: forms of data preprocessing]

Data Cleaning

Real-world data is dirty. Much of it is potentially incorrect because of instrument faults, human or computer error, and transmission errors.
  • Incomplete: attribute values are missing, attributes of interest are absent, or only aggregate data is available, e.g., occupation="" (missing data)
  • Noisy: the data contains errors or outliers, e.g., Salary="-10" (an error)
  • Inconsistent: codes or names disagree, e.g.:
      - Age="42" but Birthday="03/07/2010"
      - ratings that used to be "1, 2, 3" are now "A, B, C"
      - discrepancies between duplicate records
  • Intentional (disguised missing data): e.g., Jan. 1 recorded as everyone's birthday?

How to Handle Missing Data?

  • Ignore the tuple: usually done when the class label is missing (assuming the task is classification). This is not effective when the percentage of missing values varies widely from attribute to attribute.
  • Fill in the missing value manually: tedious, time-consuming, and often infeasible.
  • Fill in the missing value automatically with:
      - a global constant, e.g., "unknown" (which may in effect create a new class)
      - the attribute mean
      - the attribute mean of all samples in the same class as the target tuple (smarter)
      - the most probable value, found by an inference-based method such as the Bayesian formula or a decision tree

Noisy Data

  • Noise: random error or variance in a measured variable.
  • Incorrect attribute values may be due to:
      - faulty data collection instruments
      - data entry problems
      - data transmission problems
      - technology limitations
      - inconsistent naming conventions
  • Other problems that require data cleaning:
      - duplicate records
      - incomplete data
      - inconsistent data

How to Handle Noisy Data?

  • Binning: sort the data and distribute it into equal-frequency or equal-width bins (buckets), then smooth by bin means, bin medians, or bin boundaries.
  • Clustering: detect and remove outliers.
  • Combined computer and human inspection: suspicious values are flagged automatically and checked by a person (e.g., to deal with possible outliers).
  • Regression: smooth the data by fitting it to a regression function.

Binning: A Simple Discretization Method

  • Equal-width (distance) partitioning:
      - Divides the range into N intervals of equal size (a uniform grid).
      - If A and B are the lowest and highest values of the attribute, the interval width is $W = (B - A)/N$.
      - Outliers may dominate the presentation.
      - Skewed data is not handled well.
  • Equal-frequency (equal-depth) partitioning:
      - Divides the range into N intervals, each containing approximately the same number of samples.
      - Good data scaling.
      - Categorical attributes can be tricky to manage.
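The two partitioning schemes above, together with smoothing by bin means and by bin boundaries, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are made up for this example, the price list is sample data, and breaking ties toward the lower boundary is an assumption the slides do not specify.

```python
from statistics import mean


def equal_width_bins(values, n):
    """Split values into n intervals of equal width W = (B - A) / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The maximum value would fall into bin n, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split sorted values into n bins holding nearly the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[mean(b)] * len(b) if b else [] for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer bin boundary (ties go to the lower one)."""
    smoothed = []
    for b in bins:
        if not b:
            smoothed.append([])
            continue
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Illustrative sample data: sorted prices in dollars.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth_bins = equal_depth_bins(prices, 3)
width_bins = equal_width_bins(prices, 3)
```

On this sample, equal-depth binning puts four prices in each bin, while equal-width binning (W = 10) puts three, three, and six prices in its bins, which shows how uneven the counts can become when the data is not uniformly spread.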
Binning Methods for Data Smoothing

  • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into (equi-depth) bins:
      - Bin 1: 4, 8, 9, 15
      - Bin 2: 21, 21, 24, 25
      - Bin 3: 26, 28, 29, 34
  • Smoothing by bin means (rounded to the nearest dollar):
      - Bin 1: 9, 9, 9, 9
      - Bin 2: 23, 23, 23, 23
      - Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
      - Bin 1: 4, 4, 4, 15
      - Bin 2: 21, 21, 25, 25
      - Bin 3: 26, 26, 26, 34

Cluster Analysis

[Figure: cluster analysis]

Regression

[Figure: data fitted by the regression line y = x + 1; axes x and y, with marked values X1 and Y1]

Data Cleaning as a Process

  • Data discrepancy detection:
      - Use metadata, i.e., knowledge about the properties of the data (e.g., domain, range, dependency, distribution).
      - Check for field overloading.
      - Check uniqueness rules, consecutive rules, and null rules.
      - Use commercial tools:
          - Data scrubbing: use simple domain knowledge (e.g., postal codes, spell checking) to detect and correct errors.
          - Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., outliers).
  • Data migration and integration:
      - Data migration tools: allow transformations to be specified.
      - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface.
  • Integration of the two processes:
      - The two processes run iteratively and interactively (e.g., Potter's Wheel).

Data Integration

  • Data integration: combines data from multiple sources into a coherent store.
  • Schema integration: integrate metadata from different sources.
  • Entity identification problem: identify real-world entities from multiple data sources, e.g., matching A.cust-id with B.cust-#.
  • Detecting and resolving data value conflicts:
      - For the same real-world entity, attribute values from different sources are different.
      - Possible reasons: different representations, different scales, e.g., metric vs. British units.

Handling Redundant Data in Data Integration

  • Redundant data occur often when multiple databases are integrated:
      - The same attribute may have different names in different databases.
      - One attribute may be a "derived" attribute in another table, e.g., annual revenue.
  • Redundant data may be able to be detected by correlational analysis.
  • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

Data Transformation

  • Smoothing: remove noise from data.
  • Aggregation: summarization, data cube construction.
  • Generalization: concept hierarchy climbing.
  • Normalization: scaled to fall within a small, specified range:
      - min-max normalization
      - z-score normalization
      - normalization by decimal scaling
  • Attribute/feature construction: new attributes constructed from the given ones.

Data Transformation: Normalization

  • Min-max normalization: $v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$
  • Z-score normalization: $v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$
  • Normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$

[...]

Reduced attribute set: A1, A4, A6

Heuristic Feature Selection Methods

  • There are $2^d$ possible sub-features of $d$ features.
  • Several heuristic feature selection methods:
      - Best single features under the feature independence assumption: choose by significance tests.
      - Best step-wise feature selection: the best single feature is picked first, then the next best feature conditioned on the first, ...
      - Step-wise feature elimination: repeatedly eliminate the worst feature.
      - Best combined feature selection and elimination.
      - Optimal branch and bound: use feature elimination and backtracking.

Data Compression

  • String compression:
      - There are extensive theories and well-tuned algorithms.
      - Typically lossless.
      - But only limited manipulation is possible without expansion.
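The step-wise feature selection heuristic listed under Heuristic Feature Selection Methods can be sketched as a greedy loop. This is only an illustration under stated assumptions: `forward_selection` and `toy_score` are hypothetical names, and the scoring function with its gain values is invented for the example. In practice the score would come from a significance test or a model evaluation.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    features: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    max_features: int,
) -> List[str]:
    """Greedy step-wise selection: repeatedly add the feature that most improves the score."""
    selected: List[str] = []
    best_score = score(tuple(selected))
    remaining = list(features)
    while remaining and len(selected) < max_features:
        # Score each candidate conditioned on the features already chosen.
        candidates = [(score(tuple(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # No remaining feature improves the subset, so stop.
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected


def toy_score(subset: Tuple[str, ...]) -> float:
    """Hypothetical score: made-up per-feature gains minus a size penalty."""
    gains = {"A1": 0.5, "A2": 0.05, "A3": 0.02, "A4": 0.3, "A5": 0.01, "A6": 0.2}
    return sum(gains[f] for f in subset) - 0.1 * len(subset)


chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score, 6)
```

The loop evaluates at most d + (d - 1) + ... candidate subsets instead of all $2^d$, which is the point of using a heuristic. The price is that a greedy choice made early is never revisited, which is what the combined selection-and-elimination and branch-and-bound variants try to address.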
