




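Graduation Project (Thesis): Foreign Literature Translation

Major: Computer Science and Technology
Student name:
Class:
Student ID:
Advisor:
Boya College

Chinese Translation

Introduction to Data Mining

Abstract: Microsoft SQL Server 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios, including targeted mailing, forecasting, and sequence clustering, to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools.

Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components of SQL Server 2005 are the workspaces used to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project while disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in detail later. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.

All of the data mining tools exist in the data mining editor. Using the editor, you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.

Your project will often contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool, the Mining Accuracy Chart tab. Using this tool, you can compare the predictive accuracy of your models and determine the best one.

To create predictions, you use the Data Mining Extensions (DMX) language. DMX extends traditional SQL syntax with commands to create, modify, and predict against mining models. For more information about DMX, see "Data Mining Extensions (DMX) Reference" in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which lets you build DMX queries in a graphical interface. You can also view the automatically generated DMX statements in the tool.

Just as important as the tools introduced above is the structure of the mining model itself. The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you give it and converts them into a usable mining model.

Some of the most important steps in building a data mining solution are consolidating and preparing the data used to create the mining models. SQL Server 2005 includes a Data Transformation Services (DTS) working environment with tools for cleaning, validating, and preparing data. For more information about DTS, see "DTS Data Mining Tasks and Transformations" in SQL Server Books Online.

Adventure Works Database

The AdventureWorksDW database is based on a fictional bicycle manufacturing company named Adventure Works Cycles (AW for short). AW produces and sells metal and composite bicycles to commercial markets in North America, Europe, and Asia. Its main operations are in Bothell, Washington, with 500 employees, and several regional sales teams are spread across its markets.

AW sells its products wholesale and retail through the Internet. The data mining examples in this tutorial use this Internet sales data. For more information about the AW database, see "Sample Databases and Business Scenarios" in SQL Server Books Online.

Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

North America (83%)
Europe (12%)
Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004. The products in the database are broken down by subcategory, model, and product.

Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools for creating business intelligence projects. Because Business Intelligence Development Studio is built as an IDE, you can create a complete solution while working offline. You can change your data mining objects as much as you want, but the changes are not reflected on the server until you deploy the project.

A SQL Server Analysis Services (SSAS) database integrates several technologies and serves as the foundation for mining models, OLAP, and related objects. You can use Business Intelligence Development Studio to create and modify an SSAS project and deploy it to one or more SSAS servers.

If you are developing an SSAS project, you can also use Business Intelligence Development Studio to connect directly to the database, so that your changes take effect in the database immediately.

SQL Server Management Studio

SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you work in a connected environment, where actions are propagated to the server as soon as you save your work.

After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed in Business Intelligence Development Studio. Using Business Intelligence Development Studio, you develop and test the data mining solution, using an iterative process to determine which models work best for a given situation. Once the developer is satisfied with the solution, it is deployed to an Analysis Services server. From this point, the focus shifts from development to maintenance and use, and thus to SQL Server Management Studio.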
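In SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing mining models and creating predictions from them.

Data Transformation Services

In SQL Server 2005, Data Transformation Services (DTS) comprises the extract, transform, and load (ETL) tools. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data and then use that data to train mining models. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.

DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy packages to a server and run them on a regular schedule. This is useful if, for example, you collect data weekly and want to perform the same cleaning transformations automatically each time.

You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution by adding each project to a solution in Business Intelligence Development Studio.

Mining Model Algorithms

Data mining algorithms are the foundation from which mining models are created. The variety of algorithms in SQL Server 2005 lets you perform many types of analysis. For more information about the algorithms and how to tune their parameters, see "Data Mining Algorithms" in SQL Server Books Online.

Decision Trees

The Decision Trees algorithm supports both classification and regression and works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.

In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then uses the input attributes with the strongest relationships to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen to cause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes create a split that provides an improved prediction over the existing node. The model seeks a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, thereby allowing you to predict its outcome.

Clustering

The Clustering algorithm uses iterative techniques to group records from a dataset into clusters with similar characteristics. Using these clusters, you can explore the data and learn more about the relationships that exist, which may not be easy to derive through casual observation. You can also create predictions from the clustering model created by the algorithm. For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the region. Observing how these clusters are distributed gives a better understanding of how the outcome of the predicted attribute is affected.

Naive Bayes

The Naive Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates the probability of each state of each input attribute given each state of the predictable attribute; these probabilities can later be used to predict the outcome of the predicted attribute from the known input attribute states. The probabilities used to generate the model are calculated and stored during cube processing. The algorithm supports only discrete or discretized attributes, and it treats all input attributes as independent.

The Naive Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used to build the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how the different input attributes are distributed across the states of the predicted attribute.

Time Series

The Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.

Using the algorithm, you can choose one or more variables to predict, but they must be continuous. You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a span of several months or years.

A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.

Neural Network

In Microsoft SQL Server 2005 Analysis Services, the Microsoft Neural Network algorithm creates classification and regression mining models by constructing a multilayer perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm provider, given each state of the predictable attribute, the algorithm calculates probabilities for each possible state of the input attribute. The algorithm provider processes the entire set of cases, iteratively comparing the predicted classification of the cases with their known actual classification. The errors from the initial classification in the first iteration over the entire set of cases are fed back into the network and used to modify the network's performance for the next iteration, and so on. You can later use these probabilities to predict an outcome for the predicted attribute, based on the input attributes. One of the main differences between this algorithm and the Microsoft Decision Trees algorithm, however, is that its learning process optimizes the network parameters toward minimizing the error, while the Microsoft Decision Trees algorithm splits rules in order to maximize information gain. The algorithm supports the prediction of both discrete and continuous attributes.

Linear Regression

The Linear Regression algorithm is a particular configuration of the Decision Trees algorithm, obtained by disabling splits (the whole regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes.

Logistic Regression

The Logistic Regression algorithm is a particular configuration of the Neural Network algorithm, obtained by eliminating the hidden layer. The algorithm supports the prediction of both discrete and continuous attributes.

English Original

Introduction to Data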
Mining

Abstract: Microsoft SQL Server 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios (targeted mailing, forecasting, market basket, and sequence clustering) to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools that are included in this release of SQL Server.

Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components in SQL Server 2005 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the
server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.

All of the data mining tools exist in the data mining editor. Using the editor, you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.

Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool, you can compare the predictive accuracy of your models and determine the best model.

To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see "Data Mining Extensions (DMX) Reference" in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.

Just as important as the tools that you use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you pass it and translates them into a mining model; it is the engine behind the process.

Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see "DTS Data Mining Tasks and Transformations" in SQL Server Books Online.

In order to demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with SQL Server 2005, and it supports OLAP and data mining functionality. To make the sample database available, you need to select it at installation time in the "Advanced" dialog for component selection.

Adventure Works

AdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations
is located in Bothell, Washington, with 500 employees, and several regional sales teams are located throughout their market base.

Adventure Works sells products wholesale to specialty shops and to individuals through the Internet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.

For more information on Adventure Works Cycles, see "Sample Databases and Business Scenarios" in SQL Server Books Online.

Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

North America (83%)
Europe (12%)
Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004.

The products in the database are broken down by subcategory, model, and product.

Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.

Working in an IDE is beneficial for the following reasons:

The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.

If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected to the server. In this way, changes are reflected directly on the server without having to deploy the solution.

SQL Server Management Studio

SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you are working in a connected environment where actions are propagated to the server as soon as you save your work.

After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed within Business Intelligence Development Studio. Using the Business Intelligence Development Studio tools, you develop and test the data mining solution, using an iterative process to determine which models work best for a given situation. When the developer is satisfied with the solution, it is deployed to an Analysis Services server. From this point, the focus shifts from development to maintenance and use, and thus to SQL Server Management Studio. Using SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing mining models and creating predictions from them.

Data Transformation Services

Data Transformation Services (DTS) comprises the Extract, Transform, and Load (ETL) tools in
SQL Server 2005. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data before using the data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.

DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy the packages to a server and run them on a regularly scheduled basis. This is useful if, for example, you collect data weekly and want to perform the same cleaning transformations each time in an automated fashion.

You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution, by adding each project to a solution in Business Intelligence Development Studio.

Mining Model Algorithms

Data mining algorithms are the foundation from which mining models are created. The variety of algorithms included in SQL Server 2005 allows you to perform many types of analysis. For more specific information about the algorithms and how they can be adjusted using parameters, see "Data Mining Algorithms" in SQL Server Books Online.

Microsoft Decision Trees

The Microsoft Decision Trees algorithm supports both classification and regression, and it works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.

In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then it uses the input attributes with the strongest relationship to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen to cause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes create a split that provides an improved prediction over the existing node. The model seeks to find a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, therefore allowing you to predict the outcome of the predicted attribute.

Microsoft Clustering

The Microsoft Clustering algorithm uses iterative techniques to group records from a dataset into clusters containing similar characteristics. Using these clusters, you can explore the data, learning more about the relationships that exist, which may not be easy to derive logically through casual observation. Additionally, you can create predictions from the clustering model created by the algorithm. For example, consider a group
of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the country. Observing how these clusters are distributed, you can better understand how the records in a dataset interact, as well as how that interaction affects the outcome of a predicted attribute.

Microsoft Naïve Bayes

The Microsoft Naïve Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent.

The Microsoft Naïve Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how the different input attributes are distributed across the different states of the predicted attribute.

Microsoft Time Series

The Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources.
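
As a rough illustration of the Naïve Bayes calculation described in the section above (a probability for each input state given each state of the predictable attribute, with all inputs treated as independent), here is a minimal Python sketch. The toy records, the attribute names (`commute`, `owns_car`, `buys_bike`), and the add-one smoothing are assumptions made for illustration only; this is not the Microsoft implementation and does not involve cube processing.

```python
from collections import Counter, defaultdict

def train(records, target):
    """Count target states and (attribute, state) pairs per target state."""
    class_counts = Counter(r[target] for r in records)
    cond_counts = defaultdict(Counter)  # target state -> Counter of (attribute, state)
    states = defaultdict(set)           # attribute -> set of observed states
    for r in records:
        for attr, value in r.items():
            if attr != target:
                cond_counts[r[target]][(attr, value)] += 1
                states[attr].add(value)
    return class_counts, cond_counts, states

def predict(model, inputs):
    """Return normalized probabilities for each state of the target attribute."""
    class_counts, cond_counts, states = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        p = n / total  # prior probability of this target state
        for attr, value in inputs.items():
            # P(value | cls) with add-one smoothing (an assumption, avoids zeros);
            # multiplying the terms is the independence assumption.
            p *= (cond_counts[cls][(attr, value)] + 1) / (n + len(states[attr]))
        scores[cls] = p
    norm = sum(scores.values())
    return {cls: p / norm for cls, p in scores.items()}

# Hypothetical toy data: discrete attributes only, as the algorithm requires.
records = [
    {"commute": "short", "owns_car": "no", "buys_bike": "yes"},
    {"commute": "short", "owns_car": "yes", "buys_bike": "yes"},
    {"commute": "long", "owns_car": "yes", "buys_bike": "no"},
    {"commute": "long", "owns_car": "no", "buys_bike": "no"},
    {"commute": "short", "owns_car": "no", "buys_bike": "yes"},
]
model = train(records, "buys_bike")
probs = predict(model, {"commute": "short", "owns_car": "no"})
print(probs)
```

Training only counts records, which is consistent with the text's point that the probabilities are computed once during processing and that predictions are then returned quickly.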