Data Mining Assignment 2

Question 1:
1. a) Compute the Information Gain for Gender, Car Type and Shirt Size.
b) Construct a decision tree with Information Gain.

Answer:
a) The class attribute takes two values, C0 and C1, each occurring 10 times among the 20 records, so the expected information (entropy) needed to classify a record is
Info(D) = -(10/20)log2(10/20) - (10/20)log2(10/20) = 1.

1. Splitting on Gender:
Info_Gender(D) = (10/20)[-(4/10)log2(4/10) - (6/10)log2(6/10)] + (10/20)[-(4/10)log2(4/10) - (6/10)log2(6/10)] = 0.971
Gain(Gender) = 1 - 0.971 = 0.029

2. Splitting on Car Type:
Info_CarType(D) = (4/20)[-(1/4)log2(1/4) - (3/4)log2(3/4)] + (8/20)[-(7/8)log2(7/8) - (1/8)log2(1/8)] + (8/20)[-(8/8)log2(8/8) - (0/8)log2(0/8)] ≈ 0.380
Gain(Car Type) = 1 - 0.380 = 0.620

3. Splitting on Shirt Size:
Info_ShirtSize(D) = (5/20)[-(3/5)log2(3/5) - (2/5)log2(2/5)] + (7/20)[-(4/7)log2(4/7) - (3/7)log2(3/7)] + (4/20)[-(2/4)log2(2/4) - (2/4)log2(2/4)] + (4/20)[-(2/4)log2(2/4) - (2/4)log2(2/4)] = 0.988
Gain(Shirt Size) = 1 - 0.988 = 0.012

b) Car Type gives the largest information gain of the three attributes, so it becomes the root of the decision tree (reconstructed from the figure):

Car Type?
    Sports -> C0
    Luxury -> C1
    Family -> Shirt Size?
        Small -> C0
        Medium / Large / Extra Large -> C1
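These gains can be checked with a few lines of Python. This is a verification sketch, not part of the original hand-in; the per-value class counts are read directly off the fractions used in the expressions above.

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(split, n_total, base_entropy):
    """Information gain of a split; `split` lists the class counts per attribute value."""
    expected = sum(sum(part) / n_total * entropy(part) for part in split)
    return base_entropy - expected

n = 20
info_d = entropy([10, 10])                       # 1.0

gender     = [[4, 6], [4, 6]]                    # class counts for each gender value
car_type   = [[1, 3], [7, 1], [8, 0]]            # the three car-type values (4, 8 and 8 records)
shirt_size = [[3, 2], [4, 3], [2, 2], [2, 2]]    # shirt-size values with 5, 7, 4 and 4 records

for name, split in [("Gender", gender), ("Car Type", car_type), ("Shirt Size", shirt_size)]:
    print(name, round(info_gain(split, n, info_d), 3))
# Gender 0.029, Car Type 0.62, Shirt Size 0.012
```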
Question 2:
2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.
(b) Using the neural network obtained above, show the weight values after one iteration of the back-propagation algorithm, given the training instance "(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.

Answer:
a) [Figure: a feed-forward network with nine input units (units 1-9, one per attribute value), two hidden units (units 10 and 11) and one output unit (unit 12); the drawing itself is not reproduced in the extracted text.]

b) Following (a), each input unit stands for one attribute value, and for the training instance (M, Family, Small) the inputs are:

Unit    X11(F)  X12(M)  X21(Family)  X22(Sports)  X23(Luxury)  X31(Small)  X32(Medium)  X33(Large)  X34(Extra Large)
Value     0       1         1            0            0            1           0            0             0

The initial weights and biases are generated randomly; here they are defined as:

W1,10  W1,11  W2,10  W2,11  W3,10  W3,11  W4,10  W4,11  W5,10  W5,11
 0.2    0.2   -0.2   -0.1    0.4    0.3   -0.2   -0.1    0.1   -0.1
W6,10  W6,11  W7,10  W7,11  W8,10  W8,11  W9,10  W9,11  W10,12 W11,12
 0.1   -0.2   -0.4    0.2    0.2    0.2   -0.1    0.3   -0.3   -0.1

Bias θj of units 10, 11, 12:  -0.2, 0.2, 0.3

Net input and output:

Unit j   Net input Ij   Output Oj
  10        0.1           0.52
  11        0.2           0.55
  12        0.089         0.48

Error of each node:

Unit j   Errj
  10      0.0089
  11      0.0030
  12     -0.12

Updated weights and biases:

W1,10  W1,11  W2,10  W2,11  W3,10  W3,11  W4,10  W4,11  W5,10  W5,11
0.201  0.198 -0.211 -0.099  0.4    0.308 -0.202 -0.098  0.101 -0.100
W6,10  W6,11  W7,10  W7,11  W8,10  W8,11  W9,10  W9,11  W10,12 W11,12
0.092 -0.211 -0.400  0.198  0.201  0.190 -0.110  0.300 -0.304 -0.099

Bias θj of units 10, 11, 12:  -0.287, 0.179, 0.344
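The forward pass above (net inputs 0.1, 0.2 and 0.089) can be reproduced with a short script. This is only a sketch: the learning rate did not survive extraction, so the value 0.9 below is an assumption, and the target class 0 is the one implied by Err12 = -0.12.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0, 1, 1, 0, 0, 1, 0, 0, 0], dtype=float)   # inputs X11..X34 for (M, Family, Small)
target = 0.0                                              # class implied by Err12 = -0.12
eta = 0.9                                                 # assumed learning rate (not stated above)

# w_in[i, j] is the weight from input unit i+1 to hidden unit 10+j (values from the table)
w_in = np.array([[ 0.2,  0.2], [-0.2, -0.1], [ 0.4,  0.3],
                 [-0.2, -0.1], [ 0.1, -0.1], [ 0.1, -0.2],
                 [-0.4,  0.2], [ 0.2,  0.2], [-0.1,  0.3]])
b_hidden = np.array([-0.2, 0.2])       # biases of hidden units 10 and 11
w_out = np.array([-0.3, -0.1])         # weights W10,12 and W11,12
b_out = 0.3                            # bias of output unit 12

# Forward pass
net_h = x @ w_in + b_hidden            # [0.1, 0.2]
o_h = sigmoid(net_h)                   # [0.52, 0.55]
net_o = o_h @ w_out + b_out            # 0.089
o_o = sigmoid(net_o)

# Error terms (output unit first, then hidden units)
err_o = o_o * (1 - o_o) * (target - o_o)
err_h = o_h * (1 - o_h) * err_o * w_out

# One gradient step on every weight and bias
w_out = w_out + eta * err_o * o_h
b_out = b_out + eta * err_o
w_in = w_in + eta * np.outer(x, err_h)
b_hidden = b_hidden + eta * err_h
print(net_h, net_o, err_o, err_h)
```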
Question 3:
3. a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

Answer:
a) Let A = {A1, A2}, where A1 denotes undergraduate students and A2 denotes graduate students, and let B denote smoking. From the problem statement,
P(B|A1) = 15%, P(B|A2) = 23%, P(A1) = 4/5, P(A2) = 1/5,
and the quantity asked for is P(A2|B). By Bayes' theorem,
P(A2|B) = P(B|A2)P(A2) / [P(B|A2)P(A2) + P(B|A1)P(A1)] = (0.23 × 0.2) / (0.23 × 0.2 + 0.15 × 0.8) = 0.046 / 0.166 ≈ 0.277.

b) From (a), a randomly chosen smoking college student is a graduate with probability 0.277 and an undergraduate with probability 0.723, so such a student is much more likely to be an undergraduate.

c) Let C denote living in a dorm; then P(C|A2) = 30% and P(C|A1) = 10%. Assuming dorm residence and smoking are independent within each group,
P(A2|B,C) = P(B|A2)P(C|A2)P(A2) / [P(B|A2)P(C|A2)P(A2) + P(B|A1)P(C|A1)P(A1)] = 0.0138 / (0.0138 + 0.0120) ≈ 0.535,
so from this result the student is slightly more likely to be a graduate student.
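Both posteriors are easy to re-evaluate from the given probabilities; the snippet below is a small check, not part of the original hand-in.

```python
# Posteriors for Question 3, computed from the probabilities given in the problem.
p_grad, p_ug = 0.2, 0.8
p_smoke = {"grad": 0.23, "ug": 0.15}
p_dorm = {"grad": 0.30, "ug": 0.10}

# (a) P(grad | smoke)
evidence = p_smoke["grad"] * p_grad + p_smoke["ug"] * p_ug      # 0.166
print(p_smoke["grad"] * p_grad / evidence)                      # ~0.277

# (c) P(grad | smoke, dorm), with dorm residence and smoking conditionally independent
num_grad = p_smoke["grad"] * p_dorm["grad"] * p_grad            # 0.0138
num_ug = p_smoke["ug"] * p_dorm["ug"] * p_ug                    # 0.0120
print(num_grad / (num_grad + num_ug))                           # ~0.535
```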
Question 4:
4. Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:
A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)
The distance function is Euclidean distance. Suppose initially we assign A1, B1, C1 as the center of each cluster, respectively. Use the K-Means algorithm to show only
(a) The three cluster centers after the first round of execution
(b) The final three clusters

Answer:
a) Round 1: squared Euclidean distance from each point to the three centers (squared distances give the same nearest center as the Euclidean distances themselves):

Point   A1(4,2,5)   B1(1,1,1)   C1(11,9,2)
A2         54           98          17
A3         41          101          62
B2         14            6         117
B3         33           93         122
C2         14           34         141
C3         30          100          93
C4         21           77          70

The three clusters are therefore {A1, A3, B3, C2, C3, C4}, {B1, B2} and {C1, A2}, and the new cluster centers are (4.5, 4.5, 6.83), (1.5, 2, 1.5) and (10.5, 7, 2).

Round 2: squared distances to the new cluster means (4.5, 4.5, 6.83), (1.5, 2, 1.5), (10.5, 7, 2):

Point   (4.5,4.5,6.83)   (1.5,2,1.5)   (10.5,7,2)
A1          9.861111         18.5         76.25
A2         53.86111          81.5          4.25
A3         12.52778          78.5         56.25
B1         58.52778           1.5        127.25
B2         31.86111           1.5         88.25
B3          9.194444         74.5        106.25
C1         85.86111         139.5          4.25
C2         13.19444          24.5        115.25
C3         32.52778          87.5         63.25
C4          2.527778         58.5         56.25

b) The new clusters are again {A1, A3, B3, C2, C3, C4}, {B1, B2} and {C1, A2}. Since they are identical to the clusters obtained after the first round, the assignment no longer changes and these clusters are the final result.
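The two rounds can also be reproduced with a few lines of NumPy. The sketch below (not part of the original hand-in) starts from A1, B1, C1 and stops once the assignment no longer changes.

```python
import numpy as np

points = {"A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
          "B1": (1, 1, 1), "B2": (2, 3, 2),  "B3": (3, 6, 9),
          "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7), "C4": (5, 6, 7)}
names = list(points)
X = np.array([points[n] for n in names], dtype=float)
centers = np.array([points["A1"], points["B1"], points["C1"]], dtype=float)

while True:
    # squared Euclidean distances give the same nearest center as Euclidean distances
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

for k in range(3):
    print([n for n, l in zip(names, labels) if l == k], centers[k].round(2))
# {A1, A3, B3, C2, C3, C4}, {B1, B2}, {A2, C1} with centers (4.5, 4.5, 6.83), (1.5, 2, 1.5), (10.5, 7, 2)
```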
Part II: Lab

Question 1. Assume this supermarket would like to promote milk. Use the data in “transactions” as training data to build a decision tree (C5.0 algorithm) model to predict whether the customer would buy milk or not.
1. Build a decision tree using data set “transactions” that predicts milk as a function of the other fields. Set the “type” of each field to “Flag”, set the “direction” of “milk” as “out”, set the “type” of COD as “Typeless”, select “Expert”, set the “pruning severity” to 65, and set the “minimum records per child branch” to 95. Hand-in: a figure showing your tree.
2. Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the “rollout” data to determine whether the customer would buy milk. Hand-in: your prediction for each of the 20 customers.
3. Hand-in: rules for positive (yes) prediction of milk purchase identified from the decision tree (up to the fifth level; the root is considered as level 1). Compare with the rules generated by Apriori in Homework 1, and submit your brief comments on the rules (e.g., pruning effect).
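The hand-in itself was produced with Clementine's C5.0 node. As a rough analogue only, the sketch below fits an entropy-based tree in Python; the file name transactions.csv, the column layout, and the use of min_samples_leaf to stand in for the “minimum records per child branch” setting are assumptions, and sklearn's CART has no equivalent of C5.0's pruning severity.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# "transactions" is assumed here to be a CSV with one 0/1 flag column per product plus COD.
data = pd.read_csv("transactions.csv")
X = data.drop(columns=["milk", "COD"])      # COD is typeless, so it is excluded from the inputs
y = data["milk"]                            # direction "out": the field being predicted

# min_samples_leaf mimics "minimum records per child branch = 95"
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=95, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```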
Answer:
1. The decision tree model generated by the C5.0 node:

juices = 1 Mode: 1
    water = 1 Mode: 1 => 1
    water = 0 Mode: 0
        pasta = 1 Mode: 1 => 1
        pasta = 0 Mode: 0
            tomato souce = 1 Mode: 1 => 1
            tomato souce = 0 Mode: 0
                biscuits = 1 Mode: 1 => 1
                biscuits = 0 Mode: 0 => 0
juices = 0 Mode: 0
    yoghurt = 1 Mode: 1
        water = 1 Mode: 1 => 1
        water = 0 Mode: 0
            biscuits = 1 Mode: 1 => 1
            biscuits = 0 Mode: 0
                brioches = 1 Mode: 1 => 1
                brioches = 0 Mode: 0
                    beer = 1 Mode: 1 => 1
                    beer = 0 Mode: 0 => 0
    yoghurt = 0 Mode: 0
        beer = 1 Mode: 0
            biscuits = 1 Mode: 1 => 1
            biscuits = 0 Mode: 0
                rice = 1 Mode: 1 => 1
                rice = 0 Mode: 0
                    coffee = 1 Mode: 1
                        water = 1 Mode: 1 => 1
                        water = 0 Mode: 0 => 0
                    coffee = 0 Mode: 0 => 0
        beer = 0 Mode: 0
            frozen vegetables = 1 Mode: 0
                biscuits = 1 Mode: 1
                    pasta = 1 Mode: 1 => 1
                    pasta = 0 Mode: 0 => 0
                biscuits = 0 Mode: 0
                    oil = 1 Mode: 1 => 1
                    oil = 0 Mode: 0
                        brioches = 1 Mode: 0
                            water = 1 Mode: 1 => 1
                            water = 0 Mode: 0 => 0
                        brioches = 0 Mode: 0 => 0
            frozen vegetables = 0 Mode: 0
                pasta = 1 Mode: 0
                    mozzarella = 1 Mode: 1 => 1
                    mozzarella = 0 Mode: 0
                        water = 1 Mode: 1
                            biscuits = 1 Mode: 1 => 1
                            biscuits = 0 Mode: 0
                                brioches = 1 Mode: 1 => 1
                                brioches = 0 Mode: 0
                                    coffee = 1 Mode: 1 => 1
                                    coffee = 0 Mode: 0 => 0
                        water = 0 Mode: 0
                            coke = 1 Mode: 0
                                coffee = 1 Mode: 1 => 1
                                coffee = 0 Mode: 0 => 0
                            coke = 0 Mode: 0 => 0
                pasta = 0 Mode: 0
                    water = 1 Mode: 0
                        coffee = 1 Mode: 1 => 1
                        coffee = 0 Mode: 0 => 0
                    water = 0 Mode: 1
                        rice = 1 Mode: 0 => 0
                        rice = 0 Mode: 1
                            tunny = 1 Mode: 0
                                biscuits = 1 Mode: 1 => 1
                                biscuits = 0 Mode: 0 => 0
                            tunny = 0 Mode: 1
                                brioches = 1 Mode: 0 => 0
                                brioches = 0 Mode: 1
                                    coke = 1 Mode: 0 => 0
                                    coke = 0 Mode: 1
                                        coffee = 1 Mode: 0 => 0
                                        coffee = 0 Mode: 1
                                            biscuits = 1 Mode: 0 => 0
                                            biscuits = 0 Mode: 1
                                                oil = 1 Mode: 0 => 0
                                                oil = 0 Mode: 1
                                                    tomato souce = 1 Mode: 0 => 0
                                                    tomato souce = 0 Mode: 1
                                                        mozzarella = 1 Mode: 0 => 0
                                                        mozzarella = 0 Mode: 1
                                                            crackers = 1 Mode: 0 => 0
                                                            crackers = 0 Mode: 1
                                                                frozen fish = 1 Mode: 0 => 0
                                                                frozen fish = 0 Mode: 1 => 1
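Item 3 of the hand-in asks for the positive (milk = 1) rules down to the fifth level. As an illustration of how such rules can be read off the tree mechanically, the sketch below encodes only the juices = 1 branch of the tree above as a hand-made nested structure (an assumption made for brevity; this is not Clementine output) and prints every path that ends in a positive prediction.

```python
# A node is either a leaf prediction (0 or 1) or a triple (attribute, subtree_if_1, subtree_if_0).
juices_1_branch = ("water", 1,
                   ("pasta", 1,
                    ("tomato souce", 1,
                     ("biscuits", 1, 0))))

def positive_rules(node, path=("juices = 1",), max_level=5):
    if isinstance(node, int):                                   # leaf reached
        if node == 1:
            yield " and ".join(path) + "  =>  milk = 1"
        return
    attr, if_one, if_zero = node
    if len(path) + 1 > max_level:                               # this split would lie below level 5
        return
    yield from positive_rules(if_one, path + (f"{attr} = 1",), max_level)
    yield from positive_rules(if_zero, path + (f"{attr} = 0",), max_level)

for rule in positive_rules(juices_1_branch):
    print(rule)
# juices = 1 and water = 1  =>  milk = 1, juices = 1 and water = 0 and pasta = 1  =>  milk = 1, ...
```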
2. Predictions for the 20 customers in the “rollout” data, made with the decision tree generated in step 1: [the prediction table was submitted as a screenshot and is not reproduced in the extracted text.]
3. The generated association rules used for the comparison: [not reproduced in the extracted text.]

Question 2: