Part I: Written Assignment (HW2)
Due Date: Nov. 23

1. a) Compute the Information Gain for Gender, Car Type and Shirt Size.

There are two classes, C0 and C1, each holding 10 of the 20 records, so

Info(D) = -(10/20)log2(10/20) - (10/20)log2(10/20) = 1

Gender (M: 6 C0 and 4 C1; F: 4 C0 and 6 C1):

Info_Gender(D) = (10/20)[-(6/10)log2(6/10) - (4/10)log2(4/10)] + (10/20)[-(4/10)log2(4/10) - (6/10)log2(6/10)] = 0.971
Gain(Gender) = Info(D) - Info_Gender(D) = 1 - 0.971 = 0.029

Car Type (Family: 1 C0 and 3 C1; Sports: 8 C0 and 0 C1; Luxury: 1 C0 and 7 C1):

Info_CarType(D) = (4/20)[-(1/4)log2(1/4) - (3/4)log2(3/4)] + (8/20)*0 + (8/20)[-(1/8)log2(1/8) - (7/8)log2(7/8)] = 0.3797
Gain(Car Type) = 1 - 0.3797 = 0.6203

Shirt Size (Small: 3 C0 and 2 C1; Medium: 3 C0 and 4 C1; Large: 2 C0 and 2 C1; Extra Large: 2 C0 and 2 C1):

Info_ShirtSize(D) = (5/20)[-(3/5)log2(3/5) - (2/5)log2(2/5)] + (7/20)[-(3/7)log2(3/7) - (4/7)log2(4/7)] + (4/20)*1 + (4/20)*1 = 0.9876
Gain(Shirt Size) = 1 - 0.9876 = 0.0124

b) Construct a decision tree with Information Gain.

By a), Car Type has the largest information gain, so it is chosen as the first splitting attribute, with branches Luxury, Family and Sports (the Sports branch contains only C0 records, so it needs no further splitting).

Splitting the Luxury branch (1 C0, 7 C1):
Info(D_Luxury) = -(1/8)log2(1/8) - (7/8)log2(7/8) = 0.5436
Info_Gender(D_Luxury) = (1/8)*0 + (7/8)[-(1/7)log2(1/7) - (6/7)log2(6/7)] = 0.5177, so Gain(Gender) = 0.5436 - 0.5177 = 0.0259
Info_ShirtSize(D_Luxury) = (2/8)*0 + (3/8)*0 + (2/8)*1 + (1/8)*0 = 0.25, so Gain(Shirt Size) = 0.5436 - 0.25 = 0.2936
So Shirt Size is selected as the splitting attribute here.

Splitting the Family branch (1 C0, 3 C1):
Info(D_Family) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.811
Gain(Gender) = 0.811 - 0.811 = 0 (every Family record is male)
Gain(Shirt Size) = 0.811 - 0 = 0.811
So Shirt Size is selected as the splitting attribute here as well.

The resulting decision tree:

[Figure: decision tree. Root: Car Type. Sports branch -> C0. Family branch -> Shirt Size: Small -> C0, other sizes -> C1. Luxury branch -> Shirt Size: Large -> C0, other sizes -> C1.]

A short Python sketch of these gain computations follows.
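For reference, a minimal Python sketch of the entropy and gain computations above (the helper names are mine; the per-value class counts come from the computations above):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """Information gain of a split; partitions lists the class counts per branch."""
    total = sum(parent_counts)
    split_info = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - split_info

# class counts (C0, C1) per attribute value, as in part a)
print(info_gain([10, 10], [[6, 4], [4, 6]]))                  # Gender     -> 0.029
print(info_gain([10, 10], [[1, 3], [8, 0], [1, 7]]))          # Car Type   -> 0.6203
print(info_gain([10, 10], [[3, 2], [3, 4], [2, 2], [2, 2]]))  # Shirt Size -> 0.0124
```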
2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.

From the attributes, the input layer needs 8 nodes:
x1: Gender (Gender = M: x1 = 1; Gender = F: x1 = 0)
x2: Car Type = Sports (Y = 1; N = 0)
x3: Car Type = Family (Y = 1; N = 0)
x4: Car Type = Luxury (Y = 1; N = 0)
x5: Shirt Size = Small (Y = 1; N = 0)
x6: Shirt Size = Medium (Y = 1; N = 0)
x7: Shirt Size = Large (Y = 1; N = 0)
x8: Shirt Size = Extra Large (Y = 1; N = 0)

The hidden layer has three nodes: x9, x10 and x11. Since this is a two-class problem, the output layer has a single node, x12 (C0 = 1; C1 = 0).

[Figure: the network, with input layer x1-x8, hidden layer x9-x11 and output node x12. Wij denotes the weight from input node i to hidden node j; to simplify the computation, the weights from a given input node i to nodes 9, 10 and 11 are initialized identically. Wi-12 denotes the weight from hidden node i to the output node.]

A sketch of this input encoding is given below.
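A minimal sketch of the 8-node input encoding described above (the function and constant names are illustrative, not from the original):

```python
CAR_TYPES = ["Sports", "Family", "Luxury"]
SHIRT_SIZES = ["Small", "Medium", "Large", "Extra Large"]

def encode(gender, car_type, shirt_size):
    """Map a (Gender, Car Type, Shirt Size) record to the input vector x1..x8."""
    x = [1 if gender == "M" else 0]
    x += [1 if car_type == c else 0 for c in CAR_TYPES]
    x += [1 if shirt_size == s else 0 for s in SHIRT_SIZES]
    return x

print(encode("M", "Family", "Small"))  # [1, 0, 1, 0, 1, 0, 0, 0], the tuple used in part c)
```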
c) Using the neural network obtained above, show the weight values after one iteration of the backpropagation algorithm, given the training instance (M, Family, Small). Indicate your initial weight values and biases and the learning rate used.

The instance (M, Family, Small) has class label C0, so its training tuple is (1, 0, 1, 0, 1, 0, 0, 0) with target output T = 1.

Table 1: initial input, weights, biases and learning rate
x1, ..., x8 = 1, 0, 1, 0, 1, 0, 0, 0
W1j = 0.1, W2j = 0.2, W3j = 0.1, W4j = 0.2, W5j = 0.1, W6j = 0.2, W7j = 0.3, W8j = -0.1 (identical for j = 9, 10, 11)
W9-12 = 0.1, W10-12 = 0.2, W11-12 = -0.1
θ9 = 0.1, θ10 = 0.1, θ11 = -0.1, θ12 = 0.2
learning rate l = 0.9

Table 2: net input and net output
Unit j | Net input Ij                                          | Output Oj
9      | 1*0.1 + 1*0.1 + 1*0.1 + 0.1 = 0.4                     | 1/(1 + e^-0.4) = 0.599
10     | 1*0.1 + 1*0.1 + 1*0.1 + 0.1 = 0.4                     | 1/(1 + e^-0.4) = 0.599
11     | 1*0.1 + 1*0.1 + 1*0.1 - 0.1 = 0.2                     | 1/(1 + e^-0.2) = 0.550
12     | 0.599*0.1 + 0.599*0.2 + 0.550*(-0.1) + 0.2 = 0.3247   | 1/(1 + e^-0.3247) = 0.5805

Table 3: error of each unit
Unit j | Errj
12     | O12(1 - O12)(T - O12) = 0.5805*(1 - 0.5805)*(1 - 0.5805) = 0.1022
11     | O11(1 - O11)*Err12*W11-12 = 0.550*(1 - 0.550)*0.1022*(-0.1) = -0.00253
10     | O10(1 - O10)*Err12*W10-12 = 0.599*(1 - 0.599)*0.1022*0.2 = 0.00491
9      | O9(1 - O9)*Err12*W9-12 = 0.599*(1 - 0.599)*0.1022*0.1 = 0.00246

Table 4: weight and bias updates (Wij += l*Errj*Oi, θj += l*Errj; values rounded to four decimals)
W19 = 0.1 + 0.9*0.00246*1 = 0.1022
W1-10 = 0.1 + 0.9*0.00491*1 = 0.1044
W1-11 = 0.1 + 0.9*(-0.00253)*1 = 0.0977
W39, W3-10, W3-11 and W59, W5-10, W5-11: same updates as W19, W1-10, W1-11 (since x3 = x5 = 1)
W2j, W4j, W6j, W7j, W8j: unchanged (their inputs are 0)
W9-12 = 0.1 + 0.9*0.1022*0.599 = 0.1551
W10-12 = 0.2 + 0.9*0.1022*0.599 = 0.2551
W11-12 = -0.1 + 0.9*0.1022*0.550 = -0.0494
θ9 = 0.1 + 0.9*0.00246 = 0.1022
θ10 = 0.1 + 0.9*0.00491 = 0.1044
θ11 = -0.1 + 0.9*(-0.00253) = -0.1023
θ12 = 0.2 + 0.9*0.1022 = 0.2920

The following sketch reproduces this single iteration.
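A short Python sketch of the forward and backward pass, using the Table 1 values (variable names are mine; the printed values match the tables up to rounding):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = [1, 0, 1, 0, 1, 0, 0, 0]            # (M, Family, Small), one-hot encoded
t, lr = 1.0, 0.9                         # target output (class C0) and learning rate

# initial weights and biases from Table 1; hidden columns 0,1,2 stand for units 9,10,11
w_ih = [[w] * 3 for w in [0.1, 0.2, 0.1, 0.2, 0.1, 0.2, 0.3, -0.1]]
w_ho = [0.1, 0.2, -0.1]                  # units 9,10,11 -> unit 12
th_h, th_o = [0.1, 0.1, -0.1], 0.2       # biases

# forward pass (Table 2)
O_h = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(8)) + th_h[j]) for j in range(3)]
O_o = sigmoid(sum(O_h[j] * w_ho[j] for j in range(3)) + th_o)

# error terms (Table 3), computed before any weight changes
err_o = O_o * (1 - O_o) * (t - O_o)
err_h = [O_h[j] * (1 - O_h[j]) * err_o * w_ho[j] for j in range(3)]

# updates (Table 4): w += lr * err_downstream * output_upstream, bias += lr * err
for j in range(3):
    for i in range(8):
        w_ih[i][j] += lr * err_h[j] * x[i]
    w_ho[j] += lr * err_o * O_h[j]
    th_h[j] += lr * err_h[j]
th_o += lr * err_o

print(O_h, O_o)      # ~[0.599, 0.599, 0.550], ~0.580
print(err_o, err_h)  # ~0.1022, ~[0.00246, 0.00491, -0.00253]
```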
3. a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?

Write U for undergraduate student, G for graduate student, and S for smoking. Then P(S|U) = 0.15, P(S|G) = 0.23, P(G) = 0.2 and P(U) = 0.8. By Bayes' theorem,

P(G|S) = P(S|G)P(G) / P(S) = P(S|G)P(G) / [P(S|U)P(U) + P(S|G)P(G)] = (0.23 * 0.2) / (0.15 * 0.8 + 0.23 * 0.2) = 0.046 / 0.166 = 0.277

b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?

Since P(U) = 0.8 > P(G) = 0.2, a randomly chosen student is more likely to be an undergraduate student.

c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

Write D for living in a dorm; then P(D|U) = 0.1 and P(D|G) = 0.3. Using the independence assumption,

P(G|D∧S) * P(D∧S) = P(D∧S|G) * P(G) = P(D|G) * P(S|G) * P(G) = 0.3 * 0.23 * 0.2 = 0.0138
P(U|D∧S) * P(D∧S) = P(D∧S|U) * P(U) = P(D|U) * P(S|U) * P(U) = 0.1 * 0.15 * 0.8 = 0.012

Since 0.0138 > 0.012, P(G|D∧S) > P(U|D∧S), so the student is more likely to be a graduate student. This arithmetic is checked in the sketch below.
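A few lines of Python to verify the Q3 computations (variable names are mine):

```python
# probabilities from the problem statement
p_s_u, p_s_g = 0.15, 0.23   # P(S|U), P(S|G)
p_g, p_u = 0.2, 0.8         # P(G), P(U)

# a) P(G|S) by Bayes' theorem
p_g_s = p_s_g * p_g / (p_s_u * p_u + p_s_g * p_g)
print(round(p_g_s, 3))      # 0.277

# c) compare the joint probabilities P(D, S, G) vs P(D, S, U),
#    assuming D and S are independent given the student type
p_d_u, p_d_g = 0.1, 0.3     # P(D|U), P(D|G)
print(p_d_g * p_s_g * p_g)  # 0.0138
print(p_d_u * p_s_u * p_u)  # 0.012 -> graduate is more likely
```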
4. (a) The three cluster centers after the first round of execution.

Round 1: the initial centers are A1(4, 2, 5), B1(1, 1, 1) and C1(11, 9, 2).

[Table 1: distance of every point to the initial centers, where piA1, piB1 and piC1 denote the distance of point pi to A1, B1 and C1 respectively; the coordinate and distance columns are garbled in the source.]

Assigning each point to its nearest center gives:
Cluster 1: A1, A3, B3, C3, C4
Cluster 2: B1, B2
Cluster 3: C1, A2

(b) The final three clusters.

Round 2: recompute the mean of each cluster:
Cluster 1: M1(5.2, 4.4, 7.2)
Cluster 2: M2(1.5, 2, 1.5)
Cluster 3: M3(10.5, 7, 2)

[Table 2: distance of every point to M1, M2 and M3; columns garbled in the source.]

Re-assigning each point to its nearest mean gives:
Cluster 1: A1, A3, B3, C3, C4
Cluster 2: B1, B2
Cluster 3: C1, A2

Analysis: the second-round clustering is identical to the first round, so the algorithm stops. A minimal k-means sketch is given below.
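The point coordinates in the source tables are unrecoverable, so the `points` list in this sketch is a hypothetical stand-in; only the three initial centers are taken from the solution above:

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans_round(points, centers):
    """One k-means step: assign each point to its nearest center, then recompute means."""
    clusters = [[] for _ in centers]
    for p in points:
        clusters[min(range(len(centers)), key=lambda j: dist(p, centers[j]))].append(p)
    new_centers = [tuple(sum(c) / len(c) for c in zip(*cl)) for cl in clusters]
    return clusters, new_centers

centers = [(4, 2, 5), (1, 1, 1), (11, 9, 2)]   # A1, B1, C1 from the solution
points = [(4, 2, 5), (9, 8, 3), (6, 1, 8),     # hypothetical coordinates
          (1, 1, 1), (2, 3, 2), (5, 6, 7),
          (11, 9, 2), (7, 4, 6), (4, 7, 9), (6, 5, 8)]

clusters, centers = kmeans_round(points, centers)   # round 1
clusters, centers = kmeans_round(points, centers)   # round 2
# iterate until the assignment no longer changes, as in rounds 1-2 above
```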
Part II: Lab

Question 1

1. Build a decision tree using data set "transactions" that predicts milk as a function of the other fields. Set the "type" of each field to "Flag", set the "direction" of "milk" as "out", set the "type" of COD as "Typeless", select "Expert" and set the "pruning severity" to 65, and set the "minimum records per child branch" to be 95. Hand-in: A figure showing your tree.

[Figure: the decision tree built by Clementine on the transactions data set; the screenshot is not recoverable from the source.]
22、tomer would buy milk. Hand-in: your prediction for each of the 20 customers.JEle J f咄加沁fh ttA陌b弼佛brwfpi flrjur -rar 龙岡旅 kn 俐 w 驱:朋活:n hi? fsn in :和 rb 丁制啊 K-nk11d ofld oQD 0J0J0:11DQ0D【即Qi:i1 c)d aQ)11D)01okliifli i11 l)1)(11P1)(1ib1iifl0 01i 01)(11 hD(1Elliid1 c)1 flJ)J(1 1)D1liltG00001 0Q :T10M :0
23、Q0BDq o(i00 0dD0和:)000洌i0i d0D fl0 01DD1#91(i01 11 0110101引Dfr1(n100 0011D011则ii0(i0 0 a:n:vT JdieiD4 00u t1:I 0JQQJT:D00I5J51i01a iDD 0J0JDri:0Q01肌i01i o)i aJ)11D)(1即i_n:L?0a i)i 001(i)D)(11训1i01a q)i a1D(d ih)(11训V0i0Qa iD0 aJ11(in:0)D1IE1Q :0QQ1 QTd10M :D101创1q o01 0d0-4:00)1诃0(dQ0:J0 0iDfl01腑由程序
24、运行的结果可知:customer(2,3,4,5,9,10,13,14,17,18) 会购买 Milk 。3. Hand-in: rules for positive (yes) prediction of milk purchase identified from the decision tree(up to the fifth level. The root is considered as level 1). Compare with the rules generated byApriori in Homework 1, and submit your brief comments
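Clementine is a GUI tool, so there is no script to hand in, but an approximate scikit-learn analogue of steps 1-2 might look like the sketch below. The file and column names are assumptions, and the pruning controls only roughly correspond to Clementine's settings:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# hypothetical file and column names mirroring the lab data sets
transactions = pd.read_csv("transactions.csv")
rollout = pd.read_csv("rollout.csv")

# exclude the target and the typeless COD field from the inputs
features = [c for c in transactions.columns if c not in ("milk", "COD")]

# min_samples_leaf plays the role of "minimum records per child branch";
# ccp_alpha is only a loose stand-in for Clementine's "pruning severity"
tree = DecisionTreeClassifier(min_samples_leaf=95, ccp_alpha=0.001)
tree.fit(transactions[features], transactions["milk"])

pred = tree.predict(rollout[features])
print([i + 1 for i, p in enumerate(pred) if p == 1])  # customers predicted to buy milk
```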
3. Hand-in: rules for positive (yes) prediction of milk purchase identified from the decision tree (up to the fifth level; the root is considered as level 1). Compare with the rules generated by Apriori in Homework 1, and submit your brief comments on the rules (e.g., pruning effect).

Association rules derived from the decision tree:

[Figure: the Clementine rule browser for the milk tree; the screenshot is garbled in the source. The rules are tabulated below.]

Table 1: rules for milk derived from the decision tree
Consequent | Antecedent 1       | Antecedent 2
milk       | juices             |
milk       | juices             | water
milk       | pasta              |
milk       | juices             | pasta
milk       | tomato sauce       |
milk       | juices             | tomato sauce
milk       | biscuits           |
milk       | juices             | biscuits
milk       | yoghurt            |
milk       | yoghurt            | water
milk       | yoghurt            | biscuits
milk       | brioches           |
milk       | yoghurt            | brioches
milk       | beer               |
milk       | beer               | biscuits
milk       | rice               |
milk       | beer               | rice
milk       | frozen vegetables  |
milk       | frozen vegetables  | biscuits

Table 2: rules generated by Apriori (from Homework 1)
Consequent | Antecedent 1  | Antecedent 2
milk       | biscuits      | water
milk       | yoghurt       | pasta
milk       | biscuits      | pasta
milk       | brioches      | pasta
milk       | water         | tomato sauce
milk       | pasta         |
milk       | water         | pasta
milk       | tomato sauce  | pasta
milk       | juices        | tomato sauce
milk       | biscuits      |
milk       | tomato sauce  |
(one further row, apparently involving yoghurt, pasta and rice, is garbled in the source)

The rules derived from the decision tree are essentially similar to the rules generated by Apriori. Some rules are missing from the decision tree because they would appear below the sixth and seventh levels, where the tree was pruned. One way to read the tree rules off programmatically is sketched below.
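Continuing the hypothetical scikit-learn sketch above, the tree's paths can be dumped with `export_text` and the positive-leaf paths read off as rules:

```python
from sklearn.tree import export_text

# print the tree down to level 5 (the root is level 1, so at most 4 splits deep)
print(export_text(tree, feature_names=features, max_depth=4))
# each path ending in "class: 1" is a rule for a positive milk prediction,
# e.g. "juices = 1 and water = 1 -> milk = 1"
```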
Question 2: Churn Management

1. Perform decision tree classification on the training data set. Select all the input variables except state, area_code, and phone_number (since they are only informative for this analysis). Set the "Direction" of class as "out" and its "type" as "Flag". Then, specify the "minimum records per child branch" as 40 and the "pruning severity" as 70, and click "use global pruning". Hand-in: the confusion matrices for validation data.

[Figure: Clementine stream with churn training.txt and churn validation.txt, and the matrix of class by the predicted class for the validation data; the screenshot is garbled and the source document is cut off at this point.]