




Slide 1: Classification with Decision Trees
Instructor: Qiang Yang, Hong Kong University of Science and Technology (qyang@cs.ust.hk)
Thanks: Eibe Frank and Jiawei Han

Slide 2: DECISION TREE (Quinlan '93)
- An internal node represents a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
- A new case is classified by following a matching path down to a leaf node.

Slide 3: Training Set
Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Slide 4: Example
[Figure: the decision tree learned from the training set. Outlook is tested at the root: the sunny branch tests Humidity (high -> N, normal -> P), the overcast branch is a P leaf, and the rain branch tests Windy (true -> N, false -> P).]

Slide 5: Building Decision Tree (Quinlan '93)
- Top-down tree construction (see the sketch after Slide 6):
  - At the start, all training examples are at the root.
  - Partition the examples recursively by choosing one attribute at a time.
- Bottom-up tree pruning:
  - Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

Slide 6: Choosing the Splitting Attribute
- At each node, the available attributes are evaluated on how well they separate the classes of the training examples. A goodness function is used for this purpose.
- Typical goodness functions:
  - information gain (ID3/C4.5)
  - information gain ratio
  - gini index
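The top-down construction on Slide 5 can be written as a short recursive procedure. Below is a minimal Python sketch, not code from the lecture: the goodness function from Slide 6 is passed in as a parameter, the dataset is assumed to be a list of (attribute-dict, label) pairs, and all names are my own.

```python
from collections import Counter

def majority_label(examples):
    """Most frequent class label among (attributes, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def build_tree(examples, attributes, goodness):
    """Top-down construction: pick the best attribute, split, and recurse.

    examples   : list of (dict_of_attribute_values, class_label)
    attributes : list of attribute names still available for testing
    goodness   : function(examples, attribute) -> score; higher is better
    """
    labels = {label for _, label in examples}
    if len(labels) == 1:                # all cases have the same class
        return labels.pop()
    if not attributes:                  # no attribute left to test
        return majority_label(examples)

    best = max(attributes, key=lambda a: goodness(examples, a))
    tree = {"attribute": best, "branches": {}}
    for value in {x[best] for x, _ in examples}:   # one branch per observed value
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = build_tree(subset, remaining, goodness)
    return tree
```

Any of the goodness functions listed on Slide 6 (information gain, gain ratio, gini) can be plugged in; bottom-up pruning would then be applied to the returned tree.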
Slide 7: Which attribute to select?

Slide 8: A criterion for attribute selection
- Which is the best attribute?
  - The one that will result in the smallest tree.
  - Heuristic: choose the attribute that produces the "purest" nodes.
- Popular impurity criterion: information gain.
  - Information gain increases with the average purity of the subsets that an attribute produces.
- Strategy: choose the attribute that results in the greatest information gain.

Slide 9: Computing information
- Information is measured in bits.
  - Given a probability distribution, the information required to predict an event is the distribution's entropy.
  - Entropy gives the information required in bits (this can involve fractions of a bit!).
- Formula for computing the entropy (see the sketch below):
  - Suppose a set S takes n values V1, V2, ..., Vn, where value Vi has proportion pi. Then
    entropy(p1, p2, ..., pn) = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn)
  - E.g., the weather data has 2 values, Play=P and Play=N, so p1 = 9/14 and p2 = 5/14.
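As a quick check of the formula, here is a direct transcription into Python (a sketch, not the lecture's own code; the function name and the convention that 0*log2(0) counts as 0 are mine):

```python
import math

def entropy(*proportions):
    """entropy(p1, ..., pn) = -sum_i pi * log2(pi), with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# The weather data: 9 Play=P and 5 Play=N examples out of 14.
print(entropy(9/14, 5/14))   # about 0.940 bits
```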
Slide 10: Example: attribute "Outlook"
- "Outlook" = "Sunny":
  info([2,3]) = entropy(2/5, 3/5) = -2/5*log2(2/5) - 3/5*log2(3/5) = 0.971 bits
- "Outlook" = "Overcast":
  info([4,0]) = entropy(1, 0) = -1*log2(1) - 0*log2(0) = 0 bits
  (Note: 0*log2(0) is normally not defined.)
- "Outlook" = "Rainy":
  info([3,2]) = entropy(3/5, 2/5) = -3/5*log2(3/5) - 2/5*log2(2/5) = 0.971 bits
- Expected information for the attribute:
  info([2,3], [4,0], [3,2]) = (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits

Slide 11: Computing the information gain
- Information gain = information before splitting - information after splitting:
  gain(Outlook) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes of the weather data (see the sketch below):
  gain(Outlook)     = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity)    = 0.152 bits
  gain(Windy)       = 0.048 bits
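The gains on Slide 11 can be reproduced from the training set on Slide 3. The sketch below is illustrative rather than the lecture's own code: the data encoding and the helper names (DATA, ATTRS, info, gain) are assumptions of mine.

```python
import math
from collections import Counter

# The 14 training examples from Slide 3: (Outlook, Temperature, Humidity, Windy, Class).
DATA = [
    ("sunny","hot","high","false","N"), ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"), ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"), ("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"), ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"), ("rain","mild","high","true","N"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def info(examples):
    """Entropy of the class distribution of a set of examples, in bits."""
    counts = Counter(row[-1] for row in examples)
    total = sum(counts.values())
    return -sum(c/total * math.log2(c/total) for c in counts.values())

def gain(examples, attr):
    """Information gain = info before splitting - weighted info after splitting."""
    col = ATTRS[attr]
    subsets = {}
    for row in examples:
        subsets.setdefault(row[col], []).append(row)
    after = sum(len(s)/len(examples) * info(s) for s in subsets.values())
    return info(examples) - after

for a in ATTRS:
    print(a, round(gain(DATA, a), 3))   # Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```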
Slide 12: Continuing to split (within the "Outlook = sunny" branch)
  gain(Temperature) = 0.571 bits
  gain(Humidity)    = 0.971 bits
  gain(Windy)       = 0.020 bits

Slide 13: The final decision tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes.
  - Splitting stops when the data can't be split any further.

Slide 14: Highly-branching attributes
- Problematic: attributes with a large number of values (extreme case: an ID code).
- Subsets are more likely to be pure if there is a large number of values:
  - Information gain is biased towards choosing attributes with a large number of values.
  - This may result in overfitting (selection of an attribute that is non-optimal for prediction).
- Another problem: fragmentation.

Slide 15: The gain ratio
- Gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes.
- Gain ratio takes the number and size of branches into account when choosing an attribute:
  - It corrects the information gain by taking the intrinsic information of a split into account.
  - Also called split ratio.
- Intrinsic information: the entropy of the distribution of instances into branches
  (i.e. how much information do we need to tell which branch an instance belongs to).

Slide 16: Gain Ratio
  IntrinsicInfo(S, A) = -sum over branches i of |Si|/|S| * log2(|Si|/|S|)
  GainRatio(S, A)     = Gain(S, A) / IntrinsicInfo(S, A)
- IntrinsicInfo should be:
  - large when the data is evenly spread over all branches,
  - small when all the data belong to one branch.
- The gain ratio (Quinlan '86) normalizes the information gain by IntrinsicInfo.

Slide 17: Computing the gain ratio
- Example: intrinsic information for the ID code attribute:
  intrinsic_info([1,1,...,1]) = 14 * (-1/14 * log2(1/14)) = 3.807 bits
- The importance of an attribute decreases as its intrinsic information gets larger.
- Gain ratio:
  gain_ratio(Attribute) = gain(Attribute) / intrinsic_info(Attribute)
- Example:
  gain_ratio(ID_code) = 0.940 bits / 3.807 bits = 0.246

Slide 18: Gain ratios for the weather data (see the sketch below)
Attribute    Info   Gain                 Split info            Gain ratio
Outlook      0.693  0.940-0.693 = 0.247  info([5,4,5]) = 1.577  0.247/1.577 = 0.156
Temperature  0.911  0.940-0.911 = 0.029  info([4,6,4]) = 1.557  0.029/1.557 = 0.019
Humidity     0.788  0.940-0.788 = 0.152  info([7,7])   = 1.000  0.152/1.000 = 0.152
Windy        0.892  0.940-0.892 = 0.048  info([8,6])   = 0.985  0.048/0.985 = 0.049
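The split information only needs the branch sizes, and the gain ratio divides the gains from Slide 11 by it. Below is a self-contained sketch (split_info and the hard-coded dictionaries are my own; small differences from the table come from using the rounded gains):

```python
import math

def split_info(sizes):
    """IntrinsicInfo: entropy of the distribution of instances over branches."""
    total = sum(sizes)
    return -sum(n/total * math.log2(n/total) for n in sizes if n)

# Branch sizes and information gains for the weather data (Slides 11 and 18).
branches = {"Outlook": [5, 4, 5], "Temperature": [4, 6, 4],
            "Humidity": [7, 7], "Windy": [8, 6]}
gains = {"Outlook": 0.247, "Temperature": 0.029,
         "Humidity": 0.152, "Windy": 0.048}

for attr, sizes in branches.items():
    si = split_info(sizes)
    print(attr, round(si, 3), round(gains[attr] / si, 3))
# Outlook 1.577 0.157, Temperature 1.557 0.019, Humidity 1.0 0.152, Windy 0.985 0.049
```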
Slide 19: More on the gain ratio
- "Outlook" still comes out top.
- However, "ID code" has a greater gain ratio.
  - Standard fix: an ad hoc test to prevent splitting on that type of attribute.
- Problem with the gain ratio: it may overcompensate.
  - It may choose an attribute just because its intrinsic information is very low.
  - Standard fix:
    - First, only consider attributes with greater than average information gain.
    - Then, compare them on gain ratio.

Slide 20: Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
    gini(T) = 1 - sum over j of pj^2
  where pj is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.
- After splitting T into two subsets T1 and T2 with sizes N1 and N2, the gini index of the split data is defined as
    gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
- The attribute providing the smallest gini_split(T) is chosen to split the node (see the sketch below).
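A short sketch of the two formulas on Slide 20 (function names are mine); as an illustration it scores the Humidity split of the weather data, whose class counts come from Slide 3:

```python
def gini(counts):
    """gini(T) = 1 - sum_j pj^2, from the class counts in T."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(subsets):
    """Weighted gini of the subsets produced by a split: sum_i (Ni/N) * gini(Ti)."""
    total = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / total * gini(counts) for counts in subsets)

# Weather data: 9 P and 5 N overall; splitting on Humidity gives
# high = (3 P, 4 N) and normal = (6 P, 1 N), counts taken from Slide 3.
print(gini([9, 5]))                      # about 0.459 before the split
print(gini_split([[3, 4], [6, 1]]))      # about 0.367 after splitting on Humidity
```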
Slide 21: Discussion
- Consider the following variations of decision trees.

Slide 22: 1. Apply KNN to each leaf node
- Instead of choosing the majority class label at a leaf, use KNN to choose the class label.

Slide 23: 2. Apply Naive Bayesian at each leaf node
- For each leaf node, use all the available information we know about the test case to make the decision.
- Instead of using the majority rule, use probability/likelihood to make the decision.

Slide 24: 3. Use error rates instead of entropy
- If a node has N1 examples of the positive class P and N2 examples of the negative class N:
  - If N1 > N2, then choose P.
  - The error rate at this node is N2/(N1+N2).
- The expected error at a parent node can be calculated as the weighted sum of the error rates at its child nodes (see the sketch below).
  - The weights are the proportions of the training data in each child.
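A minimal sketch of the error-rate criterion on Slide 24 (function names are mine); the example reuses the three child-node class counts [2,3], [4,0], [3,2] produced by splitting on Outlook (Slide 10):

```python
def error_rate(n1, n2):
    """Error rate of a node labeled with its majority class."""
    return min(n1, n2) / (n1 + n2)

def expected_error(children):
    """Weighted sum of child error rates; weights are the child proportions.

    children: list of (N1, N2) class counts, one pair per child node.
    """
    total = sum(n1 + n2 for n1, n2 in children)
    return sum((n1 + n2) / total * error_rate(n1, n2) for n1, n2 in children)

# Children of the Outlook split (class counts from Slide 10).
print(expected_error([(2, 3), (4, 0), (3, 2)]))   # 4/14, about 0.286
```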
Slide 25: Cost Sensitive Decision Trees
- When false positives (FP) and false negatives (FN) have different costs, the leaf node label depends on those costs:
  - If growing the tree further has a smaller total cost, then choose an attribute with minimal total cost.
  - Otherwise, stop and form a leaf.
- Label the leaf according to minimal total cost (see the sketch below):
  - Suppose the leaf has P positive examples and N negative examples.
  - FP denotes the cost of a false positive example and FN the cost of a false negative.
  - If (P*FN >= N*FP) then label = positive, else label = negative.
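The labeling rule on Slide 25 compares the misclassification cost of each possible leaf label. A small sketch (the naming and the example costs are my own illustration):

```python
def leaf_label(p, n, fp_cost, fn_cost):
    """Label a leaf with p positive and n negative examples by minimal total cost.

    Labeling the leaf positive misclassifies the n negatives: cost n * fp_cost.
    Labeling the leaf negative misclassifies the p positives: cost p * fn_cost.
    """
    return "positive" if p * fn_cost >= n * fp_cost else "negative"

# A leaf with 3 positives and 7 negatives: with equal costs it is labeled negative,
# but if a false negative is 5 times more expensive, the positive label is cheaper.
print(leaf_label(3, 7, fp_cost=1, fn_cost=1))   # negative
print(leaf_label(3, 7, fp_cost=1, fn_cost=5))   # positive (3*5 = 15 >= 7*1 = 7)
```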
Slide 26: 5. When there are missing values in the test data, we allow tests to be done
- Attribute selection criterion: minimal total cost (Ctotal = Cmc + Ctest) instead of minimal entropy as in C4.5.
- Typically, if there are missing values, obtaining a value for a missing attribute (say Temperature) incurs a new test cost.
- But it may increase the accuracy of prediction, thus reducing the misclassification cost.
- In general, there is a balance between the two costs.
- We care about the total cost.

Slide 27: Stopping Criteria
- When all cases have the same class: the leaf node is labeled by this class.
- When there is no available attribute: the leaf node is labeled by the majority class.
- When the number of cases is less than a specified threshold: the leaf node is labeled by the majority class (see the sketch below).
- You can make a decision at every node in a decision tree!
- How?
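These criteria slot directly into the recursive construction sketched after Slide 6. A hedged Python fragment (the threshold constant and its value are illustrative assumptions, not given in the slides):

```python
from collections import Counter

MIN_CASES = 2   # illustrative threshold; the lecture does not fix a value

def should_stop(examples, attributes):
    """Stop growing when any of the stopping criteria on Slide 27 holds."""
    classes = {label for _, label in examples}
    if len(classes) == 1:              # all cases have the same class
        return True
    if not attributes:                 # no available attribute left
        return True
    if len(examples) < MIN_CASES:      # fewer cases than the threshold
        return True
    return False

def leaf_for(examples):
    """Label the leaf by the (single or majority) class."""
    return Counter(label for _, label in examples).most_common(1)[0][0]
```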
Slide 28: Pruning
- Pruning simplifies a decision tree to prevent overfitting to noise in the data.
- Two main pruning strategies:
  1. Postpruning: takes a fully-grown decision tree and discards unreliable parts.
  2. Prepruning: stops growing a branch when the information becomes unreliable.
- Postpruning is preferred in practice because of the early stopping in prepruning.

Slide 29: Prepruning
- Usually based on a statistical significance test.
- Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node.
- Most popular test: the chi-squared test.
- ID3 used the chi-squared test in addition to information gain:
  - Only statistically significant attributes were allowed to be selected by the information gain procedure.

Slide 30: Postpruning
- Builds the full tree first and prunes it afterwards.
- Attribute interactions are visible in the fully-grown tree.
- Problem: identification of subtrees and nodes that are due to chance effects.
- Two main pruning operations:
  1. Subtree replacement
  2. Subtree raising
- Possible strategies: error estimation, significance testing, the MDL principle.

Slide 31: Subtree replacement
- Bottom-up: a tree is considered for replacement once all its subtrees have been considered.

Slide 32: Estimating error rates
- A pruning operation is performed only if it does not increase the estimated error.
- Of course, the error on the training data is not a useful estimator (it would result in almost no pruning).
- One possibility: use a hold-out set for pruning (reduced-error pruning).
- C4.5's method: use the upper limit of a 25% confidence interval derived from the training data.
  - A standard Bernoulli-process-based method.

Slide 33: Post-pruning in C4.5
- Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
- Method 1: compute accuracy using examples not seen by the algorithm.
- Method 2: estimate accuracy using the training examples:
  - Consider classifying E examples incorrectly out of N examples as observing E events in N trials of a binomial distribution.
  - For a given confidence level CF, the upper limit on the error rate over the whole population is U_CF(E, N), with CF% confidence.

Slide 34: Usage in statistics: sampling error estimation
- Example:
  - population: 1,000,000 people
  - population mean: the percentage of left-handed people
  - sample: 100 people
  - sample mean: 6 left-handed
  - How do we estimate the REAL population mean?
- Pessimistic estimate:
  [Figure: the interval from L0.25(100,6) to U0.25(100,6), roughly 2% to 10% on a "Possibility (%)" axis, around the observed 6%, labeled as the 25% confidence interval.]

Slide 35: C4.5's method
- The error estimate for a subtree is the weighted sum of the error estimates for all its leaves.
- Error estimate for a node (see the sketch after Slide 37):
    e = ( f + z^2/(2N) + z * sqrt( f/N - f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )
- If c = 25% then z = 0.69 (from the normal distribution).
- f is the error rate on the training data.
- N is the number of instances covered by the leaf.

Slide 36: Example
[Figure: a subtree rooted at Outlook with branches sunny, overcast and cloudy leading to leaves yes, yes and no; should it be replaced by a single "yes" leaf?]

Slide 37: Example (cont.)
- Consider a subtree rooted at Outlook with 3 leaf nodes:
  - Sunny: Play = yes (0 errors, 6 instances)
  - Overcast: Play = yes (0 errors, 9 instances)
  - Cloudy: Play = no (0 errors, 1 instance)
- The estimated ...
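The error-estimate formula on Slide 35 can be applied to the three leaves on Slide 37, whose calculation the extract cuts off. The sketch below is my own illustration, not the lecture's code: it implements e(f, N) with z = 0.69 and, assuming that a single merged "yes" leaf would misclassify the lone "no" instance (1 error out of 16, an assumption not stated in the slides), compares the two estimates.

```python
import math

Z = 0.69   # z-value for the 25% confidence level used by C4.5 (Slide 35)

def pessimistic_error(f, n, z=Z):
    """Upper confidence limit on the error rate of a leaf (formula on Slide 35).

    f : observed error rate on the training data at this leaf
    n : number of training instances covered by the leaf
    """
    num = f + z*z/(2*n) + z * math.sqrt(f/n - f*f/n + z*z/(4*n*n))
    return num / (1 + z*z/n)

# Leaves of the Outlook subtree on Slide 37: (errors, instances).
leaves = [(0, 6), (0, 9), (0, 1)]
total = sum(n for _, n in leaves)
subtree = sum(n/total * pessimistic_error(e/n, n) for e, n in leaves)

# If the subtree were replaced by a single "yes" leaf, it would misclassify
# the one "no" instance: 1 error out of 16 (my assumption, not from the slides).
merged = pessimistic_error(1/total, total)

print(round(subtree, 3), round(merged, 3))   # about 0.076 vs 0.118
```

Under these assumptions the merged leaf's estimate (about 0.118) exceeds the weighted subtree estimate (about 0.076), so subtree replacement would not be performed in this particular illustration.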