Slide 1: Classification with Decision Trees
Instructor: Qiang Yang, Hong Kong University of Science and Technology (qyang@cs.ust.hk)
Thanks: Eibe Frank and Jiawei Han

Slide 2: Decision Tree (Quinlan 93)
- An internal node represents a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or class label distribution.
- At each node, one attribute is chosen to split the training examples into distinct classes as much as possible.
- A new case is classified by following a matching path to a leaf node.

Slide 3: Training Set (the weather data)

    Outlook    Temperature  Humidity  Windy  Class
    sunny      hot          high      false  N
    sunny      hot          high      true   N
    overcast   hot          high      false  P
    rain       mild         high      false  P
    rain       cool         normal    false  P
    rain       cool         normal    true   N
    overcast   cool         normal    true   P
    sunny      mild         high      false  N
    sunny      cool         normal    false  P
    rain       mild         normal    false  P
    sunny      mild         normal    true   P
    overcast   mild         high      true   P
    overcast   hot          normal    false  P
    rain       mild         high      true   N

Slide 4: Example
(Figure: a decision tree for the weather data. Outlook is the root; the sunny branch leads to a Humidity test (high -> N, normal -> P), the overcast branch leads directly to P, and the rain branch leads to a Windy test (true -> N, false -> P).)

Slide 5: Building Decision Tree (Q93)
- Top-down tree construction
  - At start, all training examples are at the root.
  - Partition the examples recursively by choosing one attribute each time.
- Bottom-up tree pruning
  - Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
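As a concrete illustration of this top-down loop, here is a minimal Python sketch. It is not from the slides: the dictionary representation of a node and the names build_tree, majority_class and choose_attribute are assumptions made for the example.

    # Sketch of top-down decision-tree construction.
    # Each example is assumed to be a dict of attribute -> value plus a "Class" key.
    from collections import Counter

    def majority_class(examples):
        return Counter(e["Class"] for e in examples).most_common(1)[0][0]

    def build_tree(examples, attributes, choose_attribute):
        labels = {e["Class"] for e in examples}
        if len(labels) == 1:                    # all cases have the same class
            return {"leaf": labels.pop()}
        if not attributes:                      # no attribute left to test
            return {"leaf": majority_class(examples)}
        best = choose_attribute(examples, attributes)   # e.g. highest information gain
        node = {"test": best, "branches": {}}
        for value in {e[best] for e in examples}:       # one branch per observed value
            subset = [e for e in examples if e[best] == value]
            remaining = [a for a in attributes if a != best]
            node["branches"][value] = build_tree(subset, remaining, choose_attribute)
        return node

The choose_attribute argument is where the goodness functions discussed next (information gain, gain ratio, gini index) plug in.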

Slide 6: Choosing the Splitting Attribute
- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
- Typical goodness functions:
  - information gain (ID3/C4.5)
  - information gain ratio
  - gini index

Slide 7: Which attribute to select? (figure slide)

Slide 8: A criterion for attribute selection
- Which is the best attribute? The one that will result in the smallest tree.
- Heuristic: choose the attribute that produces the "purest" nodes.
- Popular impurity criterion: information gain. Information gain increases with the average purity of the subsets that an attribute produces.
- Strategy: choose the attribute that results in the greatest information gain.

Slide 9: Computing information
- Information is measured in bits.
- Given a probability distribution, the information required to predict an event is the distribution's entropy.
- Entropy gives the information required in bits (this can involve fractions of bits!).
- Formula for computing the entropy: suppose a set S has n values V1, V2, ..., Vn, where Vi has proportion pi; then
    entropy(p1, p2, ..., pn) = -p1*log2(p1) - p2*log2(p2) - ... - pn*log2(pn)
- E.g., the weather data has 2 values, Play = P and Play = N, with p1 = 9/14 and p2 = 5/14.
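To make the formula concrete, here is a small Python helper (an illustrative sketch, not code from the slides) that computes entropy from proportions; for the weather data it gives entropy(9/14, 5/14) of about 0.940 bits.

    import math

    def entropy(*proportions):
        """Entropy in bits of a distribution given as proportions summing to 1."""
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    # Weather data: 9 Play=P and 5 Play=N examples out of 14.
    print(entropy(9/14, 5/14))   # ~0.940 bits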

Slide 10: Example: attribute "Outlook"
- "Outlook" = "Sunny":
    info([2,3]) = entropy(2/5, 3/5) = -(2/5)*log2(2/5) - (3/5)*log2(3/5) = 0.971 bits
- "Outlook" = "Overcast":
    info([4,0]) = entropy(1, 0) = -1*log2(1) - 0*log2(0) = 0 bits
    (Note: log2(0) is normally not defined; the term 0*log2(0) is taken as 0.)
- "Outlook" = "Rainy":
    info([3,2]) = entropy(3/5, 2/5) = -(3/5)*log2(3/5) - (2/5)*log2(2/5) = 0.971 bits
- Expected information for the attribute:
    info([2,3], [4,0], [3,2]) = (5/14)*0.971 + (4/14)*0 + (5/14)*0.971 = 0.693 bits

Slide 11: Computing the information gain
- Information gain = information before splitting - information after splitting:
    gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes of the weather data:
    gain("Outlook")     = 0.247 bits
    gain("Temperature") = 0.029 bits
    gain("Humidity")    = 0.152 bits
    gain("Windy")       = 0.048 bits
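The same calculation can be scripted. The sketch below is illustrative (the data layout and function names are assumptions); run on the Outlook column of the weather data it prints roughly 0.247.

    import math
    from collections import Counter

    def entropy_of_counts(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c)

    # (Outlook, Class) pairs for the 14 weather examples.
    data = [("sunny", "N"), ("sunny", "N"), ("overcast", "P"), ("rain", "P"),
            ("rain", "P"), ("rain", "N"), ("overcast", "P"), ("sunny", "N"),
            ("sunny", "P"), ("rain", "P"), ("sunny", "P"), ("overcast", "P"),
            ("overcast", "P"), ("rain", "N")]

    def information_gain(pairs):
        before = entropy_of_counts(Counter(c for _, c in pairs).values())
        after = 0.0
        for value in {v for v, _ in pairs}:
            subset = [c for v, c in pairs if v == value]
            after += len(subset) / len(pairs) * entropy_of_counts(Counter(subset).values())
        return before - after

    print(information_gain(data))   # ~0.247 bits for Outlook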

Slide 12: Continuing to split
Within the "Outlook = sunny" branch:
    gain("Temperature") = 0.571 bits
    gain("Humidity")    = 0.971 bits
    gain("Windy")       = 0.020 bits

Slide 13: The final decision tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes.
- Splitting stops when the data can't be split any further.

Slide 14: Highly-branching attributes
- Problematic: attributes with a large number of values (extreme case: ID code).
- Subsets are more likely to be pure if there is a large number of values.
- Information gain is therefore biased towards choosing attributes with a large number of values.
- This may result in overfitting (selection of an attribute that is non-optimal for prediction).
- Another problem: fragmentation.

Slide 15: The gain ratio
- Gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes.
- Gain ratio takes the number and size of branches into account when choosing an attribute.
- It corrects the information gain by taking the intrinsic information of a split into account (also called split ratio).
- Intrinsic information: the entropy of the distribution of instances into branches (i.e., how much information do we need to tell which branch an instance belongs to).

Slide 16: Gain Ratio
- IntrinsicInfo(S, A) = - sum over i of (|Si| / |S|) * log2(|Si| / |S|)
- GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
- IntrinsicInfo should be:
  - large when the data is evenly spread over all branches,
  - small when all the data belong to one branch.
- The gain ratio (Quinlan 86) normalizes the information gain by IntrinsicInfo.

Slide 17: Computing the gain ratio
- Example: intrinsic information for ID code:
    intrinsic_info([1,1,...,1]) = 14 * (-(1/14) * log2(1/14)) = 3.807 bits
- The importance of an attribute decreases as its intrinsic information gets larger.
- Gain ratio:
    gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
- Example:
    gain_ratio("ID_code") = 0.940 bits / 3.807 bits = 0.246
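A short illustrative sketch of both quantities (the names intrinsic_info and gain_ratio are mine, not from the slides); it reproduces the 3.807-bit intrinsic information of the ID code and its 0.246 gain ratio.

    import math

    def intrinsic_info(branch_sizes):
        """Entropy of the distribution of instances into branches, in bits."""
        total = sum(branch_sizes)
        return -sum(n / total * math.log2(n / total) for n in branch_sizes if n)

    def gain_ratio(gain, branch_sizes):
        return gain / intrinsic_info(branch_sizes)

    # ID code: every one of the 14 weather examples lands in its own branch.
    print(intrinsic_info([1] * 14))        # ~3.807 bits
    print(gain_ratio(0.940, [1] * 14))     # ~0.246
    # Outlook: branch sizes 5 (sunny), 4 (overcast), 5 (rainy).
    print(gain_ratio(0.247, [5, 4, 5]))    # ~0.157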

Slide 18: Gain ratios for weather data

    Attribute    Info   Gain                   Split info             Gain ratio
    Outlook      0.693  0.940 - 0.693 = 0.247  info([5,4,5]) = 1.577  0.247 / 1.577 = 0.157
    Temperature  0.911  0.940 - 0.911 = 0.029  info([4,6,4]) = 1.557  0.029 / 1.557 = 0.019
    Humidity     0.788  0.940 - 0.788 = 0.152  info([7,7])   = 1.000  0.152 / 1.000 = 0.152
    Windy        0.892  0.940 - 0.892 = 0.048  info([8,6])   = 0.985  0.048 / 0.985 = 0.049

Slide 19: More on the gain ratio
- "Outlook" still comes out top.
- However, "ID code" has an even greater gain ratio. Standard fix: an ad hoc test to prevent splitting on that type of attribute.
- Problem with the gain ratio: it may overcompensate, i.e., it may choose an attribute just because its intrinsic information is very low.
- Standard fix:
  - First, only consider attributes with greater than average information gain.
  - Then, compare them on gain ratio.

Slide 20: Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
    gini(T) = 1 - sum over j of pj^2
  where pj is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.
- After splitting T into two subsets T1 and T2 with sizes N1 and N2, the gini index of the split data is defined as
    gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
- The attribute providing the smallest gini_split(T) is chosen to split the node.
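An illustrative Python sketch of the two definitions (function names and the sample split counts are assumptions); for a node holding the full weather data (9 P, 5 N) it gives a gini value of about 0.459.

    def gini(class_counts):
        """Gini index of a node given its class counts."""
        total = sum(class_counts)
        return 1.0 - sum((c / total) ** 2 for c in class_counts)

    def gini_split(left_counts, right_counts):
        """Gini index of a binary split, weighted by subset sizes."""
        n1, n2 = sum(left_counts), sum(right_counts)
        n = n1 + n2
        return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

    print(gini([9, 5]))                  # ~0.459 for the whole weather data
    print(gini_split([3, 4], [6, 1]))    # hypothetical binary split of the same data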

Slide 21: Discussion
- Consider the following variations of decision trees.

Slide 22: 1. Apply KNN to each leaf node
- Instead of choosing the majority class label as the leaf's class label, use KNN to choose a class label.

Slide 23: 2. Apply Naive Bayesian at each leaf node
- For each leaf node, use all the available information we know about the test case to make decisions.
- Instead of using the majority rule, use probability/likelihood to make decisions.

Slide 24: 3. Use error rates instead of entropy
- If a node has N1 examples of the positive class P and N2 examples of the negative class N:
  - if N1 > N2, then choose P;
  - the error rate at this node is N2 / (N1 + N2).
- The expected error at a parent node can be calculated as the weighted sum of the error rates at its child nodes, where the weights are the proportion of training data in each child (a small worked example follows below).
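A brief illustrative computation of that weighted sum; the child counts and error rates below are made up for the example.

    def expected_error(children):
        """children: list of (n_examples, error_rate) pairs, one per child node."""
        total = sum(n for n, _ in children)
        return sum(n / total * err for n, err in children)

    # Hypothetical split: one child with 10 examples (error 0.2), one with 4 (error 0.5).
    print(expected_error([(10, 0.2), (4, 0.5)]))   # ~0.286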

Slide 25: Cost Sensitive Decision Trees
- When FP and FN have different costs, the leaf node label depends on the costs:
  - If growing the tree further has a smaller total cost, then choose an attribute with minimal total cost.
  - Otherwise, stop and form a leaf.
- Label the leaf according to minimal total cost:
  - Suppose the leaf has P positive examples and N negative examples.
  - FP denotes the cost of a false positive example and FN the cost of a false negative.
  - If (P * FN >= N * FP) then label = positive, else label = negative.
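A minimal sketch of this labeling rule, assuming the comparison as reconstructed above (function and parameter names are mine): it compares the misclassification cost of each possible label and returns the cheaper one.

    def label_leaf(num_pos, num_neg, cost_fp, cost_fn):
        """Label a leaf by minimal total misclassification cost.

        Labeling the leaf positive misclassifies the negatives (cost num_neg * cost_fp);
        labeling it negative misclassifies the positives (cost num_pos * cost_fn).
        """
        return "positive" if num_pos * cost_fn >= num_neg * cost_fp else "negative"

    # A leaf with 3 positives and 7 negatives: with equal costs it is labeled negative,
    # but if a false negative is 5x as costly as a false positive it flips to positive.
    print(label_leaf(3, 7, cost_fp=1.0, cost_fn=1.0))   # negative
    print(label_leaf(3, 7, cost_fp=1.0, cost_fn=5.0))   # positive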

Slide 26: 5. When there are missing values in the test data, we allow tests to be done
- Attribute selection criterion: minimal total cost (Ctotal = Cmc + Ctest) instead of minimal entropy as in C4.5.
- Typically, if there are missing values, then obtaining a value for a missing attribute (say Temperature) will incur a new cost.
- But it may increase the accuracy of prediction, thus reducing the misclassification cost.
- In general, there is a balance between the two costs; we care about the total cost.

Slide 27: Stopping Criteria
- When all cases have the same class: the leaf node is labeled by this class.
- When there is no available attribute: the leaf node is labeled by the majority class.
- When the number of cases is less than a specified threshold: the leaf node is labeled by the majority class.
- You can make a decision at every node in a decision tree! How?

Slide 28: Pruning
- Pruning simplifies a decision tree to prevent overfitting to noise in the data.
- Two main pruning strategies:
  1. Postpruning: takes a fully-grown decision tree and discards unreliable parts.
  2. Prepruning: stops growing a branch when information becomes unreliable.
- Postpruning is preferred in practice because of early stopping in prepruning.

Slide 29: Prepruning
- Usually based on a statistical significance test.
- Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node.
- Most popular test: the chi-squared test.
- ID3 used the chi-squared test in addition to information gain: only statistically significant attributes were allowed to be selected by the information gain procedure.

Slide 30: Postpruning
- Builds the full tree first and prunes it afterwards.
- Attribute interactions are visible in the fully-grown tree.
- Problem: identification of subtrees and nodes that are due to chance effects.
- Two main pruning operations:
  1. Subtree replacement
  2. Subtree raising
- Possible strategies: error estimation, significance testing, MDL principle.

Slide 31: Subtree replacement
- Bottom-up: a tree is considered for replacement once all its subtrees have been considered.

Slide 32: Estimating error rates
- The pruning operation is performed if it does not increase the estimated error.
- Of course, the error on the training data is not a useful estimator (it would result in almost no pruning).
- One possibility: use a hold-out set for pruning (reduced-error pruning).
- C4.5's method: use the upper limit of the 25% confidence interval derived from the training data (a standard Bernoulli-process-based method).

Slide 33: Post-pruning in C4.5
- Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
- Method 1: compute accuracy using examples not seen by the algorithm.
- Method 2: estimate accuracy using the training examples:
  - Consider classifying E examples incorrectly out of N examples as observing E events in N trials of a binomial distribution.
  - For a given confidence level CF, the upper limit on the error rate over the whole population is U_CF(E, N), with CF% confidence.

Slide 34: Pessimistic Estimate
- Usage in statistics: sampling error estimation.
- Example:
  - population: 1,000,000 people
  - population mean: percentage of left-handed people
  - sample: 100 people
  - sample mean: 6 left-handed
- How to estimate the REAL population mean?
(Figure: the 25% confidence interval around the sample mean 6, from L0.25(100,6) to U0.25(100,6); vertical axis labeled Possibility (%), horizontal marks at 2, 6 and 10.)

Slide 35: C4.5's method
- The error estimate for a subtree is the weighted sum of the error estimates for all its leaves.
- Error estimate for a node:
    e = ( f + z^2/(2N) + z * sqrt( f/N - f^2/N + z^2/(4N^2) ) ) / ( 1 + z^2/N )
- If c = 25% then z = 0.69 (from the normal distribution).
- f is the error on the training data.
- N is the number of instances covered by the leaf.
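A small illustrative implementation of this estimate, assuming the formula as reconstructed above with z = 0.69 for c = 25% (function and parameter names are mine).

    import math

    def pessimistic_error(f, n, z=0.69):
        """C4.5-style upper error estimate for a leaf.

        f: observed error rate on the training data at the leaf
        n: number of training instances covered by the leaf
        z: normal-distribution score for the confidence level (0.69 for c = 25%)
        """
        return (f + z * z / (2 * n)
                + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

    # A leaf that covers 6 training instances with no observed errors still gets
    # a non-zero pessimistic error estimate.
    print(pessimistic_error(0.0, 6))   # ~0.074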

Slide 36: Example
(Figure: a subtree rooted at Outlook with three branches, sunny -> yes, overcast -> yes, cloudy -> no; should it be replaced by a single "yes" leaf?)

Slide 37: Example cont.
- Consider a subtree rooted at Outlook with 3 leaf nodes:
  - Sunny: Play = yes (0 errors, 6 instances)
  - Overcast: Play = yes (0 errors, 9 instances)
  - Cloudy: Play = no (0 errors, 1 instance)
- The estimated ...
