Classification with Decision Trees
Instructor: Qiang Yang, Hong Kong University of Science and Technology (qyang@cs.ust.hk)
Thanks: Eibe Frank and Jiawei Han

Decision Trees [Quinlan 93]
- An internal node represents a test on an attribute.
- A branch represents an outcome of the test, e.g., Color = red.
- A leaf node represents a class label or a class label distribution.
- At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
- A new case is classified by following the matching path down to a leaf node.

Training set (the weather data):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Example decision tree for this data: the root tests Outlook; the sunny branch tests Humidity (high -> N, normal -> P), the overcast branch is a leaf labeled P, and the rain branch tests Windy (true -> N, false -> P).

Building Decision Trees [Quinlan 93]
- Top-down tree construction: at the start, all training examples are at the root; partition the examples recursively by choosing one attribute at a time.
- Bottom-up tree pruning: remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
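As an illustration of how a new case is classified by following a matching path, the example tree above can be encoded as nested dictionaries and walked from the root to a leaf. This is a sketch of our own (the representation and function name are not from the slides):

```python
# The example tree above, encoded as nested dicts: an internal node maps an
# attribute name to a dict of branch values; a plain string is a leaf label.
tree = {"Outlook": {
    "sunny": {"Humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain": {"Windy": {True: "N", False: "P"}},
}}

def classify(node, case):
    """Follow the matching path from the root down to a leaf label."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[case[attribute]]
    return node

print(classify(tree, {"Outlook": "sunny", "Humidity": "high", "Windy": False}))   # N
```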

Choosing the Splitting Attribute
- At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
- Typical goodness functions: information gain (ID3/C4.5), information gain ratio, and the gini index.

Which attribute to select?

A Criterion for Attribute Selection
- Which is the best attribute? The one that results in the smallest tree.
- Heuristic: choose the attribute that produces the "purest" nodes.
- A popular impurity criterion is information gain: the information gain increases with the average purity of the subsets that an attribute produces.
- Strategy: choose the attribute that results in the greatest information gain.

Computing Information
- Information is measured in bits.
- Given a probability distribution, the information required to predict an event is the distribution's entropy.
- Entropy gives the required information in bits (this can involve fractions of a bit!).
- Suppose a set S takes n values V1, V2, ..., Vn, where Vi has proportion pi. Then
  entropy(p1, p2, ..., pn) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
- E.g., the weather data has two class values, Play = P and Play = N, with p1 = 9/14 and p2 = 5/14.
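As a quick illustration, the entropy formula can be evaluated with a few lines of Python. The function name and the convention that 0 × log2(0) = 0 are our own choices, not part of the slides:

```python
import math

def entropy(proportions):
    """Entropy in bits of a distribution given as a list of proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Class distribution of the weather data: 9 Play=P and 5 Play=N out of 14 cases.
print(round(entropy([9/14, 5/14]), 3))   # about 0.940 bits
```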

Example: Attribute "Outlook"
- "Outlook" = "sunny": info([2,3]) = entropy(2/5, 3/5) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971 bits
- "Outlook" = "overcast": info([4,0]) = entropy(1, 0) = -1 log2(1) - 0 log2(0) = 0 bits
  (Note: 0 log2(0) is not normally defined; it is taken to be 0 here.)
- "Outlook" = "rainy": info([3,2]) = entropy(3/5, 2/5) = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971 bits
- Expected information for the attribute:
  info([2,3], [4,0], [3,2]) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 = 0.693 bits

Computing the Information Gain
- Information gain = information before splitting - information after splitting:
  gain("Outlook") = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
- Information gain for the attributes of the weather data:
  gain("Outlook") = 0.247 bits
  gain("Temperature") = 0.029 bits
  gain("Humidity") = 0.152 bits
  gain("Windy") = 0.048 bits

Continuing to Split
- Within the "Outlook" = "sunny" subset:
  gain("Temperature") = 0.571 bits
  gain("Humidity") = 0.971 bits
  gain("Windy") = 0.020 bits

The Final Decision Tree
- Note: not all leaves need to be pure; sometimes identical instances have different classes.
- Splitting stops when the data cannot be split any further.
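The gains listed above can be reproduced directly from the training table. The following sketch encodes the weather data as tuples; the data layout and helper names are our own, not from the slides:

```python
from collections import Counter
import math

# The weather data: (Outlook, Temperature, Humidity, Windy, Class).
data = [
    ("sunny", "hot", "high", False, "N"), ("sunny", "hot", "high", True, "N"),
    ("overcast", "hot", "high", False, "P"), ("rain", "mild", "high", False, "P"),
    ("rain", "cool", "normal", False, "P"), ("rain", "cool", "normal", True, "N"),
    ("overcast", "cool", "normal", True, "P"), ("sunny", "mild", "high", False, "N"),
    ("sunny", "cool", "normal", False, "P"), ("rain", "mild", "normal", False, "P"),
    ("sunny", "mild", "normal", True, "P"), ("overcast", "mild", "high", True, "P"),
    ("overcast", "hot", "normal", False, "P"), ("rain", "mild", "high", True, "N"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy_of(rows):
    """Entropy in bits of the class distribution of a list of rows."""
    counts = Counter(r[-1] for r in rows)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr):
    """Entropy before the split minus the weighted entropy of the subsets."""
    idx = ATTRS[attr]
    before = entropy_of(rows)
    after = 0.0
    for value in set(r[idx] for r in rows):
        subset = [r for r in rows if r[idx] == value]
        after += len(subset) / len(rows) * entropy_of(subset)
    return before - after

for a in ATTRS:
    print(a, round(info_gain(data, a), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```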

Highly-Branching Attributes
- Problematic: attributes with a large number of values (extreme case: an ID code).
- Subsets are more likely to be pure if there is a large number of values.
- Information gain is therefore biased towards choosing attributes with a large number of values.
- This may result in overfitting (selection of an attribute that is non-optimal for prediction).
- Another problem: fragmentation.

The Gain Ratio
- Gain ratio: a modification of the information gain that reduces its bias towards high-branch attributes.
- The gain ratio takes the number and size of branches into account when choosing an attribute.
- It corrects the information gain by taking the intrinsic information of a split into account (also called the split ratio).
- Intrinsic information: the entropy of the distribution of instances into branches (i.e., how much information we need to tell which branch an instance belongs to).

Gain Ratio
- IntrinsicInfo(S, A) = Σ_i ( -(|S_i| / |S|) × log2(|S_i| / |S|) ), summing over the subsets S_i produced by splitting S on attribute A.
- GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
- IntrinsicInfo is large when the data are spread evenly over all branches, and small when all the data belong to one branch.
- The gain ratio (Quinlan 86) normalizes the information gain by the intrinsic information.

Computing the Gain Ratio
- Example: intrinsic information of an ID code attribute:
  info([1, 1, ..., 1]) = 14 × ( -(1/14) × log2(1/14) ) = 3.807 bits
- The importance of an attribute decreases as its intrinsic information gets larger.
- gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
- Example: gain_ratio("ID_code") = 0.940 bits / 3.807 bits = 0.246

Gain Ratios for the Weather Data

Attribute     Info   Gain                   Split info             Gain ratio
Outlook       0.693  0.940 - 0.693 = 0.247  info([5,4,5]) = 1.577  0.247 / 1.577 = 0.156
Temperature   0.911  0.940 - 0.911 = 0.029  info([4,6,4]) = 1.362  0.029 / 1.362 = 0.021
Humidity      0.788  0.940 - 0.788 = 0.152  info([7,7]) = 1.000    0.152 / 1.000 = 0.152
Windy         0.892  0.940 - 0.892 = 0.048  info([8,6]) = 0.985    0.048 / 0.985 = 0.049
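A small sketch of the gain ratio computation, reusing the numbers above; the helper names are our own:

```python
import math

def info(counts):
    """Entropy in bits of a distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(gain, branch_sizes):
    """Normalize an information gain by the intrinsic information of the split."""
    return gain / info(branch_sizes)

# Outlook: gain 0.247, branches of sizes 5, 4, 5 -> split info 1.577, ratio about 0.156.
print(round(gain_ratio(0.247, [5, 4, 5]), 3))
# ID code: gain 0.940, 14 singleton branches -> split info 3.807, ratio roughly 0.247.
print(round(gain_ratio(0.940, [1] * 14), 3))
```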

More on the Gain Ratio
- "Outlook" still comes out on top.
- However, "ID code" has an even greater gain ratio.
- Standard fix: an ad hoc test to prevent splitting on that type of attribute.
- Problem with the gain ratio: it may overcompensate and choose an attribute just because its intrinsic information is very low.
- Standard fix: first, consider only attributes with greater-than-average information gain; then compare them on gain ratio.

Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - Σ_j (p_j)²
  where p_j is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.
- After splitting T into two subsets T1 and T2 with sizes N1 and N2, the gini index of the split data is defined as
  gini_split(T) = (N1/N) × gini(T1) + (N2/N) × gini(T2)
- The attribute providing the smallest gini_split(T) is chosen to split the node.
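A sketch of the two gini formulas for a binary split; the function names and the use of the Humidity split of the weather data as an example are our own choices:

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Size-weighted gini index of splitting T into T1 and T2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) \
         + (len(right_labels) / n) * gini(right_labels)

# Weather data split on Humidity: high -> 3 P / 4 N, normal -> 6 P / 1 N.
print(round(gini_split(["P"] * 3 + ["N"] * 4, ["P"] * 6 + ["N"] * 1), 3))
```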

Discussion
- Consider the following variations of decision trees.

1. Apply kNN at Each Leaf Node
- Instead of labeling the leaf with the majority class, use kNN to choose a class label.

2. Apply Naive Bayes at Each Leaf Node
- For each leaf node, use all the available information we know about the test case to make the decision.
- Instead of using the majority rule, use probability/likelihood to make the decision.

3. Use Error Rates Instead of Entropy
- Suppose a node has N1 examples of the positive class P and N2 examples of the negative class N.
- If N1 > N2, then choose P; the error rate at this node is N2 / (N1 + N2).
- The expected error at a parent node is the weighted sum of the error rates at its child nodes, where the weights are the proportions of training data falling into each child.
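A minimal sketch of this error-rate criterion, using the Outlook split of the weather data as the example; the function names are our own:

```python
def node_error_rate(n_pos, n_neg):
    """Error rate if the node is labeled with its majority class."""
    return min(n_pos, n_neg) / (n_pos + n_neg)

def expected_error(children):
    """Weighted sum of child error rates; children is a list of (n_pos, n_neg) pairs."""
    total = sum(p + n for p, n in children)
    return sum((p + n) / total * node_error_rate(p, n) for p, n in children)

# Splitting the weather data on Outlook: sunny 2P/3N, overcast 4P/0N, rain 3P/2N.
print(round(expected_error([(2, 3), (4, 0), (3, 2)]), 3))   # 4/14, about 0.286
```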

Cost-Sensitive Decision Trees
- When false positives (FP) and false negatives (FN) have different costs, the label of a leaf node depends on those costs.
- If growing the tree further gives a smaller total cost, then choose an attribute with minimal total cost; otherwise, stop and form a leaf.
- Label the leaf according to minimal total cost. Suppose the leaf has P positive examples and N negative examples, FP denotes the cost of a false positive example and FN the cost of a false negative:
  if P × FN ≥ N × FP then label = positive, else label = negative.

5. When There Are Missing Values in the Test Data, We Allow Tests to Be Done
- Attribute selection criterion: minimal total cost (Ctotal = Cmc + Ctest) instead of minimal entropy as in C4.5.
- Typically, if there are missing values, obtaining a value for a missing attribute (say, Temperature) incurs a new test cost.
- But it may increase the accuracy of prediction, thus reducing the misclassification cost.
- In general, there is a balance between the two costs; we care about the total cost.
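The cost-sensitive leaf-labeling rule can be written as a short helper. This is a sketch with names of our own choosing, and it assumes the comparison intended in the slide is P × FN ≥ N × FP:

```python
def cost_sensitive_leaf_label(n_pos, n_neg, cost_fp, cost_fn):
    """Label a leaf with the class that minimizes total misclassification cost.

    Labeling the leaf positive turns the n_neg negatives into false positives
    (cost n_neg * cost_fp); labeling it negative costs n_pos * cost_fn.
    """
    if n_pos * cost_fn >= n_neg * cost_fp:
        return "positive"
    return "negative"

# A leaf with 3 positives and 7 negatives, where false negatives are 5x as costly:
print(cost_sensitive_leaf_label(3, 7, cost_fp=1.0, cost_fn=5.0))   # positive (15 >= 7)
```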

Stopping Criteria
- When all cases have the same class: the leaf node is labeled with this class.
- When there is no available attribute: the leaf node is labeled with the majority class.
- When the number of cases is less than a specified threshold: the leaf node is labeled with the majority class.
- In fact, you can make a decision at every node in a decision tree! How?

Pruning
- Pruning simplifies a decision tree to prevent overfitting to noise in the data.
- Two main pruning strategies:
  1. Postpruning: take a fully-grown decision tree and discard unreliable parts.
  2. Prepruning: stop growing a branch when the information becomes unreliable.
- Postpruning is preferred in practice because prepruning tends to stop too early.

Prepruning
- Usually based on a statistical significance test.
- Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node.
- Most popular test: the chi-squared test.
- ID3 used a chi-squared test in addition to information gain: only statistically significant attributes were allowed to be selected by the information gain procedure.

Postpruning
- Build the full tree first and prune it afterwards; attribute interactions are visible in the fully-grown tree.
- Problem: identifying the subtrees and nodes that are due to chance effects.
- Two main pruning operations:
  1. Subtree replacement
  2. Subtree raising
- Possible strategies: error estimation, significance testing, the MDL principle.

Subtree Replacement
- Bottom-up: a tree is considered for replacement once all its subtrees have been considered.

Estimating Error Rates
- A pruning operation is performed only if it does not increase the estimated error.
- Of course, the error on the training data is not a useful estimator (it would result in almost no pruning).
- One possibility: use a hold-out set for pruning (reduced-error pruning).
- C4.5's method: use the upper limit of the 25% confidence interval derived from the training data (a standard Bernoulli-process-based method).

Post-Pruning in C4.5
- Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
- Method 1: compute accuracy using examples not seen by the algorithm.
- Method 2: estimate accuracy using the training examples. Consider classifying E examples incorrectly out of N examples as observing E events in N trials of a binomial distribution; for a given confidence level CF, the upper limit on the error rate over the whole population is U_CF(E, N), with CF% confidence.

Pessimistic Estimate
- Usage in statistics: sampling error estimation.
- Example: population of 1,000,000 people; population mean: the percentage of left-handed people; sample: 100 people; sample mean: 6 left-handed.
- How do we estimate the REAL population mean?
- (Figure: the 25% confidence interval around the sample value 6, with lower limit L_0.25(100, 6) and upper limit U_0.25(100, 6), on a possibility (%) axis.)

C4.5's Method
- The error estimate for a subtree is the weighted sum of the error estimates for all its leaves.
- The error estimate for a node is
  e = ( f + z²/(2N) + z × sqrt( f/N - f²/N + z²/(4N²) ) ) / ( 1 + z²/N )
- If c = 25%, then z = 0.69 (from the normal distribution).
- f is the error on the training data and N is the number of instances covered by the leaf.

Example
- (Figure: a subtree rooted at Outlook, with branches sunny, overcast and cloudy leading to leaves yes, yes and no; should the subtree be replaced by a single "yes" leaf?)

Example (continued)
- Consider the subtree rooted at Outlook with 3 leaf nodes:
  Sunny: Play = yes (0 errors, 6 instances)
  Overcast: Play = yes (0 errors, 9 instances)
  Cloudy: Play = no (0 errors, 1 instance)
- The estimated error of the subtree is then the weighted sum of the pessimistic error estimates of these three leaves.
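A sketch of C4.5's pessimistic estimate applied to the three leaves above; the function name is our own, and z = 0.69 corresponds to the default 25% confidence level:

```python
import math

def pessimistic_error(f, n, z=0.69):
    """C4.5's upper-confidence-limit error estimate for a leaf.

    f is the observed error rate on the n training instances covered by the
    leaf; z = 0.69 corresponds to the default 25% confidence level.
    """
    return (f + z * z / (2 * n)
            + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))) / (1 + z * z / n)

# Leaves of the example subtree: (errors, instances) = (0, 6), (0, 9), (0, 1).
leaves = [(0, 6), (0, 9), (0, 1)]
total = sum(n for _, n in leaves)
subtree_estimate = sum(n / total * pessimistic_error(e / n, n) for e, n in leaves)
print(round(subtree_estimate, 3))   # about 0.076 with these counts
```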
