Recap / Today's Topics
- Feature selection for text classification
- Measuring classification performance
- Nearest neighbor categorization

Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words and more.
- Make using a particular classifier feasible: some classifiers can't deal with 100,000s of features.
- Reduce training time: training time for some methods is quadratic or worse in the number of features (e.g., logistic regression).
- Improve generalization: eliminate noise features, avoid overfitting.

Recap: Feature Reduction
Standard ways of reducing the feature space for text:
- Stemming: laugh, laughs, laughing, laughed -> laugh
- Stop word removal: e.g., eliminate all prepositions
- Conversion to lower case
- Tokenization: break on all special characters: fire-fighter -> fire, fighter

Feature Selection (Yang and Pedersen 1997)
Comparison of different selection criteria:
- DF: document frequency
- IG: information gain
- MI: mutual information
- CHI: chi-square
Common strategy: compute the statistic for each term and keep the n terms with the highest value of that statistic (a sketch follows below).

Information Gain; (Pointwise) Mutual Information
Pointwise MI scores a term t for a class c by log P(t,c) / (P(t) P(c)); see the IG vs MI discussion further down.

Chi-Square
Contingency table for a term and a category:

                                        Term present   Term absent
  Document belongs to category               A              B
  Document does not belong to category       C              D

  X^2 = N (AD - BC)^2 / ( (A+B)(A+C)(B+D)(C+D) ),  where N = A + B + C + D

Use either the maximum or the average X^2 over categories. What is its value for complete independence?

Document Frequency
- The number of documents a term occurs in.
- Sometimes used for eliminating both very frequent and very infrequent terms.
- How is the document frequency measure different from the other three measures?
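As a concrete illustration of the "compute a statistic per term, keep the top n" strategy, here is a minimal Python sketch that scores terms with the chi-square formula above and also computes document frequency. The toy corpus, labels, and function names are illustrative and not taken from Yang & Pedersen.

```python
from collections import Counter

def chi_square(A, B, C, D):
    """X^2 = N (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D)) for the 2x2 table above:
    A: term present & doc in category,     B: term absent & doc in category,
    C: term present & doc not in category, D: term absent & doc not in category."""
    N = A + B + C + D
    denom = (A + B) * (A + C) * (B + D) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom

def document_frequency(docs):
    """Number of documents each term occurs in."""
    return Counter(t for d in docs for t in set(d.split()))

def select_by_chi_square(docs, labels, category, n):
    """Common strategy: compute the statistic for each term, keep the top n terms."""
    doc_terms = [set(d.split()) for d in docs]
    vocab = set().union(*doc_terms)
    scores = {}
    for t in vocab:
        A = sum(1 for terms, y in zip(doc_terms, labels) if t in terms and y == category)
        B = sum(1 for terms, y in zip(doc_terms, labels) if t not in terms and y == category)
        C = sum(1 for terms, y in zip(doc_terms, labels) if t in terms and y != category)
        D = sum(1 for terms, y in zip(doc_terms, labels) if t not in terms and y != category)
        scores[t] = chi_square(A, B, C, D)
    return sorted(scores, key=scores.get, reverse=True)[:n]

docs = ["wheat harvest falls", "wheat prices rise", "gold prices rise", "gold mine opens"]
labels = ["wheat", "wheat", "gold", "gold"]
print(select_by_chi_square(docs, labels, "wheat", 2))
print(document_frequency(docs).most_common(3))
```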
Yang & Pedersen: Experiments
Two classification methods:
- kNN (k nearest neighbors; more later)
- Linear Least Squares Fit (LLSF), a regression method
Collections:
- Reuters-22173: 92 categories, 16,000 unique terms
- Ohsumed (a subset of MEDLINE): 14,000 categories, 72,000 unique terms
ltc term weighting.

Yang & Pedersen: Experimental Procedure
- Choose the feature set size.
- Preprocess the collection, discarding non-selected features/words.
- Apply term weighting -> feature vector for each document.
- Train the classifier on the training set.
- Evaluate the classifier on the test set.

Discussion
- You can eliminate 90% of the features for IG, DF, and CHI without decreasing performance; in fact, performance increases with fewer features for IG, DF, and CHI.
- Mutual information is very sensitive to small counts.
- IG does best with the smallest number of features.
- Document frequency is close to optimal, and is by far the simplest feature selection method.
- Similar results for LLSF (regression).

Results
- Why is selecting common terms a good strategy?
- IG, DF, and CHI are correlated.

Information Gain vs Mutual Information
- Information gain is similar to MI for random variables. (What happens under independence?)
- In contrast, pointwise MI ignores non-occurrence of terms. E.g., for complete dependence you get P(AB) / (P(A) P(B)) = 1/P(A), which is larger for rare terms than for frequent terms.
- Yang & Pedersen: pointwise MI favors rare terms (a numeric illustration follows below).
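To see why pointwise MI is sensitive to small counts, the following sketch compares a term that occurs once (inside the class) with a term that occurs 100 times (90 of them inside the class). The counts are made up for a hypothetical 1,000-document collection with 100 documents in the class.

```python
import math

def pointwise_mi(A, B, C, D):
    """PMI(t, c) = log2( P(t,c) / (P(t) P(c)) ), estimated from the 2x2 table used above:
    A: term present & doc in class,     B: term absent & doc in class,
    C: term present & doc not in class, D: term absent & doc not in class."""
    N = A + B + C + D
    return math.log2((A / N) / (((A + C) / N) * ((A + B) / N)))

# "rare" occurs in a single document, which happens to be in the class.
print(round(pointwise_mi(A=1, B=99, C=0, D=900), 2))    # ~3.32 bits
# "frequent" occurs in 100 documents, 90 of them in the class.
print(round(pointwise_mi(A=90, B=10, C=10, D=890), 2))  # ~3.17 bits
# One accidental occurrence of a rare term outranks a reliably associated frequent term.
```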
Feature Selection: Other Considerations
- Generic vs class-specific: completely generic (class-independent), a separate feature set for each class, or mixed (a la Yang & Pedersen).
- Maintainability over time: is aggressive feature selection good or bad for robustness over time?
- Ideal: optimal features selected as part of training.

Yang & Pedersen: Limitations
- Don't look at class-specific feature selection.
- Don't look at methods that can't handle high-dimensional spaces.
- Evaluate category ranking (as opposed to classification accuracy).

Feature Selection: Other Methods
- Stepwise term selection (forward, backward). Expensive: needs on the order of n^2 iterations of training.
- Term clustering.
- Dimension reduction: PCA / SVD.

Word Rep. vs. Dimension Reduction
- Word representations: one dimension for each word (binary, count, or weight).
- Dimension reduction: each dimension is a unique linear combination of all words (in the linear case); a small sketch follows below.
- Dimension reduction is good for generic topics ("politics"), bad for specific classes ("Rwanda"). Why?
- SVD/PCA is computationally expensive and adds implementation complexity.
- No clear examples of higher performance through dimension reduction.
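A minimal sketch of linear dimension reduction via SVD on a toy term-document matrix, assuming numpy is available; the matrix and the choice k = 2 are illustrative only.

```python
import numpy as np

# Toy term-document matrix: rows = documents, columns = term counts.
X = np.array([
    [2, 1, 0, 0],   # document about politics
    [1, 2, 0, 0],   # document about politics
    [0, 0, 1, 2],   # document about sports
    [0, 0, 2, 1],   # document about sports
], dtype=float)

# SVD: X = U S Vt. Keeping the k largest singular values projects each document
# onto k dimensions, each a linear combination of all terms.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_reduced = U[:, :k] * S[:k]   # k-dimensional representation of each document
print(X_reduced.round(2))
```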
Measuring Classification: Figures of Merit
- Accuracy of classification: the main evaluation criterion in academia (more in a moment).
- Speed of training the statistical classifier.
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions: kNN and complex preprocessing requirements.
- Effort in creating the training set (human hours/topic): more on this in Lecture 9 (Active Learning).

Measures of Accuracy
- Error rate: not a good measure for small classes. Why?
- Precision/recall for classification decisions.
- F1 measure (harmonic mean): 1/F1 = 1/2 (1/P + 1/R); a small computation follows below.
- Breakeven point.
- Correct estimate of the size of a category. Why is this different?
- Precision/recall for ranking classes.
- Stability over time / concept drift.
- Utility.
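A small helper computing precision, recall, and F1 from decision counts, matching the harmonic-mean definition above; the counts in the example call are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from classification decision counts.
    F1 is the harmonic mean: 1/F1 = (1/2) * (1/P + 1/R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(precision_recall_f1(tp=90, fp=10, fn=10))  # (0.9, 0.9, 0.9)
```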
Precision/Recall for Ranking Classes
Example: "Bad wheat harvest in Turkey"
- True categories: Wheat, Turkey
- Ranked category list: 0.9 turkey, 0.7 poultry, 0.5 armenia, 0.4 barley, 0.3 georgia
- Precision at 5: 0.2; recall at 5: 0.5 (the computation is sketched below)

Precision/Recall for Ranking Classes (continued)
- Consider problems with many categories (> 10).
- Use a method returning scores that are comparable across categories (not: Naive Bayes).
- Rank categories and compute average precision/recall (or another measure characterizing the precision/recall curve).
- A good measure for interactive support of human categorization; useless for an "autonomous" system (e.g. a filter on a stream of newswire stories).
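The following sketch reproduces the precision/recall-at-k numbers for the "Bad wheat harvest in Turkey" example; the function name is illustrative.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision@k and recall@k for a ranked list of categories."""
    hits = sum(1 for c in ranked[:k] if c in relevant)
    return hits / k, hits / len(relevant)

ranked = ["turkey", "poultry", "armenia", "barley", "georgia"]  # by decreasing score
relevant = {"wheat", "turkey"}
print(precision_recall_at_k(ranked, relevant, 5))  # (0.2, 0.5)
```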
Concept Drift
- Categories change over time. Example: "president of the united states" - in 1999, clinton is a great feature; in 2002, clinton is a bad feature.
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection: good or bad for protecting against concept drift?

Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.

Micro- vs. Macro-Averaging: Example

  Class 1              Truth: yes   Truth: no
  Classifier: yes          10           10
  Classifier: no           10          970

  Class 2              Truth: yes   Truth: no
  Classifier: yes          90           10
  Classifier: no           10          890

  Micro-averaged table Truth: yes   Truth: no
  Classifier: yes         100           20
  Classifier: no           20         1860

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 = 0.83
Why this difference? (The computation is sketched below.)
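A short sketch reproducing the macro- and microaveraged precision values from the example tables above.

```python
def precision(tp, fp):
    return tp / (tp + fp)

# Per-class counts from the tables above: (true positives, false positives).
class1 = (10, 10)   # precision 0.5
class2 = (90, 10)   # precision 0.9

# Macroaveraging: average the per-class precisions.
macro = (precision(*class1) + precision(*class2)) / 2            # 0.7

# Microaveraging: pool all decisions into one contingency table first.
micro = precision(class1[0] + class2[0], class1[1] + class2[1])  # 100/120
print(round(macro, 2), round(micro, 2))
```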
Reuters 1
Newswire text. Statistics (they vary according to the version used):
- Training set: 9,610 documents; test set: 3,662 documents.
- 50% of documents have no category assigned.
- Average document length: 90.6.
- Number of classes: 92. Example classes: currency exchange, wheat, gold.
- Maximum number of classes assigned to one document: 14.
- Average number of classes assigned: 1.24 for docs with at least one category.
- Only about 10 out of the 92 categories are large. Microaveraging measures performance on the large categories.
Factors Affecting Measures
- Variability of the data: document size/length, quality/style of authorship, uniformity of vocabulary.
- Variability of the "truth" / gold standard: we need a definitive judgement on which topic(s) a doc belongs to, usually from a human; ideally the judgements are consistent.

Accuracy Measurement: Confusion Matrix
- Entry (i, j) counts the docs actually in topic i that were put in topic j by the classifier (e.g., an entry of 53 means 53 such docs).
- The confusion matrix is a function of the classifier, the topics, and the test docs.
- For a perfect classifier, all off-diagonal entries should be zero; if there are n docs in category j, then entry (j, j) should be n.
- Straightforward when there is 1 category per document; can be extended to n categories per document.

Confusion Measures (1 class per doc)
- Recall for topic i: fraction of docs in topic i classified correctly.
- Precision for topic i: fraction of docs assigned topic i that are actually about topic i.
- "Correct rate" (1 - error rate): fraction of docs classified correctly.
(A small sketch computing these from a confusion matrix follows below.)
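A sketch of per-topic recall, precision, and correct rate computed from a confusion matrix in the 1-class-per-doc setting; the matrix entries are illustrative.

```python
import numpy as np

# confusion[i, j] = number of docs actually in topic i that the classifier put in topic j.
confusion = np.array([
    [53,  5,  2],
    [ 4, 40,  6],
    [ 1,  3, 36],
])

recall       = np.diag(confusion) / confusion.sum(axis=1)  # per actual topic i
precision    = np.diag(confusion) / confusion.sum(axis=0)  # per assigned topic j
correct_rate = np.trace(confusion) / confusion.sum()       # 1 - error rate
print(recall.round(2), precision.round(2), round(float(correct_rate), 2))
```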
Integrated Evaluation/Optimization
A principled approach to training: optimize the measure that performance is measured with.
- s: vector of classifier decisions; z: vector of true classes.
- h(s, z): the cost of making decisions s when the true assignments are z.

Utility / Cost
One cost function h is based on the contingency table, assuming identical cost for all false positives, etc.:

                     Truth: yes               Truth: no
  Classifier: yes    cost λ11, count A        cost λ12, count B
  Classifier: no     cost λ21, count C        cost λ22, count D

  Cost = λ11·A + λ12·B + λ21·C + λ22·D

For this cost, the optimality criterion is a threshold on the estimated probability of class membership.
- Most common cost: 1 for each error, 0 for each correct decision.
- Product cross-sale: high cost for a false positive, low cost for a false negative.
- Patent search: low cost for a false positive, high cost for a false negative.
(A sketch of the cost computation follows below.)
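A sketch of the contingency-table cost above; the λ values chosen for the cross-sale and patent-search cases are made-up stand-ins for "high" and "low".

```python
def total_cost(A, B, C, D, l11=0.0, l12=1.0, l21=1.0, l22=0.0):
    """Cost = l11*A + l12*B + l21*C + l22*D over the contingency table:
    A: classifier yes / truth yes,  B: classifier yes / truth no (false positive),
    C: classifier no / truth yes (false negative),  D: classifier no / truth no.
    Defaults: cost 1 for each error, 0 for each correct decision."""
    return l11 * A + l12 * B + l21 * C + l22 * D

counts = dict(A=90, B=10, C=10, D=890)
print(total_cost(**counts))                      # 0/1 loss: 20 errors
print(total_cost(**counts, l12=10.0, l21=0.1))   # cross-sale: false positives are expensive
print(total_cost(**counts, l12=0.1, l21=10.0))   # patent search: false negatives are expensive
```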
Are All Optimal Rules of the Form p > θ?
- In the above examples, all you need to do is estimate the probability of class membership.
- Can all problems be solved like this? No! A probability is often not sufficient: the user's decision can depend on the distribution of relevance. Example: an information filter for terrorism.

Naive Bayes / Vector Space Classification / Nearest Neighbor Classification

Recall: Vector Space Representation
- Each doc j is a vector, with one component for each term (= word).
- Normalize to unit length.
- We have a vector space: terms are axes, and the n docs live in this space; even with stemming we may have 10,000+ dimensions, or even 1,000,000+.
(A sketch of building such a vector is below.)
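A minimal sketch of the vector space representation: one component per term, normalized to unit length. The vocabulary and document are toy examples, and raw counts stand in for whatever term weighting is used.

```python
import math
from collections import Counter

def unit_length_vector(doc, vocabulary):
    """One component per term (raw counts here), normalized to unit length."""
    counts = Counter(doc.split())
    vec = [counts[t] for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

vocab = ["wheat", "harvest", "gold", "prices"]
print(unit_length_vector("wheat harvest wheat prices", vocab))
```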
Classification Using Vector Spaces
- Each training doc is a point (vector) labeled by its topic (= class).
- Hypothesis: docs of the same topic form a contiguous region of space.
- Define surfaces to delineate topics in space.
(Figure: topics Government, Science, Arts as regions of a vector space.)
- Given a test doc: figure out which region it lies in and assign the corresponding class.
(Figure: the test doc falls in the Government region, so test doc = Government.)

Binary Classification
- Consider 2-class problems.
- How do we define (and find) the separating surface?
- How do we test which region a test doc is in?

Separation by Hyperplanes
- Assume linear separability for now: in 2 dimensions we can separate by a line; in higher dimensions we need hyperplanes.
- We can find a separating hyperplane by linear programming or with a perceptron; the separator can be expressed as ax + by = c.

Linear Programming / Perceptron
Find a, b, c such that ax + by >= c for red points and ax + by < c for green points. (Relationship to Naive Bayes?) A perceptron sketch follows below.
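A minimal perceptron sketch in two dimensions that searches for a, b, c as described; the points, learning rate, and epoch cap are illustrative, and linear separability is assumed.

```python
def perceptron(points, labels, epochs=1000, lr=1.0):
    """Find a, b, c with a*x + b*y >= c for the +1 class and a*x + b*y < c for the -1 class."""
    a = b = c = 0.0
    for _ in range(epochs):
        errors = 0
        for (x, y), t in zip(points, labels):     # t is +1 or -1
            if t * (a * x + b * y - c) <= 0:      # misclassified (or on the boundary)
                a += lr * t * x
                b += lr * t * y
                c -= lr * t
                errors += 1
        if errors == 0:                           # converged: all points on the right side
            break
    return a, b, c

red   = [(1, 1), (2, 1), (1, 2)]                  # want a*x + b*y >= c
green = [(4, 4), (5, 4), (4, 5)]                  # want a*x + b*y <  c
points = red + green
labels = [1] * len(red) + [-1] * len(green)
print(perceptron(points, labels))
```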
Linear Classifiers
- Many common text classifiers are linear classifiers. Despite this similarity, there are large performance differences.
- For separable problems there is an infinite number of separating hyperplanes. Which one do you choose?
- What do we do for non-separable problems?

Which Hyperplane?
In general, there are lots of possible solutions for a, b, c.

Support Vector Machine (SVM)
- Support vectors; maximize the margin.
- A quadratic programming problem.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- The text classification method du jour; topic of lecture 9.

Example SVM features (category: Interest)
  positive weights: 0.70 prime, 0.67 rate, 0.63 interest, 0.60 rates, 0.46 discount, 0.43 bundesbank, 0.43 baker
  negative weights: -0.71 dlrs, -0.35 world, -0.33 sees, -0.25 year, -0.24 group, -0.24 dlr, -0.24 january
More Than Two Classes
- Any-of (multiclass) classification: for n classes, decompose into n binary problems.
- One-of classification: each document belongs to exactly one class.
- How do we compose separating surfaces into regions? Centroid classification; k nearest neighbor classification.
- Composing surfaces: issues?

Separating Multiple Topics
- Build a separator between each topic and its complementary set (docs from all other topics).
- Given a test doc, evaluate it for membership in each topic.
- Declare membership in topics: for one-of classification, the class with the maximum score/confidence/probability; for multiclass classification, all classes above a threshold.
- Negative examples: formulate as above, except that negative examples for a topic are added to its complementary set.
(Figure: positive and negative examples for one topic.)

Centroid Classification
- Given the training docs for a topic, compute their centroid; we now have a centroid for each topic.
- Given a query doc, assign it to the topic whose centroid is nearest (a sketch follows below).
- Exercise: compare to Rocchio.
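A sketch of centroid classification with numpy: one centroid per topic, and nearest centroid by dot product of unit-length vectors; the vectors and topic names are made up.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def train_centroids(vectors, labels):
    """One centroid per topic: the mean of its (unit-length) training vectors."""
    centroids = {}
    for c in set(labels):
        docs = np.array([v for v, y in zip(vectors, labels) if y == c])
        centroids[c] = normalize(docs.mean(axis=0))
    return centroids

def classify(centroids, query):
    """Assign the query doc to the topic whose centroid is nearest (highest cosine)."""
    q = normalize(np.asarray(query, dtype=float))
    return max(centroids, key=lambda c: float(q @ centroids[c]))

vectors = [np.array([1.0, 0.0, 0.2]), np.array([0.9, 0.1, 0.0]),
           np.array([0.0, 1.0, 0.1]), np.array([0.1, 0.9, 0.0])]
labels  = ["government", "government", "science", "science"]
cents = train_centroids([normalize(v) for v in vectors], labels)
print(classify(cents, [0.8, 0.2, 0.1]))   # government
```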
(Figure: example with the Government, Science, and Arts centroids.)

k Nearest Neighbor Classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d.
- Count the number of documents l in N that belong to c.
- Estimate P(c|d) as l/k (a sketch follows below).
Cover and Hart (1967): asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes error rate.
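A sketch of the kNN estimate P(c|d) = l/k using dot-product similarity of (approximately) unit-length vectors; the training vectors and the choice k = 3 are illustrative.

```python
import numpy as np

def knn_classify(doc, train_vectors, train_labels, k, classes):
    """Estimate P(c|d) as l/k, where l is the number of the k nearest neighbors
    (by dot-product similarity of unit-length vectors) that belong to class c."""
    sims = np.array([doc @ v for v in train_vectors])
    neighbors = np.argsort(-sims)[:k]               # indices of the k nearest neighbors
    return {c: sum(1 for i in neighbors if train_labels[i] == c) / k for c in classes}

train_vectors = [np.array([1.0, 0.0]), np.array([0.9, 0.44]),
                 np.array([0.0, 1.0]), np.array([0.44, 0.9])]
train_labels  = ["wheat", "wheat", "gold", "gold"]
doc = np.array([0.8, 0.6])
print(knn_classify(doc, train_vectors, train_labels, k=3, classes={"wheat", "gold"}))
```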