Recap: Today's Topics
- Feature selection for text classification
- Measuring classification performance
- Nearest neighbor categorization

Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words and more.
- Make using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features.
- Reduce training time: training time for some methods is quadratic or worse in the number of features (e.g., logistic regression).
- Improve generalization: eliminate noise features, avoid overfitting.

Recap: Feature Reduction
Standard ways of reducing the feature space for text:
- Stemming: laugh, laughs, laughing, laughed -> laugh
- Stop word removal: e.g., eliminate all prepositions
- Conversion to lower case
- Tokenization: break on all special characters: fire-fighter -> fire, fighter

Feature Selection (Yang and Pedersen 1997)
Comparison of different selection criteria:
- DF: document frequency
- IG: information gain
- MI: mutual information
- CHI: chi-square
Common strategy: compute the statistic for each term and keep the n terms with the highest value of this statistic.

Information Gain; (Pointwise) Mutual Information
(Formula slides. Pointwise MI(t, c) = log [ P(t, c) / ( P(t) P(c) ) ]; see the comparison with information gain below.)

Chi-Square
Contingency table for a term and a category:

                                        Term present   Term absent
  Document belongs to category               A              B
  Document does not belong to category       C              D

X^2 = N (AD - BC)^2 / ( (A+B) (A+C) (B+D) (C+D) )
- Use either the maximum or the average X^2 over categories.
- What is the value for complete independence?
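To make the chi-square recipe concrete, here is a minimal Python sketch (the counts and term names are invented for illustration, not taken from Yang and Pedersen): it computes X^2 from the A, B, C, D cells for each term and keeps the n highest-scoring terms, following the common strategy above.

```python
def chi_square(A, B, C, D):
    """X^2 for one (term, category) pair from the 2x2 contingency table:
    A = docs in the category containing the term, B = docs in the category without it,
    C = docs outside the category containing the term, D = the rest."""
    N = A + B + C + D
    denom = (A + B) * (A + C) * (B + D) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - B * C) ** 2 / denom

def select_top_terms(term_counts, n):
    """Keep the n terms with the highest chi-square value."""
    scored = {t: chi_square(*cells) for t, cells in term_counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:n]

# Hypothetical (A, B, C, D) counts for three terms against one category.
counts = {
    "wheat": (30, 20, 5, 945),   # strongly associated with the category
    "the":   (50, 0, 948, 2),    # occurs nearly everywhere, ~independent
    "zyxwv": (1, 49, 0, 950),    # very rare term
}
print(select_top_terms(counts, 2))
```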

Document Frequency
- The number of documents a term occurs in.
- Sometimes used for eliminating both very frequent and very infrequent terms.
- How is the document frequency measure different from the other three measures?

Yang & Pedersen: Experiments
Two classification methods:
- kNN (k nearest neighbors; more later)
- Linear Least Squares Fit (a regression method)
Collections:
- Reuters-22173: 92 categories, 16,000 unique terms
- Ohsumed (a subset of MEDLINE): 14,000 categories, 72,000 unique terms
Term weighting: ltc

Yang & Pedersen: Experimental Procedure
- Choose the feature set size.
- Preprocess the collection, discarding non-selected features/words.
- Apply term weighting -> a feature vector for each document.
- Train the classifier on the training set.
- Evaluate the classifier on the test set.

Discussion
- You can eliminate 90% of features for IG, DF, and CHI without decreasing performance.
- In fact, performance increases with fewer features for IG, DF, and CHI.
- Mutual information is very sensitive to small counts.
- IG does best with the smallest number of features.
- Document frequency is close to optimal, and is by far the simplest feature selection method.
- Similar results for LLSF (regression).

Results
- Why is selecting common terms a good strategy?
- IG, DF, and CHI are correlated.

Information Gain vs. Mutual Information
- Information gain is similar to MI for random variables. (Independence?)
- In contrast, pointwise MI ignores non-occurrence of terms.
- E.g., for complete dependence you get P(A, B) / ( P(A) P(B) ) = 1/P(A), which is larger for rare terms than for frequent terms.
- Yang & Pedersen: pointwise MI favors rare terms.
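The claim that pointwise MI is inflated for rare terms can be checked numerically. The small sketch below uses invented probabilities and assumes complete dependence, so P(t) = P(c) = P(t, c) and the PMI reduces to log 1/P(t), which grows as the term gets rarer.

```python
import math

def pmi(p_t_and_c, p_t, p_c):
    """Pointwise mutual information: log P(t,c) / (P(t) P(c))."""
    return math.log(p_t_and_c / (p_t * p_c))

# Complete dependence: the term is present exactly when the doc is in the
# category, so P(t) = P(c) = P(t, c) and PMI = log 1/P(t).
for p_t in (0.1, 0.01, 0.001):   # frequent -> rare term
    print(f"P(t)={p_t:<6} PMI={pmi(p_t, p_t, p_t):.2f}")
```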

Feature Selection: Other Considerations
- Generic vs. class-specific:
  - completely generic (class-independent);
  - a separate feature set for each class;
  - mixed (a la Yang & Pedersen).
- Maintainability over time: is aggressive feature selection good or bad for robustness over time?
- Ideal: optimal features are selected as part of training.

Yang & Pedersen: Limitations
- Don't look at class-specific feature selection.
- Don't look at methods that can't handle high-dimensional spaces.
- Evaluate category ranking (as opposed to classification accuracy).

Feature Selection: Other Methods
- Stepwise term selection (forward, backward); expensive: needs on the order of n^2 iterations of training.
- Term clustering.
- Dimension reduction: PCA / SVD.

Word Rep. vs. Dimension Reduction
- Word representations: one dimension for each word (binary, count, or weight).
- Dimension reduction: each dimension is a unique linear combination of all words (in the linear case).
- Dimension reduction is good for generic topics ("politics"), bad for specific classes ("ruanda"). Why?
- SVD/PCA is computationally expensive, and more complex to implement.
- No clear examples of higher performance through dimension reduction.
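For the PCA/SVD option above, a minimal NumPy sketch (the tiny term-document matrix is invented) shows how documents get re-expressed in k dimensions, each of which is a linear combination of all terms:

```python
import numpy as np

# Toy term-document matrix: rows = 5 terms, columns = 4 documents (counts invented).
X = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2],
              [1, 1, 1, 1]], dtype=float)

# Truncated SVD: keep k dimensions. Each document is re-expressed in a
# k-dimensional space whose axes are linear combinations of all terms.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
docs_reduced = np.diag(s[:k]) @ Vt[:k, :]   # k x n_docs representation
print(docs_reduced.round(2))
```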

Measuring Classification: Figures of Merit
- Accuracy of classification: the main evaluation criterion in academia (more in a moment).
- Speed of training the statistical classifier.
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions: kNN, complex preprocessing requirements.
- Effort in creating the training set (human hours/topic): more on this in Lecture 9 (Active Learning).

Measures of Accuracy
- Error rate: not a good measure for small classes. Why?
- Precision/recall for classification decisions.
- F1 measure: the harmonic mean of precision and recall, 2/F1 = 1/P + 1/R.
- Breakeven point.
- Correct estimate of the size of a category. Why is this different?
- Precision/recall for ranking classes.
- Stability over time / concept drift.
- Utility.
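A small sketch of the F1 computation (the precision and recall values below are invented):

```python
def f1(precision, recall):
    """F1 = harmonic mean of P and R, i.e. 2/F1 = 1/P + 1/R."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.5, 0.9))   # 0.642..., always closer to the smaller of the two values
```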

Precision/Recall for Ranking Classes
Example: "Bad wheat harvest in Turkey"
- True categories: Wheat, Turkey
- Ranked category list: 0.9: turkey, 0.7: poultry, 0.5: armenia, 0.4: barley, 0.3: georgia
- Precision at 5: 1/5 = 0.2 (one of the top five categories is correct); Recall at 5: 1/2 = 0.5 (one of the two true categories is found)
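The ranking example can be reproduced with a short sketch; the category names are copied from the example above, and precision/recall "at 5" is computed over the top five entries of the ranked list.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k ranked categories."""
    top_k = ranked[:k]
    hits = sum(1 for c in top_k if c in relevant)
    return hits / k, hits / len(relevant)

ranked = ["turkey", "poultry", "armenia", "barley", "georgia"]
relevant = {"wheat", "turkey"}
print(precision_recall_at_k(ranked, relevant, 5))   # (0.2, 0.5)
```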

Precision/Recall for Ranking Classes
- Consider problems with many categories (> 10).
- Use a method that returns scores comparable across categories (not: Naive Bayes).
- Rank the categories and compute average precision/recall (or another measure characterizing the precision/recall curve).
- A good measure for interactive support of human categorization.
- Useless for an "autonomous" system (e.g., a filter on a stream of newswire stories).

Concept Drift
- Categories change over time.
- Example: "president of the united states". In 1999, clinton is a great feature; in 2002, clinton is a bad feature.
- One measure of a text classification system is how well it protects against concept drift.
- Feature selection: good or bad for protecting against concept drift?

Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.

Micro- vs. Macro-Averaging: Example

Class 1:
                    Truth: yes   Truth: no
  Classifier: yes       10           10
  Classifier: no        10          970

Class 2:
                    Truth: yes   Truth: no
  Classifier: yes       90           10
  Classifier: no        10          890

Microaveraged table:
                    Truth: yes   Truth: no
  Classifier: yes      100           20
  Classifier: no        20         1860

- Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
- Microaveraged precision: 100/120 = 0.83
- Why this difference?
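The macro/micro numbers above can be reproduced with a short sketch; the per-class counts are the ones from the example.

```python
def precision(tp, fp):
    return tp / (tp + fp)

# Per-class contingency counts from the example: (TP, FP, FN, TN).
class1 = (10, 10, 10, 970)
class2 = (90, 10, 10, 890)

# Macroaveraging: average the per-class precisions.
macro = (precision(class1[0], class1[1]) + precision(class2[0], class2[1])) / 2

# Microaveraging: pool the counts into one table, then compute precision once.
tp = class1[0] + class2[0]
fp = class1[1] + class2[1]
micro = precision(tp, fp)

print(f"macro precision = {macro:.2f}, micro precision = {micro:.2f}")  # 0.70, 0.83
```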

Reuters
- Newswire text. Statistics (vary according to the version used):
- Training set: 9,610 documents; test set: 3,662 documents.
- 50% of documents have no category assigned.
- Average document length: 90.6.
- Number of classes: 92. Example classes: currency exchange, wheat, gold.
- Maximum number of classes assigned: 14.
- Average number of classes assigned: 1.24 for docs with at least one category.

Reuters (continued)
- Only about 10 of the 92 categories are large.
- Microaveraging measures performance on the large categories.

Factors Affecting Measures
- Variability of the data: document size/length, quality/style of authorship, uniformity of vocabulary.
- Variability of the "truth" / gold standard: we need a definitive judgement on which topic(s) a doc belongs to, usually from a human; ideally, judgements are consistent.

Accuracy Measurement: Confusion Matrix
(Figure: a matrix with rows = actual topic, columns = topic assigned by the classifier; an (i, j) entry of, say, 53 means that 53 of the docs actually in topic i were put in topic j by the classifier.)

Confusion Matrix
- A function of the classifier, the topics, and the test docs.
- For a perfect classifier, all off-diagonal entries should be zero.
- For a perfect classifier, if there are n docs in category j, then entry (j, j) should be n.
- Straightforward when there is one category per document; can be extended to n categories per document.

Confusion Measures (1 class per doc)
Writing c_ij for the (i, j) entry of the confusion matrix:
- Recall for topic i: the fraction of docs in topic i classified correctly, c_ii / sum_j c_ij.
- Precision for topic i: the fraction of docs assigned topic i that are actually about topic i, c_ii / sum_j c_ji.
- "Correct rate" (1 - error rate): the fraction of docs classified correctly, sum_i c_ii / sum_{i,j} c_ij.
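A sketch of these three measures computed from a confusion matrix C, where C[i, j] counts documents actually in topic i that the classifier assigned to topic j (the 3x3 matrix is invented):

```python
import numpy as np

# C[i, j] = number of docs actually in topic i that were assigned topic j.
C = np.array([[53,  2,  5],
              [ 3, 40,  7],
              [ 4,  1, 35]])

recall = np.diag(C) / C.sum(axis=1)      # per topic: correct / actually in topic
precision = np.diag(C) / C.sum(axis=0)   # per topic: correct / assigned to topic
correct_rate = np.trace(C) / C.sum()     # 1 - error rate

print(recall.round(2), precision.round(2), round(correct_rate, 2))
```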

Integrated Evaluation/Optimization
- A principled approach to training: optimize the measure that performance is measured with.
- s: vector of classifier decisions; z: vector of true classes.
- h(s, z) = the cost of making decisions s for true assignments z.

Utility / Cost
One cost function h is based on the contingency table. Assume an identical cost for all false positives, etc.:

                    Truth: yes                 Truth: no
  Classifier: yes   cost λ11, count A          cost λ12, count B
  Classifier: no    cost λ21, count C          cost λ22, count D

Cost C = λ11 * A + λ12 * B + λ21 * C + λ22 * D
For this cost, the optimality criterion is to make, for each document, the decision with the lower expected cost.

Utility / Cost (continued)
- Cost matrix: λ11 (yes/yes), λ12 (yes/no), λ21 (no/yes), λ22 (no/no).
- Most common cost: 1 for an error, 0 for a correct decision; with this cost, the criterion reduces to assigning the class exactly when P(class | doc) > 0.5.
- Product cross-sale: high cost for a false positive, low cost for a false negative.
- Patent search: low cost for a false positive, high cost for a false negative.
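A sketch of this cost function; the λ values and counts below are invented, with an asymmetric cost of the cross-sale flavour (expensive false positives):

```python
# Contingency counts: A = yes/yes, B = yes/no, C = no/yes, D = no/no.
A, B, C, D = 90, 10, 10, 890

# Cost matrix entries lambda_ij: 0 for correct decisions, asymmetric for errors
# (here false positives cost more, as in the product cross-sale example).
lam11, lam12, lam21, lam22 = 0.0, 5.0, 1.0, 0.0

cost = lam11 * A + lam12 * B + lam21 * C + lam22 * D
print(cost)   # 5.0*10 + 1.0*10 = 60.0
```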

Are All Optimal Rules of the Form p > θ?
- In the examples above, all you need to do is estimate the probability of class membership.
- Can all problems be solved like this? No!
- Probability is often not sufficient; the user's decision can depend on the distribution of relevance.
- Example: an information filter for terrorism.

Naive Bayes / Vector Space Classification / Nearest Neighbor Classification

Recall: Vector Space Representation
- Each doc j is a vector, with one component for each term (= word).
- Normalize each vector to unit length.
- We have a vector space: terms are the axes and the n docs live in this space; even with stemming, there may be 10,000+ dimensions, or even 1,000,000+.
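A minimal sketch of this representation: term-count vectors over a tiny, hypothetical vocabulary, normalized to unit length.

```python
import math
from collections import Counter

vocabulary = ["government", "science", "arts", "budget"]   # hypothetical term axes

def doc_vector(tokens):
    """Term-count vector over the vocabulary, normalized to unit length."""
    counts = Counter(tokens)
    v = [counts[t] for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

print(doc_vector("government budget government science".split()))
```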

Classification Using Vector Spaces
- Each training doc is a point (vector) labeled by its topic (= class).
- Hypothesis: docs of the same topic form a contiguous region of the space.
- Define surfaces to delineate topics in the space.

Topics in a Vector Space
(Figure: regions labeled Government, Science, Arts.)
- Given a test doc, figure out which region it lies in and assign the corresponding class.

Test Doc = Government
(Figure: the test doc falls in the Government region.)

Binary Classification
- Consider 2-class problems.
- How do we define (and find) the separating surface?
- How do we test which region a test doc is in?

Separation by Hyperplanes
- Assume linear separability for now: in 2 dimensions we can separate by a line; in higher dimensions we need hyperplanes.
- We can find a separating hyperplane by linear programming (e.g., the perceptron); the separator can be expressed as ax + by = c.

Linear Programming / Perceptron
- Find a, b, c such that ax + by >= c for red points and ax + by <= c for green points.
- Relationship to Naive Bayes?
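A minimal perceptron sketch for the 2-D case above, searching for a, b, c with ax + by > c on one side and ax + by < c on the other; the labelled points are invented and linearly separable.

```python
# "Red" points labelled +1, "green" points labelled -1 (invented, linearly separable).
points = [((2.0, 3.0), +1), ((3.0, 2.5), +1), ((0.5, 1.0), -1), ((1.0, 0.5), -1)]

a = b = c = 0.0
for _ in range(100):                            # a few passes suffice here
    updated = False
    for (x, y), label in points:
        if label * (a * x + b * y - c) <= 0:    # misclassified (or on the boundary)
            a += label * x                      # standard perceptron update
            b += label * y
            c -= label                          # bias update
            updated = True
    if not updated:
        break                                   # all points correctly separated

print(a, b, c)
for (x, y), label in points:
    print((x, y), label, a * x + b * y - c > 0)   # True for red, False for green
```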

Linear Classifiers
- Many common text classifiers are linear classifiers.
- Despite this similarity, there are large performance differences.
- For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
- What do you do for non-separable problems?

Which Hyperplane?
In general, there are lots of possible solutions for a, b, c.

Support Vector Machine (SVM)
- Support vectors; maximize the margin.
- A quadratic programming problem.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- The text classification method du jour; topic of Lecture 9.

Category "Interest": Example SVM Features

  weight w_i   term t_i
     0.70      prime
     0.67      rate
     0.63      interest
     0.60      rates
     0.46      discount
     0.43      bundesbank
     0.43      baker
    -0.71      dlrs
    -0.35      world
    -0.33      sees
    -0.25      year
    -0.24      group
    -0.24      dlr
    -0.24      january

More Than Two Classes
- Any-of (multiclass) classification: for n classes, decompose into n binary problems.
- One-of classification: each document belongs to exactly one class.
- How do we compose separating surfaces into regions? Centroid classification; k nearest neighbor classification.
- Composing surfaces: issues?

Separating Multiple Topics
- Build a separator between each topic and its complementary set (docs from all other topics).
- Given a test doc, evaluate it for membership in each topic.
- Declare membership in topics: for one-of classification, the class with the maximum score/confidence/probability; for multiclass classification, all classes above a threshold.

Negative Examples
- Formulate as above, except that negative examples for a topic are added to its complementary set.
(Figure: positive and negative examples.)

Centroid Classification
- Given the training docs for a topic, compute their centroid.
- We now have a centroid for each topic.
- Given a query doc, assign it to the topic whose centroid is nearest.
- Exercise: compare to Rocchio.
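A sketch of centroid classification; the document vectors and topic labels are invented, and the nearest centroid is taken by Euclidean distance.

```python
import numpy as np

def train_centroids(docs, labels):
    """One centroid per topic: the mean of that topic's training vectors."""
    centroids = {}
    for topic in set(labels):
        vectors = np.array([d for d, l in zip(docs, labels) if l == topic])
        centroids[topic] = vectors.mean(axis=0)
    return centroids

def classify(centroids, query):
    """Assign the query doc to the topic whose centroid is nearest."""
    return min(centroids, key=lambda t: np.linalg.norm(centroids[t] - query))

# Invented 3-D document vectors for two topics.
docs = [np.array(v, dtype=float) for v in
        [(0.9, 0.1, 0.0), (0.8, 0.2, 0.1), (0.1, 0.1, 0.9), (0.0, 0.2, 0.8)]]
labels = ["government", "government", "science", "science"]

centroids = train_centroids(docs, labels)
print(classify(centroids, np.array([0.85, 0.1, 0.05])))   # -> government
```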

Example
(Figure: centroids and regions for Government, Science, Arts.)

k Nearest Neighbor Classification
To classify document d into class c:
- Define the k-neighborhood N as the k nearest neighbors of d.
- Count the number of documents l in N that belong to c.
- Estimate P(c|d) as l/k.
- Cover and Hart 1967: asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes error rate.
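A sketch of the kNN rule: score training docs by dot product with d (cosine similarity if the vectors are unit length), take the k nearest, and estimate P(c|d) = l/k. The training vectors and labels below are invented.

```python
import numpy as np
from collections import Counter

def knn_posterior(train_vecs, train_labels, d, k):
    """Estimate P(c|d) = l/k for every class c seen among the k nearest neighbors."""
    sims = train_vecs @ d                  # similarity of d to every training doc
    nearest = np.argsort(-sims)[:k]        # indices of the k most similar docs
    counts = Counter(train_labels[i] for i in nearest)
    return {c: l / k for c, l in counts.items()}

# Invented document vectors and labels.
train_vecs = np.array([[0.9, 0.1, 0.0],
                       [0.8, 0.2, 0.1],
                       [0.7, 0.3, 0.0],
                       [0.1, 0.1, 0.9],
                       [0.0, 0.2, 0.8]])
train_labels = ["gov", "gov", "gov", "sci", "sci"]

print(knn_posterior(train_vecs, train_labels, np.array([0.85, 0.1, 0.05]), k=3))
```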
