Bayesian Classification, KNN, Feature Selection, and Evaluation
陳翀 (Ref: Peng Bo @ NC&IS @ PKU, WBIA lecture)
Lecture notes for the seminar "Statistical Machine Learning and Data Mining: Techniques and Methods"

Outline
- Text categorization: problem definition
- Naive Bayes classifier
- k-nearest-neighbor classifier
- Evaluation

Categorization: Definition
Given:
- a description of an instance, x ∈ X, where X is the instance language or instance space (issue: how to represent text documents), and
- a fixed set of categories C = {c1, c2, …, cn},
determine the category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C. We want to know how to build categorization functions ("classifiers").

Text Categorization Examples
Assign labels to each document or web page:
- Labels are most often topics, such as Yahoo categories, e.g. "finance", "sports", "news > world > asia > business".
- Labels may be genres, e.g. "editorials", "movie-reviews", "news".
- Labels may be opinions, e.g. "like", "hate", "neutral".
- Labels may be domain-specific and binary, e.g. "interesting-to-me" vs. "not-interesting-to-me", "spam" vs. "not-spam", "contains adult language" vs. "doesn't".

Classification Methods
- Manual classification: used by Yahoo!, Looksmart, ODP, Medline. Accurate, but expensive to scale.
- Automatic document classification:
  - hand-coded rule-based systems (spam mail filters, …);
  - supervised learning of a document-label assignment function. No free lunch: requires hand-classified training data.
- Note that many commercial systems use a mixture of methods.

Naive Bayes

Bernoulli trial
A Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".

Binomial distribution
The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

Multinomial distribution
The multinomial distribution is a generalization of the binomial distribution: each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, …, pk, and there are n independent trials. We can use a random variable Xi to indicate the number of times outcome number i was observed over the n trials.
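For reference, the two probability mass functions in LaTeX notation (standard results; the slides' own formula images did not survive extraction):

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \dots, n

P(X_1 = n_1, \dots, X_k = n_k) = \frac{n!}{n_1! \cdots n_k!} \, p_1^{n_1} \cdots p_k^{n_k}, \qquad \textstyle\sum_i n_i = n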
Bayes' Rule
For a hypothesis h and data D (reconstructing the slide's prior/posterior/likelihood diagram):

P(h | D) = P(D | h) P(h) / P(D)

where P(h) is the prior, P(h | D) the posterior, and P(D | h) the likelihood.

Use Bayes' Rule to gamble
Someone draws an envelope at random and offers to sell it to you. How much should you pay? Before deciding, you are allowed to see one bead drawn from the envelope. Suppose it's red: how much should you pay?

Prosecutor's fallacy
You win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials. At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely. The fallacy is to confuse P(evidence | innocence), which is tiny, with P(innocence | evidence); by Bayes' rule the two differ by the ratio of the priors.

Likelihood
A likelihood function is a conditional probability function considered as a function of its second argument, with its first argument held fixed. Given a parameterized family of probability density functions f(x | θ), where θ is the parameter, the likelihood function is

L(θ | x) = f(x | θ)

where x is the observed outcome of an experiment. When f(x | θ) is viewed as a function of x with θ fixed, it is a probability density function; when viewed as a function of θ with x fixed, it is a likelihood function.
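A small worked example, assuming the usual coin-flip setting (not spelled out on the slides): after observing k successes in n Bernoulli trials with unknown success probability θ,

L(θ | x) = θ^k (1 - θ)^{n-k}

and maximizing \log L(θ | x) gives the maximum-likelihood estimate \hat{θ} = k / n.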
Maximum a posteriori hypothesis

h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h) / P(D) = argmax_{h ∈ H} P(D | h) P(h)

since P(D) is constant.

Maximum likelihood hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D | h) term:

h_ML = argmax_{h ∈ H} P(D | h)

Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values ⟨x1, x2, …, xn⟩, into one of the classes cj ∈ C:

c_MAP = argmax_{cj ∈ C} P(cj | x1, …, xn) = argmax_{cj ∈ C} P(x1, …, xn | cj) P(cj)

Naive Bayes Assumption
- P(cj) can be estimated from the frequency of classes in the training examples.
- P(x1, x2, …, xn | cj) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples was available.
- Conditional independence assumption: assume that the probability of observing the conjunction of attributes equals the product of the individual probabilities P(xi | cj):

P(x1, x2, …, xn | cj) = Π_i P(xi | cj)

(Figure: a naive Bayes network with class Flu and features fever, sinus, cough, runny nose, muscle-ache.)

The Naive Bayes Classifier
Under the conditional independence assumption, features detect term presence and are independent of each other given the class. This model is appropriate for binary variables: the multivariate binomial model.

Learning the Model
First attempt: maximum likelihood estimates, i.e. simply use the frequencies in the data:

P̂(cj) = N(C = cj) / N,   P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

Problem with maximum likelihood
What if we have seen no training cases where a patient had no flu and muscle aches? Then P̂(muscle-ache | ¬flu) = 0, and zero probabilities cannot be conditioned away, no matter the other evidence!

Smoothing to eliminate zeros
Add-one (Laplace) smoothing, with k = the number of values of Xi:

P̂(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + k)

This acts as a uniform prior (each attribute value occurs once for each class) that is then updated as evidence from the training data comes in.
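A minimal sketch of the add-one rule in Python, on made-up flu counts (the slides give no numbers):

```python
from collections import Counter

def estimate_cond_probs(examples, num_values=2):
    """Laplace-smoothed estimates P(x | c) from (feature_value, class) pairs."""
    pair_counts = Counter(examples)                  # N(X = x, C = c)
    class_counts = Counter(c for _, c in examples)   # N(C = c)
    def p(x, c):
        return (pair_counts[(x, c)] + 1) / (class_counts[c] + num_values)
    return p

# muscle-ache (1/0) observed with flu / no-flu; no (1, 'no-flu') case is seen,
# yet smoothing keeps P(ache=1 | no-flu) > 0 instead of a fatal zero.
data = [(1, 'flu'), (1, 'flu'), (0, 'flu'), (0, 'no-flu'), (0, 'no-flu')]
p = estimate_cond_probs(data)
print(p(1, 'no-flu'))  # (0 + 1) / (2 + 2) = 0.25, not 0.0
```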
Document Generative Model
"Love is patient, love is kind."
Basic: bag of words. Two generative models for a document:
- A binary independence model (multivariate binomial generation): feature Xi is a term; its value Xi = 1 or 0 indicates whether term Xi is present in the document or not.
- A multinomial unigram language model (multinomial generation): feature Xi is a term position; the value of Xi is the term at position i; position independent.
(Figure: the example sentence reduced to the bag of words {love, is, patient, kind}.)
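A minimal sketch of the two representations for the slide's example sentence (the four-word vocabulary is assumed from the figure):

```python
import re

SENTENCE = "Love is patient, love is kind."
VOCAB = ["love", "is", "patient", "kind"]

tokens = re.findall(r"[a-z]+", SENTENCE.lower())

# Multivariate Bernoulli view: one 0/1 feature per vocabulary term (presence).
bernoulli = {w: int(w in set(tokens)) for w in VOCAB}

# Multinomial view: one feature per position; its value is the term there.
multinomial = list(tokens)

print(bernoulli)    # {'love': 1, 'is': 1, 'patient': 1, 'kind': 1}
print(multinomial)  # ['love', 'is', 'patient', 'love', 'is', 'kind']
```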
Bernoulli Naive Bayes Classifiers (multivariate binomial model)
- One feature Xw for each word in the dictionary; Xw = true in document d if w appears in d.
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.

Multinomial Naive Bayes Classifiers
- Multinomial = class-conditional unigram: one feature Xi for each word position in the document; the feature's values are all words in the dictionary; the value of Xi is the word in position i.
- Naive Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about words in other positions. Still too many possibilities, so assume that classification is independent of the positions of the words.
- Second assumption: word appearance does not depend on position, i.e. P(Xi = w | c) = P(Xj = w | c) for all positions i, j, words w, and classes c. Just have one multinomial feature predicting all words, and use the same parameters for each position.

Parameter estimation
- Binomial model: P̂(Xw = true | cj) = fraction of documents of topic cj in which word w appears.
- Multinomial model: P̂(Xi = w | cj) = fraction of times in which word w appears across all documents of topic cj. One can create a mega-document for topic j by concatenating all documents on this topic, and use the frequency of w in the mega-document.

Naive Bayes algorithm (multinomial model) / (Bernoulli model)
(The slides' pseudocode figures; a runnable multinomial sketch follows below.)

NB Example: c(5) = ?
- Multinomial NB classifier: from the feature likelihood estimates and the posterior, the result is c(5) = China.
- Bernoulli NB classifier: from the feature likelihood estimates and the posterior, the result is c(5) ≠ China.
Classification: multinomial vs. multivariate binomial? Multinomial is in general better.
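The slides' training documents for the c(5) example did not survive extraction; the sketch below assumes the matching worked example from IIR Ch. 13 (cited in the Readings), where four training documents labeled China / not-China are used to classify document 5:

```python
import math
from collections import Counter

# Hypothetical training data, borrowed from the IIR Ch. 13 worked example.
train = [("Chinese Beijing Chinese", "China"),
         ("Chinese Chinese Shanghai", "China"),
         ("Chinese Macao", "China"),
         ("Tokyo Japan Chinese", "not-China")]
test = "Chinese Chinese Chinese Tokyo Japan"  # document 5

docs = [(d.split(), c) for d, c in train]
classes = {c for _, c in docs}
vocab = {w for d, _ in docs for w in d}

def multinomial_nb(test_tokens):
    scores = {}
    for c in classes:
        class_docs = [d for d, cc in docs if cc == c]
        prior = len(class_docs) / len(docs)
        mega = Counter(w for d in class_docs for w in d)  # "mega-document" counts
        total = sum(mega.values())
        # add-one smoothing over the vocabulary, summed in log space
        log_lik = sum(math.log((mega[w] + 1) / (total + len(vocab)))
                      for w in test_tokens)
        scores[c] = math.log(prior) + log_lik
    return max(scores, key=scores.get)

print(multinomial_nb(test.split()))  # -> 'China'
```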
Feature Selection

Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more.
- Feature selection may make using a particular classifier feasible: some classifiers can't deal with 100,000s of features.
- It reduces training time: training time for some methods is quadratic or worse in the number of features.
- It can improve generalization (performance): it eliminates noise features and avoids overfitting.

Feature selection: how?
Two ideas:
- Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? (chi-square test)
- Information theory: how much information does the value of one categorical variable give you about the value of another? (mutual information)
They're similar, but χ² measures confidence in association (based on available statistics), while MI measures the extent of association (assuming perfect knowledge of the probabilities).

Pearson's chi-square test
Example: is a six-sided die fair? Null hypothesis: the die is fair. Under it there is a theoretical frequency for each possible outcome, and chi-square is calculated as

χ² = Σ_i (O_i − E_i)² / E_i

where O_i is the observed frequency and E_i the theoretical frequency.

χ² statistic (CHI)
Null hypothesis: the events "term" and "class" are independent. Example, with observed counts f_o and expected counts f_e in parentheses:

                 Class = auto    Class ≠ auto
Term = jaguar        2 (0.25)        3 (4.75)
Term ≠ jaguar      500 (502)     9500 (9498)

The null hypothesis is rejected with confidence .999, since χ² ≈ 12.9 > 10.83 (the value for .999 confidence).
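A short sketch recomputing χ² from the table's observed counts, with the expected counts derived from the marginals:

```python
# chi2 = sum of (f_o - f_e)^2 / f_e over the jaguar/auto table on the slide.
observed = {('jaguar', 'auto'): 2,    ('jaguar', 'not'): 3,
            ('~jaguar', 'auto'): 500, ('~jaguar', 'not'): 9500}

N = sum(observed.values())  # 10005
term_tot = {t: sum(v for (tt, _), v in observed.items() if tt == t)
            for t in ('jaguar', '~jaguar')}
class_tot = {c: sum(v for (_, cc), v in observed.items() if cc == c)
             for c in ('auto', 'not')}

chi2 = sum((fo - term_tot[t] * class_tot[c] / N) ** 2
           / (term_tot[t] * class_tot[c] / N)
           for (t, c), fo in observed.items())
print(round(chi2, 1))  # ~12.8 (the slide rounds to 12.9) > 10.83, so reject
```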
Another example: the class "poultry" and the term "export". Count the number of documents with each of the four possible term/class combinations.

χ² statistic (CHI): 2×2 shortcut
There is a simpler formula for the 2×2 case. With A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D:

χ²(t, c) = N (AD − CB)² / [(A + C)(B + D)(A + B)(C + D)]

Feature selection via Mutual Information
In the training set, choose the k words which best discriminate (give most information on) the categories. MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c. The mutual information between a word w and a class c, over the presence indicators e_w and e_c, is

I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(e_w, e_c) log [ P(e_w, e_c) / (P(e_w) P(e_c)) ]

Feature selection via MI (contd.)
For each category we build a list of the k most discriminating terms. For example (on 20 Newsgroups):
- sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
- rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
Greedy: does not account for correlations between terms. Why?

Feature Selection: comparison
- Mutual information: clear information-theoretic interpretation.
- Chi-square: statistical foundation, but may select more rare terms than MI.
- Just use the commonest terms? No particular foundation, but in practice this is often 90% as good.

Feature selection for NB
In general, feature selection is necessary for the binomial (Bernoulli) NB model; otherwise you suffer from noise and multi-counting. "Feature selection" really means something different for multinomial NB: it means dictionary truncation, since the multinomial NB model only has one feature.
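A sketch of the same quantity in Python (base-2 logs; the counts reuse the jaguar/auto table above):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected MI between a term-presence indicator and a class indicator.
    n11 = #(term, class), n10 = #(term, ~class),
    n01 = #(~term, class), n00 = #(~term, ~class)."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for (et, ec), k in {(1, 1): n11, (1, 0): n10,
                        (0, 1): n01, (0, 0): n00}.items():
        if k == 0:
            continue  # the limit of p * log(...) as p -> 0 is 0
        p_joint = k / n
        p_t = (n11 + n10) / n if et else (n01 + n00) / n
        p_c = (n11 + n01) / n if ec else (n10 + n00) / n
        mi += p_joint * math.log2(p_joint / (p_t * p_c))
    return mi

print(mutual_information(2, 3, 500, 9500))  # small but positive association
```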
K-Nearest Neighbors

Recall: Vector Space Representation
- Each document is a vector, one component for each term (= word); normalize to unit length.
- High-dimensional vector space: terms are axes; 10,000+ dimensions, or even 100,000+; docs are vectors in this space.

Classes in a Vector Space / Classification Using Vector Spaces
(Figure: Government, Science, and Arts documents as regions of the vector space.)
- Each training doc is a point (vector) labeled by its topic (= class).
- Hypothesis: docs of the same class form a contiguous region of space; we define surfaces to delineate classes in space. Is the similarity hypothesis true in general?
- Test document = Government? (Figure: the test point falls in the Government region.)

k Nearest Neighbor Classification
To classify document d into class c: define the k-neighborhood N as the k nearest neighbors of d; count the number i of documents in N that belong to c; estimate P(c | d) as i/k; choose as class argmax_c P(c | d), i.e. the majority class. Example: k = 6 (6NN); P(science | d)? (Figure.)

Nearest-Neighbor Learning Algorithm
- Learning is just storing the representations of the training examples in D.
- Testing instance x: compute the similarity between x and all examples in D; assign x the category of the most similar example in D.
- Does not explicitly compute a generalization or category prototypes. Also called: case-based learning, memory-based learning, lazy learning.

Why k?
Using only the closest example to determine the categorization is subject to errors due to a single atypical example, or noise (i.e. error) in the category label of a single training example. A more robust alternative is to find the k most-similar examples and return the majority category of these k examples. The value of k is typically odd to avoid ties; 3 and 5 are most common.

kNN decision boundaries
(Figure.) Boundaries are in principle arbitrary surfaces, but usually polyhedra.

Similarity Metrics
- The nearest-neighbor method depends on a similarity (or distance) metric.
- Simplest for a continuous m-dimensional instance space: Euclidean distance.
- Simplest for an m-dimensional binary instance space: Hamming distance (the number of feature values that differ).
- For text, cosine similarity of tf.idf-weighted vectors is typically most effective. (Figure: illustration of 3 nearest neighbors in a text vector space.)

Nearest Neighbor with Inverted Index
Naively, finding nearest neighbors requires a linear search through the |D| documents in the collection. But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query against a database of training documents, so standard vector-space inverted-index methods can be used. Testing time: O(B|Vt|), where B is the average number of training documents in which a test-document word appears. Typically B << |D|.

kNN: Discussion
- No feature selection necessary; no training necessary.
- Scales well with a large number of classes: you don't need to train n classifiers for n classes.
- Classes can influence each other: small changes to one class can have a ripple effect.
- Scores can be hard to convert to probabilities.
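A minimal cosine-similarity kNN sketch on a made-up four-document training set (plain tf instead of the tf.idf weighting the slides recommend, to keep it self-contained):

```python
import math
from collections import Counter

def unit_tf(text):
    """Toy doc vector: raw term frequencies, normalized to unit length."""
    tf = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {w: v / norm for w, v in tf.items()}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def knn_classify(test_text, train, k=3):
    x = unit_tf(test_text)
    neighbors = sorted(train, key=lambda dc: cosine(x, unit_tf(dc[0])),
                       reverse=True)[:k]
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]  # majority class among the k nearest

# hypothetical mini training set
train = [("budget tax policy vote", "Government"),
         ("election senate policy", "Government"),
         ("quantum physics experiment", "Science"),
         ("gallery painting sculpture", "Arts")]
print(knn_classify("senate vote on tax policy", train, k=3))  # -> 'Government'
```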
kNN vs. Naive Bayes: Bias vs. Variance
Choosing the correct model capacity is a bias/variance tradeoff, with variance ≈ capacity:
- kNN has high variance and low bias (infinite memory).
- NB has low variance and high bias (the decision surface has to be linear: a hyperplane).

Summary
- Categorization: training data, over-fitting and generalization
- Naive Bayes: Bayesian methods, Bernoulli NB classifier, multinomial NB classifier
- k-nearest neighbor; bias vs. variance
- Feature selection: chi-square test, mutual information

Readings
[1] IIR, Ch. 13 and Ch. 14.2.
[2] Y. Yang and X. Liu, "A re-examination of text categorization methods," Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), 1999.

Classification Evaluation

Evaluation: Classic Reuters Data Set
The most (over)used data set: 21,578 documents, with 9,603 training and 3,299 test articles (ModApte split), in 118 categories. An article can be in more than one category, so we learn 118 binary category distinctions. The average document has about 90 types and 200 tokens; the average number of classes assigned is 1.24 for docs with at least one category. Only about 10 out of the 118 categories are large. Common categories (#train, #test):
- Earn (2877, 1087)
- Acquisitions (1650, 179)
- Money-fx (538, 179)
- Grain (433, 149)
- Crude (389, 189)
- Trade (369, 119)
- Interest (347, 131)
- Ship (197, 89)
- Wheat (212, 71)
- Corn (182, 56)

Reuters Text Categorization dataset (Reuters-21578): sample document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE>CHICAGO, March 2 -</DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>

New Reuters: RCV1 has 810,000 docs. (Figure: top topics in Reuters RCV1.)

北大天網(wǎng) (PKU Tianwang): Chinese Web Page Classification
By mobilizing dozens of students from different majors, a large-scale Chinese web page sample set based on a hierarchical model was assembled by hand. It contains 12,336 training web page instances and 3,269 test instances, distributed over 12 top-level classes and 733 categories in total; each category has on average 17 training instances and 4.5 test instances. Tianwang provides the sample set to interested peers free of charge (Yanqiong product number: YQ-WEBENCH-V0.8). It is used for the classification evaluation at SEWM, the national conference on search engines and web mining of the Chinese Information Retrieval Forum.

Table 11-1: distribution of categories and instances in the sample set

No.  Class                                  #Categories  #Train  #Test
1    Humanities & Arts (人文與藝術(shù))                  24     419    110
2    News & Media (新聞與媒體)                        7     125     19
3    Business & Economy (商業(yè)與經(jīng)濟)                 48     839    214
4    Entertainment & Leisure (娛樂與休閑)            88    1510    374
5    Computers & Internet (計算機與因特網(wǎng))           58     925    238
6    Education (教育)                                18     286     85
7    Regional Cultures (各國風情)                    53     891    235
8    Natural Science (自然科學)                     113    1892    514
9    Government & Politics (政府與政治)              18     288     84
10   Social Science (社會科學)                      104    1765    479
11   Health & Medicine (醫(yī)療與健康)                 136    2295    616
12   Society & Culture (社會與文化)                  66    1101    301
     Total                                          733   12336   3269
Measuring Classification: Figures of Merit
- Accuracy of classification: the main evaluation criterion in academia (more in a moment).
- Speed of training a statistical classifier: some methods are very cheap, some very costly.
- Speed of classification (docs/hour): no big differences for most algorithms; exceptions: kNN, complex preprocessing requirements.
- Effort in creating the training set or hand-built classifier: human hours per topic.

In the real world, economic measures. Your choices are:
- Do no classification: that has a cost (hard to compute).
- Do it all manually: an easy-to-compute cost, if you are doing it like that now.
- Do it all with an automatic classifier: mistakes have a cost.
- Do it with a combination of automatic classification and manual review of uncertain/difficult/"new" cases.
Commonly the last method is most cost-efficient and is adopted.

Concept Drift
Categories change over time. Example: "president of the united states". In 1999, "clinton" is a great feature; in 2002, "clinton" is a bad feature. One measure of a text classification system is how well it protects against concept drift. Feature selection can be bad at protecting against concept drift.

Per-class evaluation measures
Let c_ij be the number of docs of actual class i predicted as class j. (Figure: a confusion matrix of actual vs. predicted class over classes A, B, C.)
- Recall: the fraction of docs in class i classified correctly: c_ii / Σ_j c_ij.
- Precision: the fraction of docs assigned class i that are actually about class i: c_ii / Σ_j c_ji.
- "Correct rate" (1 − error rate): the fraction of docs classified correctly: Σ_i c_ii / Σ_i Σ_j c_ij.

Measures of Accuracy
Evaluation must be done on test data that is independent of the training data. The overall error rate is not a good measure for small classes. Why?
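A sketch computing the per-class measures above from parallel label lists (toy labels, not from the slides):

```python
from collections import Counter

def per_class_prf(actual, predicted):
    """Per-class (precision, recall) plus the overall correct rate."""
    classes = set(actual) | set(predicted)
    pairs = Counter(zip(actual, predicted))  # (actual, predicted) -> count
    out = {}
    for c in classes:
        tp = pairs[(c, c)]
        actual_c = sum(v for (a, _), v in pairs.items() if a == c)
        pred_c = sum(v for (_, p), v in pairs.items() if p == c)
        out[c] = (tp / pred_c if pred_c else 0.0,     # precision
                  tp / actual_c if actual_c else 0.0)  # recall
    correct = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
    return out, correct

# hypothetical labels over classes A, B, C
actual    = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "C", "C"]
print(per_class_prf(actual, predicted))  # correct rate 4/6 ~ 0.67
```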