




Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Text Classification 1
Chris Manning, Pandu Nayak and Prabhakar Raghavan

Prep work (Ch. 13)
- This lecture presumes that you’ve seen the 124 Coursera lecture on Naïve Bayes, or equivalent
- Will refer to NB without describing it

Standing queries (Ch. 13)
- The path from IR to text classification:
  - You have an information need to monitor, say: Unrest in the Niger delta region
  - You want to rerun an appropriate query periodically to find new news items on this topic
  - You will be sent new documents that are found
  - I.e., it’s not ranking but classification (relevant vs. not relevant)
- Such queries are called standing queries
  - Long used by “information professionals”
  - A modern mass instantiation is Google Alerts
- Standing queries are (hand-written) text classifiers
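
To make the last point concrete, here is a minimal sketch of a standing query written as a Boolean rule that labels each incoming document relevant or not relevant. The rule itself, the term lists, and the function names `tokenize` and `standing_query` are illustrative assumptions, not something taken from the lecture.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def standing_query(text):
    """Hypothetical hand-written rule:
    (niger AND delta) AND (unrest OR protest OR attack).
    Returns True for 'relevant', False for 'not relevant'."""
    words = tokenize(text)
    has_region = {"niger", "delta"} <= words
    has_unrest = bool(words & {"unrest", "protest", "attack"})
    return has_region and has_unrest


# Rerunning the query over newly arrived items is a classification pass,
# not a ranking: each item is either kept or dropped.
new_items = [
    "Unrest spreads across the Niger Delta",
    "Niger river delta tourism guide",
    "Protest in Lagos",
]
relevant = [doc for doc in new_items if standing_query(doc)]
```

With the three sample items above, only the first one satisfies both halves of the rule.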

Spam filtering (Ch. 13)
Another text classification task

From: "" <takworlld@>
Subject: real estate is the only way... gemoalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
/sales/nmd.htm
=================================================

Categorization/Classification (Sec. 13.1)
- Given:
  - A representation of a document d
    - Issue: how to represent text documents.
    - Usually some type of high-dimensional space: bag of words
  - A fixed set of classes: C = {c1, c2, …, cJ}
- Determine:
  - The category of d: γ(d) ∈ C, where γ(d) is a classification function
  - We want to build classification functions (“classifiers”).

Document Classification (Sec. 13.1)
- Classes: (AI) ML, Planning; (Programming) Semantics, Garb. Coll.; (HCI) Multimedia, GUI; ...
- Training data:
  - ML: learning, intelligence, algorithm, reinforcement, network, ...
  - Planning: planning, temporal, reasoning, plan, language, ...
  - Semantics: programming, semantics, language, proof, ...
  - Garb. Coll.: garbage, collection, memory, optimization, region, ...
- Test data: “planning language proof intelligence”

Classification Methods (1) (Ch. 13)
- Manual classification
  - Used by the original Yahoo! Directory
  - Looksmart, ODP, PubMed
  - Accurate when job is done by experts
  - Consistent when the problem size and team is small
  - Difficult and expensive to scale
    - Means we need automatic classification methods for big problems

Classification Methods (2) (Ch. 13)
- Hand-coded rule-based classifiers
  - One technique used by news agencies, intelligence agencies, etc.
  - Widely deployed in government and enterprise
  - Vendors provide “IDE” for writing such rules
  - Commercial systems have complex query languages
  - Accuracy can be high if a rule has been carefully refined over time by a subject expert
  - Building and maintaining these rules is expensive

A Verity topic (Ch. 13)
- A complex classification rule
- Note:
  - maintenance issues (author, etc.)
  - Hand-weighting of terms
- [Verity was bought by Autonomy, which was bought by HP ...]

Classification Methods (3): Supervised learning (Sec. 13.1)
- Given:
  - A document d
  - A fixed set of classes: C = {c1, c2, …, cJ}
  - A training set D of documents each with a label in C
- Determine:
  - A learning method or algorithm which will enable us to learn a classifier γ
  - For a test document d, we assign it the class γ(d) ∈ C

Classification Methods (3) (Ch. 13)
- Supervised learning
  - Naive Bayes (simple, common), see video
  - k-Nearest Neighbors (simple, powerful)
  - Support-vector machines (new, generally more powerful)
  - … plus many other methods
  - No free lunch: requires hand-classified training data
  - But data can be built up (and refined) by amateurs
- Many commercial systems use a mixture of methods

The bag of words representation

γ("I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.") = c

The bag of words representation

γ(great 2, love 2, recommend 1, laugh 1, happy 1, ...) = c
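
The step from raw text to word counts can be sketched in a few lines of Python. This is a minimal illustration that assumes a simple lowercase regex tokenizer; the function name `bag_of_words` and the sample sentence are made up for the example, and there is no stemming or stop-word handling.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to word counts; word order is discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Illustrative sentence, not the review from the slide.
bag = bag_of_words("I love this movie! It's great, great fun and I love the humor.")
```

Because only counts are kept, two documents containing the same words in a different order map to the same representation.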

Features
- Supervised learning classifiers can use any sort of feature
  - URL, email address, punctuation, capitalization, dictionaries, network features
- In the bag of words view of documents
  - We use only word features
  - We use all of the words in the text (not a subset)

Feature Selection: Why? (Sec. 13.5)
- Text collections have a large number of features
  - 10,000 to 1,000,000 unique words … and more
- Selection may make a particular classifier feasible
  - Some classifiers can’t deal with 1,000,000 features
- Reduces training time
  - Training time for some methods is quadratic or worse in the number of features
- Makes runtime models smaller and faster
- Can improve generalization (performance)
  - Eliminates noise features
  - Avoids overfitting

Feature Selection: Frequency
- The simplest feature selection method:
  - Just use the commonest terms
  - No particular foundation
  - But it makes sense why this works
    - They’re the words that can be well-estimated and are most often available as evidence
- In practice, this is often 90% as good as better methods
- Smarter feature selection: future lecture
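
Frequency-based selection is simple enough to sketch directly. The version below assumes that "commonest" means highest total count across the collection (document frequency would be an equally plausible reading); the function names `select_frequent_terms` and `project` and the toy documents are hypothetical.

```python
from collections import Counter


def select_frequent_terms(tokenized_docs, k):
    """Return the k terms with the highest total count in the collection."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    return [term for term, _ in counts.most_common(k)]


def project(tokens, vocabulary):
    """Keep only the tokens that belong to the selected vocabulary."""
    vocab = set(vocabulary)
    return [t for t in tokens if t in vocab]


docs = [
    ["spam", "offer", "buy"],
    ["spam", "buy", "now"],
    ["spam", "meeting"],
]
vocabulary = select_frequent_terms(docs, k=2)
reduced = project(["buy", "cheap", "spam"], vocabulary)
```

Every document is then represented only by the selected terms, which shrinks the feature space the classifier has to handle.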

Evaluating Categorization (Sec. 13.6)
- Evaluation must be done on test data that are independent of the training data
  - Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)
- Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)

Evaluating Categorization (Sec. 13.6)
- Measures: precision, recall, F1, classification accuracy
- Classification accuracy: r/n where n is the total number of test docs and r is the number of test docs correctly classified
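
The measures named above can be computed from paired gold and predicted labels. The sketch below uses the r/n definition of accuracy from the slide and the standard definitions of precision, recall and F1 for one chosen positive class; the function names and the toy labels are illustrative assumptions.

```python
def accuracy(gold, predicted):
    """r / n: correctly classified test docs over all test docs."""
    r = sum(g == p for g, p in zip(gold, predicted))
    return r / len(gold)


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy test set: labels are made up for illustration.
gold = ["spam", "spam", "ham", "ham", "spam"]
pred = ["spam", "ham", "ham", "spam", "spam"]
acc = accuracy(gold, pred)
p, r, f1 = precision_recall_f1(gold, pred, positive="spam")
```

In this toy run three of five documents are classified correctly, and the spam class has two true positives, one false positive and one false negative.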

WebKB Experiment (1998) (Sec. 13.6)
- Classify webpages from CS departments into:
  - student, faculty, course, project
- Train on ~5,000 hand-labeled web pages
  - Cornell, Washington, U. Texas, Wisconsin
- Crawl and classify a new site (CMU) using Naïve Bayes
- Results

SpamAssassin
- Naïve Bayes has found a home in spam filtering
  - Paul Graham’s A Plan for Spam
  - Widely used in spam filters
  - But many features beyond words:
    - black hole lists, etc.
    - particular hand-crafted text patterns

SpamAssassin Features:
- Basic (Naïve) Bayes spam probability
- Mentions: Generic Viagra
- Regex: millions of (dollar) ((dollar) NN,NNN,NNN.NN)
- Phrase: impress ... girl
- Phrase: ‘Prestigious Non-Accredited Universities’
- From: starts with many numbers
- Subject is all capitals
- HTML has a low ratio of text to image area
- Relay in RBL, /enduserinfo_rbl.html
- RCVD line looks faked
- /tests_3_3_x.html

Naive Bayes is Not So Naive
- Very fast learning and testing (basically just count words)
- Low storage requirements
- Very good in domains with many equally important features
- More robust to irrelevant features than many learning methods
  - Irrelevant features cancel each other without affecting results
- More robust to concept drift (changing class definition over time)
- Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems
  - Goal: Financial services industry direct mail response prediction: predict if the recipient of mail will actually respond to the advertisement (750,000 records).
- A good dependable baseline for text classification (but not the best)!
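
The lecture refers to Naive Bayes without describing it, so the following is only a hedged sketch of one common variant, multinomial Naive Bayes with add-one smoothing, to illustrate why learning is "basically just count words". The function names `train_nb` and `classify_nb`, the class labels and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs.
    Training only counts documents per class and terms per class."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add-one smoothing: every vocabulary term gets one extra count.
        total = sum(term_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + 1) / total) for t in vocab
        }
    return log_prior, log_likelihood, vocab


def classify_nb(tokens, model):
    """Return the class with the highest log posterior score.
    Terms never seen in training are ignored."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in tokens if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


training = [
    ("chinese beijing chinese".split(), "china"),
    ("chinese chinese shanghai".split(), "china"),
    ("chinese macao".split(), "china"),
    ("tokyo japan chinese".split(), "other"),
]
model = train_nb(training)
label = classify_nb("chinese chinese chinese tokyo japan".split(), model)
```

Scores are summed in log space so that long documents do not underflow; the model itself is just a table of counts turned into log probabilities, which is where the low storage cost comes from.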

Classification Using Vector Spaces
- In vector space classification, training set corresponds to a labeled set of points (equivalently, vectors)
- Premise 1: Documents in the same class form a contiguous region of space
- Premise 2: Documents from different classes don’t overlap (much)
- Learning a classifier: build surfaces to delineate classes in the space

Documents in a Vector Space (Sec. 14.1)
- (Figure: documents from the classes Government, Science, Arts)

Test Document of what class? (Sec. 14.1)
- (Figure: Government, Science, Arts)

Test Document = Government (Sec. 14.1)
- (Figure: Government, Science, Arts)
- Is this similarity hypothesis true in general?
- Our focus: how to find good separators

Definition of centroid (Sec. 14.2)

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

- Where D_c is the set of all documents that belong to class c and v(d) is the vector space representation of d.
- Note that centroid will in general not be a unit vector even when the inputs are unit vectors.

Rocchio classification (Sec. 14.2)
- Rocchio forms a simple representative for each class: the centroid/prototype
- Classification: nearest prototype/centroid
- It does not guarantee that classifications are consistent with the given training data

Rocchio classification (Sec. 14.2)
- Little used outside text classification
  - It has been used quite effectively for text classification
  - But in general worse than Naïve Bayes
- Again, cheap to train and test documents

k Nearest Neighbor Classification (Sec. 14.3)
- kNN = k Nearest Neighbor
- To classify a document d:
  - Define k-neighborhood as the k nearest neighbors of d
  - Pick the majority class label in the k-neighborhood

Example: k=6 (6NN) (Sec. 14.3)
- (Figure: Government, Science, Arts)
- P(science | test document)?

Nearest-Neighbor Learning (Sec. 14.3)
- Learning: just store the labeled training examples D
- Testing instance x (under 1NN):
  - Compute similarity between x and all examples in D.
  - Assign x the category of the most similar example in D.
- Does not compute anything beyond storing the examples
- Also called:
  - Case-based learning
  - Memory-based learning
  - Lazy learning
- Rationale of kNN: contiguity hypothesis

k Nearest Neighbor (Sec. 14.3)
- Using only the closest example (1NN) is subject to errors due to:
  - A single atypical example.
  - Noise (i.e., an error) in the category label of a single training example.
- More robust: find the k examples and return the majority category of these k
- k is typically odd to avoid ties; 3 and 5 are most common

kNN decision boundaries (Sec. 14.3)
- (Figure: Government, Science, Arts)
- Boundaries are in principle arbitrary surfaces, but usually polyhedra
- kNN gives locally defined decision boundaries between classes; far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)

Illustration of 3 Nearest Neighbor for Text Vector Space (Sec. 14.3)

3 Nearest Neighbor vs. Rocchio
- Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

kNN: Discussion (Sec. 14.3)
- No feature selection necessary
- No training necessary
- Scales well with large number of classes
  - Don’t need to train n classifiers for n classes
- Classes can influence each other
  - Small changes to one class can have ripple effect
- May be expensive at test time
- In most cases it’s more accurate than NB or Rocchio

Let’s test our intuition
- Can a bag of words always be viewed as a vector space?
- What about a bag of features?
- Can we always view a standing query as a region in a vector space?
- What about Boolean queries on terms?
- What do “rectangles” equate to?

Bias vs. capacity: notions and terminology
- Consider asking a botanist: Is an object a tree?
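
Looking back at the Rocchio and kNN slides, both classifiers fit in a short sketch. Rocchio averages the training vectors of each class into a centroid and assigns a test vector to the nearest one; kNN stores the examples and takes a majority vote among the k closest. Euclidean distance via `math.dist` (Python 3.8+) is an assumption here, as are the function names and the two-dimensional toy vectors.

```python
import math
from collections import Counter


def train_rocchio(examples):
    """examples: list of (vector, label). Returns {label: centroid},
    where each centroid is the component-wise mean of its class."""
    sums, counts = {}, Counter()
    for vec, label in examples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] += 1
    return {c: [x / counts[c] for x in s] for c, s in sums.items()}


def classify_rocchio(vec, centroids):
    """Assign the class whose centroid is nearest to vec."""
    return min(centroids, key=lambda c: math.dist(vec, centroids[c]))


def classify_knn(vec, examples, k=3):
    """Majority class label among the k nearest stored examples."""
    nearest = sorted(examples, key=lambda ex: math.dist(vec, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled points; class names borrowed from the slide figures.
examples = [
    ([1.0, 0.0], "government"),
    ([0.9, 0.1], "government"),
    ([0.0, 1.0], "science"),
    ([0.1, 0.9], "science"),
    ([0.2, 0.8], "science"),
]
centroids = train_rocchio(examples)
rocchio_label = classify_rocchio([0.8, 0.2], centroids)
knn_label = classify_knn([0.8, 0.2], examples, k=3)
```

The sketch also shows the cost trade-off from the discussion slide: Rocchio compares a test vector against one centroid per class, while kNN has no training step but scans every stored example at test time.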