




Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Text Classification 1
Chris Manning, Pandu Nayak and Prabhakar Raghavan

Prep work (Ch. 13)
This lecture presumes that you've seen the 124 Coursera lecture on Naïve Bayes, or equivalent
Will refer to NB without describing it

Standing queries (Ch. 13)
The path from IR to text classification:
  You have an information need to monitor, say: unrest in the Niger delta region
  You want to rerun an appropriate query periodically to find new news items on this topic
  You will be sent new documents that are found
  I.e., it's not ranking but classification (relevant vs. not relevant)
Such queries are called standing queries
  Long used by "information professionals"
  A modern mass instantiation is Google Alerts
Standing queries are (hand-written) text classifiers

Spam filtering: another text classification task (Ch. 13)
From: "" <takworlld@>
Subject: real estate is the only way... gemoalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
/sales/nmd.htm
=================================================

Categorization/Classification (Sec. 13.1)
Given:
  A representation of a document d
    Issue: how to represent text documents.
    Usually some type of high-dimensional space: bag of words
  A fixed set of classes: C = {c1, c2, …, cJ}
Determine:
  The category of d: γ(d) ∈ C, where γ(d) is a classification function
We want to build classification functions ("classifiers").

Document Classification (Sec. 13.1)
[Figure: document classification example]
Classes: (AI) ML, Planning; (Programming) Semantics, Garb. Coll.; (HCI) Multimedia, GUI; ...
Training data:
  ML: learning intelligence algorithm reinforcement network ...
  Planning: planning temporal reasoning plan language ...
  Semantics: programming semantics language proof ...
  Garb. Coll.: garbage collection memory optimization region ...
Test data: "planning language proof intelligence"

Classification Methods (1) (Ch. 13)
Manual classification
  Used by the original Yahoo! Directory
  Looksmart, ODP, PubMed
  Accurate when job is done by experts
  Consistent when the problem size and team is small
  Difficult and expensive to scale
    Means we need automatic classification methods for big problems

Classification Methods (2) (Ch. 13)
Hand-coded rule-based classifiers
  One technique used by news agencies, intelligence agencies, etc.
  Widely deployed in government and enterprise
  Vendors provide "IDE" for writing such rules
  Commercial systems have complex query languages
  Accuracy can be high if a rule has been carefully refined over time by a subject expert
  Building and maintaining these rules is expensive

A Verity topic: a complex classification rule (Ch. 13)
Note:
  maintenance issues (author, etc.)
  Hand-weighting of terms
[Verity was bought by Autonomy, which was bought by HP ...]
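
A hand-coded rule with hand-weighted terms might look roughly like the sketch below. The terms, weights, and threshold are hypothetical and chosen only for illustration; they are not taken from Verity or any commercial rule language.

```python
import re

# Hypothetical hand-written rule: each term carries a hand-chosen weight.
# The terms, weights, and threshold are illustrative assumptions only.
RULE_WEIGHTS = {"unrest": 0.5, "niger": 0.9, "delta": 0.7, "protest": 0.4}
THRESHOLD = 1.5


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rule_score(text, weights=RULE_WEIGHTS):
    """Sum the weights of the distinct rule terms present in the text."""
    present = set(tokenize(text))
    return sum(w for term, w in weights.items() if term in present)


def is_relevant(text, threshold=THRESHOLD):
    """Classify as relevant when the weighted evidence reaches the threshold."""
    return rule_score(text) >= threshold
```

Every weight and the threshold must be tuned and kept up to date by a person, which is the maintenance cost the slides point to.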

Classification Methods (3): Supervised learning (Sec. 13.1)
Given:
  A document d
  A fixed set of classes: C = {c1, c2, …, cJ}
  A training set D of documents each with a label in C
Determine:
  A learning method or algorithm which will enable us to learn a classifier γ
  For a test document d, we assign it the class γ(d) ∈ C

Classification Methods (3) (Ch. 13)
Supervised learning
  Naive Bayes (simple, common), see video
  k-Nearest Neighbors (simple, powerful)
  Support-vector machines (new, generally more powerful)
  … plus many other methods
No free lunch: requires hand-classified training data
  But data can be built up (and refined) by amateurs
Many commercial systems use a mixture of methods
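
The lecture refers to Naive Bayes without describing it. As a rough reminder, a minimal sketch of learning a classifier γ from a labeled training set D is given below. It assumes the multinomial model with add-one smoothing, which the slides do not specify; the function names `train_nb` and `classify_nb` are made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Count classes and per-class terms from (tokens, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def classify_nb(model, tokens):
    """Return the class with the highest log prior plus log likelihood."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)
        total = sum(term_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # terms never seen in training are ignored
            # Add-one (Laplace) smoothing, an assumption of this sketch.
            score += math.log((term_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training only counts words, which is why the method is cheap to train and test.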

The bag of words representation
γ(
  I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.
) = c

The bag of words representation
γ(
  great      2
  love       2
  recommend  1
  laugh      1
  happy      1
  ...
) = c
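
A minimal sketch of the bag of words mapping, assuming a simple lowercase regex tokenizer (the slides do not say how the text is tokenized or normalized, so the counts here need not match the table above):

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to term counts, discarding word order and position."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)
```

Two texts that contain the same words in a different order produce the same bag, so word order carries no information in this representation.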

Features
Supervised learning classifiers can use any sort of feature
  URL, email address, punctuation, capitalization, dictionaries, network features
In the bag of words view of documents
  We use only word features
  We use all of the words in the text (not a subset)

Feature Selection: Why? (Sec. 13.5)
Text collections have a large number of features
  10,000 to 1,000,000 unique words … and more
Selection may make a particular classifier feasible
  Some classifiers can't deal with 1,000,000 features
Reduces training time
  Training time for some methods is quadratic or worse in the number of features
Makes runtime models smaller and faster
Can improve generalization (performance)
  Eliminates noise features
  Avoids overfitting

Feature Selection: Frequency
The simplest feature selection method:
  Just use the commonest terms
  No particular foundation
  But it makes sense why this works
    They're the words that can be well-estimated and are most often available as evidence
  In practice, this is often 90% as good as better methods
Smarter feature selection: future lecture
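
Frequency-based selection can be sketched in a few lines. The sketch assumes "commonest" means highest collection frequency (total occurrences across all documents); document frequency would be an equally plausible reading. The helper names are hypothetical.

```python
from collections import Counter


def select_frequent_terms(tokenized_docs, k):
    """Return the k terms with the highest collection frequency."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    return {term for term, _ in counts.most_common(k)}


def project(tokens, selected):
    """Keep only the tokens that belong to the selected feature set."""
    return [t for t in tokens if t in selected]
```

The selected set is computed once on the training data and then applied to every training and test document.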

Evaluating Categorization (Sec. 13.6)
Evaluation must be done on test data that are independent of the training data
  Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)
Easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set)
Measures: precision, recall, F1, classification accuracy
Classification accuracy: r/n, where n is the total number of test docs and r is the number of test docs correctly classified
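
The measures named above can be computed as in the following sketch. It assumes a non-empty test set and per-class precision, recall, and F1 for one designated positive class; returning 0.0 when a denominator is zero is a convention of this example, not something the slides state.

```python
def accuracy(gold, predicted):
    """Classification accuracy r/n: correctly classified docs over all docs."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Both functions must be fed predictions on documents the learner did not see during training; otherwise the numbers say little about generalization.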

WebKB Experiment (1998) (Sec. 13.6)
Classify webpages from CS departments into: student, faculty, course, project
Train on ~5,000 hand-labeled web pages
  Cornell, Washington, U. Texas, Wisconsin
Crawl and classify a new site (CMU) using Naïve Bayes
Results

SpamAssassin
Naïve Bayes has found a home in spam filtering
  Paul Graham's A Plan for Spam
  Widely used in spam filters
But many features beyond words:
  black hole lists, etc.
  particular hand-crafted text patterns

SpamAssassin Features:
  Basic (Naïve) Bayes spam probability
  Mentions: Generic Viagra
  Regex: millions of (dollar) ((dollar) NN,NNN,NNN.NN)
  Phrase: impress ... girl
  Phrase: 'Prestigious Non-Accredited Universities'
  From: starts with many numbers
  Subject is all capitals
  HTML has a low ratio of text to image area
  Relay in RBL, /enduserinfo_rbl.html
  RCVD line looks faked
  /tests_3_3_x.html

Naive Bayes is Not So Naive
Very fast learning and testing (basically just count words)
Low storage requirements
Very good in domains with many equally important features
More robust to irrelevant features than many learning methods
  Irrelevant features cancel each other without affecting results
More robust to concept drift (changing class definition over time)
Naive Bayes won 1st and 2nd place in KDD-CUP 97 competition out of 16 systems
  Goal: Financial services industry direct mail response prediction: predict if the recipient of mail will actually respond to the advertisement. 750,000 records.
A good dependable baseline for text classification (but not the best)!

Classification Using Vector Spaces
In vector space classification, training set corresponds to a labeled set of points (equivalently, vectors)
Premise 1: Documents in the same class form a contiguous region of space
Premise 2: Documents from different classes don't overlap (much)
Learning a classifier: build surfaces to delineate classes in the space

Documents in a Vector Space (Sec. 14.1)
[Figure: points labeled Government, Science, Arts]

Test Document of what class? (Sec. 14.1)
[Figure: points labeled Government, Science, Arts]

Test Document = Government (Sec. 14.1)
[Figure: points labeled Government, Science, Arts]
Is this similarity hypothesis true in general?
Our focus: how to find good separators

Definition of centroid (Sec. 14.2)
  μ(c) = (1/|Dc|) Σ_{d ∈ Dc} v(d)
Where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.
Note that centroid will in general not be a unit vector even when the inputs are unit vectors.

Rocchio classification (Sec. 14.2)
Rocchio forms a simple representative for each class: the centroid/prototype
Classification: nearest prototype/centroid
It does not guarantee that classifications are consistent with the given training data
Little used outside text classification
  It has been used quite effectively for text classification
  But in general worse than Naïve Bayes
Again, cheap to train and test documents

k Nearest Neighbor Classification (Sec. 14.3)
kNN = k Nearest Neighbor
To classify a document d:
  Define k-neighborhood as the k nearest neighbors of d
  Pick the majority class label in the k-neighborhood

Example: k=6 (6NN) (Sec. 14.3)
[Figure: points labeled Government, Science, Arts]
P(science| )?

Nearest-Neighbor Learning (Sec. 14.3)
Learning: just store the labeled training examples D
Testing instance x (under 1NN):
  Compute similarity between x and all examples in D.
  Assign x the category of the most similar example in D.
Does not compute anything beyond storing the examples
Also called:
  Case-based learning
  Memory-based learning
  Lazy learning
Rationale of kNN: contiguity hypothesis

k Nearest Neighbor (Sec. 14.3)
Using only the closest example (1NN) is subject to errors due to:
  A single atypical example.
  Noise (i.e., an error) in the category label of a single training example.
More robust: find the k examples and return the majority category of these k
k is typically odd to avoid ties; 3 and 5 are most common

kNN decision boundaries (Sec. 14.3)
[Figure: points labeled Government, Science, Arts]
Boundaries are in principle arbitrary surfaces, but usually polyhedra
kNN gives locally defined decision boundaries between classes: far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)

Illustration of 3 Nearest Neighbor for Text Vector Space (Sec. 14.3)

3 Nearest Neighbor vs. Rocchio
Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

kNN: Discussion (Sec. 14.3)
No feature selection necessary
No training necessary
Scales well with large number of classes
  Don't need to train n classifiers for n classes
Classes can influence each other
  Small changes to one class can have ripple effect
May be expensive at test time
In most cases it's more accurate than NB or Rocchio

Let's test our intuition
Can a bag of words always be viewed as a vector space?
What about a bag of features?
Can we always view a standing query as a region in a vector space?
What about Boolean queries on terms?
What do "rectangles" equate to?

Bias vs. capacity: notions and terminology
Consider asking a botanist: Is an object a tree?
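
The Rocchio and kNN procedures from the vector space slides above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are sparse vectors stored as dicts, "nearest" is measured by cosine similarity (the slides do not fix a distance measure), and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroids(training):
    """Average the vectors of each class: the sum of v(d) divided by |Dc|."""
    sums, sizes = defaultdict(Counter), Counter()
    for vec, label in training:
        sizes[label] += 1
        for term, w in vec.items():
            sums[label][term] += w
    return {c: {t: w / sizes[c] for t, w in s.items()} for c, s in sums.items()}


def rocchio_classify(cents, vec):
    """Assign the class whose centroid is most similar to the document."""
    return max(cents, key=lambda c: cosine(cents[c], vec))


def knn_classify(training, vec, k=3):
    """Assign the majority label among the k most similar training docs."""
    nearest = sorted(training, key=lambda ex: cosine(ex[0], vec), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Rocchio does its work at training time (one centroid per class), while kNN stores the examples and defers all computation to test time, which is why it may be expensive at test time.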