Foundations of Machine Learning

Linear Model for Classification

Outline:
- Logistic Regression (also called log-odds regression)
- Tuning models with grid search
- Multi-class classification
- The class-imbalance problem
- Multi-label classification

Logistic Regression

In Linear Regression, we discussed simple linear regression, multiple linear regression, and polynomial regression. These models are special cases of the generalized linear model, a flexible framework that requires fewer assumptions than ordinary linear regression. In this lesson, we will discuss some of these assumptions as they relate to another special case of the generalized linear model called logistic regression.

Logistic regression is a method for classifying data into discrete outcomes. A classification problem is much like a regression problem, except that the values to be predicted are discrete rather than continuous. For example, we can use logistic regression to classify an email as spam or not spam. Could we instead apply regression analysis directly to a classification problem?

Binary classification with logistic regression

In logistic regression, the response variable describes the probability that the outcome is the positive case. If the response variable is equal to or exceeds a discrimination threshold, the positive class is predicted; otherwise, the negative class is predicted. The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function.
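As a minimal sketch of this mapping (the weights, bias, and sample values here are illustrative, not learned from data):

    import numpy as np

    def sigmoid(z):
        # logistic function: squashes any real number into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    w, b = np.array([1.5, -2.0]), 0.25   # illustrative model parameters
    x = np.array([0.8, 0.3])             # one sample with two features

    p = sigmoid(w @ x + b)               # P(y = 1 | x), about 0.70 here
    y_hat = int(p >= 0.5)                # 0.5 is the discrimination threshold
    print(p, y_hat)                      # the positive class is predicted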

Cost Function

We cannot use the same cost function in logistic regression as in linear regression: with the logistic function in the model, the squared-error cost becomes wavy, with many local minima; that is, it is no longer a convex function. Instead, each example is charged a cost of -log(h_theta(x)) when y = 1 and -log(1 - h_theta(x)) when y = 0.

Because y ∈ {0, 1}, the two cases can be combined, and the objective function simplifies to

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]

where h_\theta(x) = 1/(1 + e^{-\theta^{T}x}) is the model's predicted probability of the positive class.

Cost Function for Solving Overfitting

Regularization can be applied to logistic regression in the same way as to linear regression, to address overfitting.

sklearn.linear_model.LogisticRegression

    class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False,
        tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1,
        class_weight=None, random_state=None, solver='lbfgs', max_iter=100,
        multi_class='warn', verbose=0, warm_start=False, n_jobs=None)

Tuning models with grid search

Hyperparameters are parameters of the model that are not learned. For example, hyperparameters of our logistic regression SMS classifier include the value of the regularization term and the thresholds used to remove words that appear too frequently or too infrequently. In scikit-learn, hyperparameters are set through the constructor. In the previous examples we did not set any arguments for LogisticRegression(); we used the default values for all of the hyperparameters. These default values are often a good start, but they may not produce the optimal model. Grid search is a common method to select the hyperparameter values that produce the best model. Grid search takes a set of possible values for each hyperparameter that should be tuned, and evaluates a model trained on each element of the Cartesian product of the sets. That is, grid search is an exhaustive search that trains and evaluates a model for each possible combination of the hyperparameter values supplied by the developer.

GridSearchCV:

    class sklearn.model_selection.GridSearchCV(estimator, param_grid, *,
        scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None,
        verbose=0, pre_dispatch='2*n_jobs', error_score=nan,
        return_train_score=False)

RandomizedSearchCV:

    class sklearn.model_selection.RandomizedSearchCV(estimator,
        param_distributions, *, n_iter=10, scoring=None, n_jobs=None,
        iid='deprecated', refit=True, cv=None, verbose=0,
        pre_dispatch='2*n_jobs', random_state=None, error_score=nan,
        return_train_score=False)

A pipeline and parameter grid for the SMS spam classifier:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english')),
        ('clf', LogisticRegression())
    ])
    parameters = {
        'vect__max_df': (0.25, 0.5, 0.75),
        'vect__stop_words': ('english', None),
        'vect__max_features': (2500, 5000, 10000, None),
        'vect__ngram_range': ((1, 1), (1, 2)),
        'vect__use_idf': (True, False),
        'vect__norm': ('l1', 'l2'),
        'clf__penalty': ('l1', 'l2'),
        'clf__C': (0.01, 0.1, 1, 10),
    }
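To make the size of the Cartesian product concrete: the grid above has 3 × 2 × 4 × 2 × 2 × 2 × 2 × 4 = 1536 hyperparameter combinations, so with the 3-fold cross-validation used below, GridSearchCV trains and evaluates 1536 × 3 = 4608 models (plus one final refit of the best combination).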
Loading the SMS Spam Collection (each line of the file is a label, a tab, and the message text):

    from os import path

    file_name = path.dirname(__file__) + \
        "/../data/SMSSpamCollection.txt"
    X, y = [], []
    with open(file_name, 'r', encoding='UTF-8') as file:
        line = file.readline()
        while line:
            d = line.split("\t")
            X.append(d[1])          # message text
            y.append(d[0])          # label: 'ham' or 'spam'
            line = file.readline()

    a = {'ham': 0, 'spam': 1}       # map string labels to integers
    y = [a[s] for s in y]

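As an aside, the same tab-separated file can be loaded more compactly with pandas; this is a sketch assuming pandas is installed (the column names are chosen here for illustration):

    import pandas as pd

    df = pd.read_csv(file_name, sep='\t', header=None,
                     names=['label', 'text'])
    X = df['text'].tolist()
    y = (df['label'] == 'spam').astype(int).tolist()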
Running the grid search:

    from sklearn.metrics import accuracy_score, precision_score, recall_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
                               verbose=1, scoring='accuracy', cv=3)
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Precision:', precision_score(y_test, predictions))
    print('Recall:', recall_score(y_test, predictions))

Multi-class classification

The goal of multi-class classification is to assign an instance to one of a set of classes. scikit-learn uses one-vs.-all (one-vs.-the-rest) or multinomial strategies to support multi-class classification. One-vs.-all classification uses one binary classifier for each of the possible classes; the class that is predicted with the greatest confidence is assigned to the instance. LogisticRegression supports multi-class classification using the one-versus-all strategy out of the box.

Decomposition strategies

The three classic decomposition strategies are one-vs.-one (OvO), one-vs.-rest (OvR), and many-vs.-many (MvM).

OvO pairs the N classes two by two, producing N(N - 1)/2 binary classification tasks. For example, OvO trains a classifier to distinguish class Ci from class Cj, taking the Ci examples in the training set D as positive and the Cj examples as negative. At test time, a new sample is submitted to all the classifiers, yielding N(N - 1)/2 predictions; the final result can then be decided by voting, i.e. the class predicted most often is taken as the final classification.

OvR trains N classifiers, each time taking the examples of one class as positive and the examples of all the other classes as negative. If exactly one classifier predicts positive at test time, the corresponding class label is the final result. If several classifiers predict positive, the prediction confidences of the classifiers are usually compared and the class label with the highest confidence is chosen.

MvM takes several classes as positive and several other classes as negative each time; clearly, OvO and OvR are special cases of MvM. The positive and negative classes of MvM must be constructed by careful design rather than picked arbitrarily. The most common MvM technique is Error-Correcting Output Codes (ECOC), which brings the idea of coding into class decomposition and aims to keep some error tolerance in the decoding step.
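A minimal sketch of these strategies using scikit-learn's wrapper classes (documented below), with logistic regression as the base binary classifier on the Iris data:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import (OneVsOneClassifier, OneVsRestClassifier,
                                    OutputCodeClassifier)

    X, y = load_iris(return_X_y=True)
    base = LogisticRegression(solver='lbfgs', max_iter=1000)

    for wrapper in (OneVsRestClassifier(base),   # OvR: N classifiers
                    OneVsOneClassifier(base),    # OvO: N(N - 1)/2 classifiers
                    OutputCodeClassifier(base, code_size=2, random_state=0)):
        print(type(wrapper).__name__, wrapper.fit(X, y).score(X, y))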

sklearn.linear_model.LogisticRegression

    class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False,
        tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1,
        class_weight=None, random_state=None, solver='lbfgs', max_iter=100,
        multi_class='warn', verbose=0, warm_start=False, n_jobs=None)

multi_class : str, {'ovr', 'multinomial', 'auto'}, default: 'ovr'. If the option chosen is 'ovr', then a binary problem is fit for each label. For 'multinomial' the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. 'multinomial' is unavailable when solver='liblinear'. 'auto' selects 'ovr' if the data is binary, or if solver='liblinear', and otherwise selects 'multinomial'.

sklearn.multiclass.OneVsRestClassifier(estimator, n_jobs=None)

One-vs-the-rest (OvR) multiclass/multilabel strategy. Also known as one-vs-all, this strategy consists in fitting one classifier per class. This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html

sklearn.multiclass.OneVsOneClassifier(estimator, n_jobs=None)

One-vs-one multiclass strategy. This strategy consists in fitting one classifier per class pair; at prediction time, the class which received the most votes is selected. Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-vs-the-rest, due to its O(n_classes^2) complexity. However, it may be advantageous for algorithms such as kernel algorithms which don't scale well with n_samples, because each individual learning problem only involves a small subset of the data, whereas with one-vs-the-rest the complete dataset is used n_classes times.
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html

sklearn.multiclass.OutputCodeClassifier

(Error-Correcting) Output-Code multiclass strategy. Output-code based strategies consist in representing each class with a binary code (an array of 0s and 1s). At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to project new points into the class space, and the class closest to the points is chosen. The main advantage of these strategies is that the number of classifiers used can be controlled by the user, either for compressing the model (0 < code_size < 1) or for making the model more robust to errors (code_size > 1). See the documentation for more details.
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OutputCodeClassifier.html

The class-imbalance problem

Class imbalance refers to classification tasks in which the numbers of training examples of the different classes differ greatly. Without loss of generality, assume the positive class has few examples and the negative class has many; this situation arises frequently in real-world classification tasks. There are three main remedies:

1. Undersampling: remove some negative examples from the training set so that the numbers of positive and negative examples become close, then learn as usual.
2. Oversampling: add positive examples so that the numbers of positive and negative examples become close, then learn as usual.
3. Threshold-moving: learn from the original training set, but when the trained classifier makes predictions, embed the rescaling

\frac{y'}{1-y'} = \frac{y}{1-y} \times \frac{m^-}{m^+}

into its decision process, where y is the predicted probability of the positive class and m^+ and m^- are the numbers of positive and negative training examples.
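A minimal sketch of threshold-moving with a fitted scikit-learn classifier (clf, X_test, and y_train are placeholder names; binary labels 0/1 are assumed):

    import numpy as np

    y_train = np.asarray(y_train)
    m_pos, m_neg = np.sum(y_train == 1), np.sum(y_train == 0)

    p = clf.predict_proba(X_test)[:, 1]    # predicted P(y = 1 | x)
    p = np.clip(p, 1e-12, 1 - 1e-12)       # guard against division by zero
    odds = p / (1 - p)                     # y / (1 - y)
    rescaled = odds * (m_neg / m_pos)      # y'/(1 - y') = y/(1 - y) * m-/m+
    y_pred = (rescaled >= 1).astype(int)   # positive iff rescaled odds >= 1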
In scikit-learn there are some imbalance-correction techniques, which vary according to the learning algorithm you are using. Some estimators, such as SVMs and logistic regression, have a class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced', it will weight each class example proportionally to the inverse of its frequency. Unfortunately, scikit-learn has no preprocessing tool for this purpose.

The imbalanced-learn project (https://github.com/scikit-learn-contrib/imbalanced-learn) contains many algorithms, including SMOTE, in the following categories:
- Under-sampling the majority class(es)
- Over-sampling the minority class
- Combining over- and under-sampling
- Creating ensemble balanced sets

Synthetic Minority Oversampling Technique (SMOTE)

SMOTE is an improvement on random oversampling. Random oversampling increases the minority class simply by duplicating samples, which easily leads to overfitting: the information the model learns is too specific and not general enough. The SMOTE algorithm proceeds as follows:

1. For each sample x in the minority class, compute its Euclidean distance to every sample in the minority set S_min to obtain its k nearest neighbours.
2. Set a sampling ratio according to the class-imbalance ratio to determine an oversampling multiplier N. For each minority sample x, randomly select some of its k nearest neighbours; call a selected neighbour x_n.
3. For each selected neighbour x_n, construct a new sample from the original sample by

x_{\text{new}} = x + \text{rand}(0, 1) \times |x - x_n|

Multi-label classification and problem transformation

Problem transformation methods are techniques that cast the original multi-label problem as a set of single-label classification problems, for example:
- Convert each set of labels encountered in the training data to a single label.
- Train one binary classifier for each of the labels in the training set (sketched below).
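A minimal sketch of the second transformation, one binary classifier per label, using OneVsRestClassifier in its multilabel mode (the toy features and label sets are illustrative):

    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]     # toy feature vectors
    labels = [['news'], ['sports'], ['news', 'sports'], []]  # label sets per sample

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)   # indicator matrix: Y[i, j] = 1 iff sample i has label j

    clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
    print(mlb.inverse_transform(clf.predict([[0.9, 0.9]])))  # e.g. [('news', 'sports')]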
