Statistical Software R Assignment: adult and babiesI Data

"Data Analysis and Statistical Software" assignment. Name: Yang Yejun. Student ID: 2010110148

adult and babiesI data

Part 1

adult data

Contents

1 Data overview
2.1, 2.2, 2.3: rpart analysis
2.4: Ensemble methods: adaboost, bagging, random forest
2.5: Nearest-neighbor analysis
2.6: Artificial neural network analysis
2.7: Support vector machine analysis
2.8: Association rule analysis

Variable description

No.  Variable         Description
1    class            income class (>50K, <=50K)
2    age              age (continuous)
3    workclass        work class (private, never worked, etc.)
4    fnlwgt           (continuous)
5    education        education (Bachelors, Masters, Doctorate, etc.)
6    education.num    years of education (continuous)
7    marital.status   marital status (never married, married to an armed-forces spouse, married to a civilian spouse, etc.)
8    occupation       occupation (tech support, sales, etc.)
9    relationship     relationship (wife, husband, etc.)
10   race             race (White, Black, etc.)
11   sex              sex (female, male)
12   capital.gain     capital gain (continuous)
13   capital.loss     capital loss (continuous)
14   hours.per.week   hours worked per week (continuous)
15   native.country   native country (United States, Cambodia, England, etc.)

2.1 Classification tree (rpart) analysis: program

    library(rpart)
    w  = read.table("e:/adult.txt", header=TRUE, sep=",")      # training set
    wt = read.table("e:/adulttest.txt", header=TRUE, sep=",")  # test set
    summary(w); summary(wt)
    (b = rpart(class ~ ., w)); b                                # fit the tree on all predictors and print it
    plot(b, uniform=TRUE, branch=1, margin=0.1, cex=0.9); text(b, cex=0.85)
    table(predict(b, w, type="class"), w[["class"]])            # training set: predicted vs. actual
    table(predict(b, wt, type="class"), wt[["class"]])          # test set: predicted vs. actual

2.1 Classification tree (rpart) analysis: output

Training set w classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           23473          3816
Predicted >50K             1247          4025

Misclassification rate: 0.155493

Test set wt classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           11805          1901
Predicted >50K              630          1945

Misclassification rate: 0.155457

2.2 Classification tree (rpart) analysis: program (variable selection 1)

Because education and education.num (years of education) are highly correlated, only education.num is used.

    summary(w)
    # fnlwgt and relationship are also absent from this formula
    (b1 = rpart(class ~ age + workclass + education.num + marital.status + occupation +
                  race + sex + capital.gain + capital.loss + hours.per.week + native.country, w)); b1
    plot(b1); text(b1, use.n=TRUE)
    table(predict(b1, w, type="class"), w[["class"]])
    table(predict(b1, wt, type="class"), wt[["class"]])

Training set w classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           23443          3807
Predicted >50K             1277          4034

Misclassification rate: 0.156138

Test set wt classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           11802          1893
Predicted >50K              633          1953

Misclassification rate: 0.155150
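The misclassification rates quoted with the confusion tables can be checked by hand: each is the off-diagonal count divided by the total count. The following is a minimal sketch in base R that re-enters the 2.1 tables manually; the helper name `misclass_rate` is hypothetical and is not part of rpart.

```r
# Hypothetical helper: share of off-diagonal (misclassified) counts in a
# square confusion matrix.
misclass_rate <- function(cm) {
  1 - sum(diag(cm)) / sum(cm)
}

# 2.1 confusion matrices, typed in from the tables
# (rows: predicted class, columns: actual class).
lv <- c("<=50K", ">50K")
train_cm <- matrix(c(23473, 3816,
                      1247, 4025),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = lv, actual = lv))
test_cm <- matrix(c(11805, 1901,
                      630, 1945),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(predicted = lv, actual = lv))

train_rate <- misclass_rate(train_cm)  # (3816 + 1247) / 32561
test_rate  <- misclass_rate(test_cm)   # (1901 + 630) / 16281

print(round(train_rate, 6))  # 0.155493
print(round(test_rate, 6))   # 0.155457
```

The same helper could be applied directly to the output of `table(predict(...), ...)`, since that is also a square matrix of counts.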

2.2 Classification tree (rpart) analysis: results (variable selection 1)

The test-set misclassification rate drops slightly, to 0.155150; without variable selection it was 0.155457.

2.3 Classification tree (rpart) analysis: results (variable selection 2)

    n= 32561
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    1) root 32561 7841 <=50K (0.75919044 0.24080956)
      2) marital.status=Divorced,Married-spouse-absent,Never-married,Separated,Widowed 17562 1139 <=50K (0.93514406 0.06485594) *
      3) marital.status=Married-AF-spouse,Married-civ-spouse 14999 6702 <=50K (0.55317021 0.44682979)
        6) education.num< 12.5 10526 3484 <=50K (0.66901007 0.33098993) *
        7) education.num>=12.5 4473 1255 >50K (0.28057232 0.71942768) *

Node 2: marital status is divorced, spouse absent, separated, etc.
Node 3: marital status is married with a spouse.
Nodes 6 and 7: split on years of education.

When capital gain and loss are left out, income class is most strongly related to marital status and years of education.

2.4 Ensemble methods: adaboost analysis

    library(adabag)
    b4 = adaboost.M1(class ~ ., data=w, mfinal=15, maxdepth=5)
    b4.pred <- predict.boosting(b4, newdata=w); b4.pred[-1]    # training set
    b5.pred <- predict.boosting(b4, newdata=wt); b5.pred[-1]   # test set
    barplot(b4$importance)
    b4$importance

Training set:

                   Observed Class
    Predicted Class  <=50K   >50K
              <=50K  23851   3904
              >50K     869   3937
    $error
    [1] 0.1465864

Test set:

                   Observed Class
    Predicted Class  <=50K.  >50K.
              <=50K.  12435   3846
    $error
    [1] 0.2362263

Every test-set observation is classified as <=50K.

2.4 Ensemble methods: adaboost analysis (mfinal = 25)

    library(mlbench)
    b6 = adaboost.M1(class ~ ., data=w, mfinal=25, maxdepth=5)
    b6.pred <- predict.boosting(b6, newdata=w); b6.pred[-1]
    b7.pred <- predict.boosting(b6, newdata=wt); b7.pred[-1]
    barplot(b6$importance)
    b6$importance

Training set:

                   Observed Class
    Predicted Class  <=50K   >50K
              <=50K  23465   3366
              >50K    1255   4475
    $error
    [1] 0.1419182

Test set:

                   Observed Class
    Predicted Class  <=50K.  >50K.
              <=50K.  12435   3846
    $error
    [1] 0.2362263

The test set is still classified entirely as <=50K. Raising mfinal to 25 lowers the training-set misclassification rate somewhat, but the difference is small.

2.4 Ensemble methods: bagging analysis

    b8 = bagging(class ~ ., data=w, mfinal=25, maxdepth=5)
    b8.pred = predict.bagging(b8, newdata=w); b8.pred[-1]
    b9.pred = predict.bagging(b8, newdata=wt); b9.pred[-1]
    barplot(b8$importance)
    b8$importance

Training set:

                   Observed Class
    Predicted Class  <=50K   >50K
              <=50K  23473   3816
              >50K    1247   4025
    $error
    [1] 0.1554928

Test set:

                   Observed Class
    Predicted Class  <=50K.  >50K.
              <=50K.  12435   3846
    $error
    [1] 0.2362263

The test set is still classified entirely as <=50K. Compared with adaboost, the training-set misclassification rate is higher.

2.4 Ensemble methods: random forest analysis: program

    > b10 = randomForest(class ~ ., data=w, importance=TRUE)
    Error in randomForest.default(m, y, ...) :
      Can not handle categorical predictors with more than 32 categories.

randomForest cannot handle categorical predictors with more than 32 categories. The native.country variable has 41 categories, so it is removed.

    > b10 = randomForest(class ~ age + workclass + fnlwgt + education + education.num +
          marital.status + occupation + relationship + race, data=w, importance=TRUE)  # native.country removed
    Error: cannot allocate vector of size 248.4 Mb
    In addition: Warning messages:
    1: In as.vector(data) :
      Reached total allocation of 1023Mb: see help(memory.size)
    …

The training set is large (32561 observations), so the computation exceeds the available memory and cannot be completed.

2.4 Ensemble methods: random forest analysis: output

    par(mfrow=c(1,2)); for (i in 1:2) barplot(t(importance(b11))[i,], s=0.7)

2.5 Nearest-neighbor analysis: results

Test set wt classification results (1/10 sample)

(rows: actual, columns: predicted, following the table(wt[test1,]$class, a$fit) call in the program)

                 Predicted <=50K   Predicted >50K
Actual <=50K                1222               46
Actual >50K                  282               78

Misclassification rate: 0.201474

The misclassification rate is 0.201474, which is better than the test-set results of the earlier ensemble analyses on this imbalanced data.

2.6 Artificial neural network analysis: results

size   Training set w misclassification rate   Test set wt misclassification rate
2      0.24081                                 0.23623
3      0.24029                                 0.23518
4      0.24075                                 0.23610
5      0.24035                                 0.23543
6      0.18648                                 0.19102
7      0.19757                                 0.19692
8      0.19554                                 0.19403
9      0.19588                                 0.19471

The table lists the misclassification rates on training set w and test set wt for 2 to 9 hidden-layer nodes. With 10 nodes the model cannot be fitted because of "too many (1123) weights". With 6 nodes, both the training and the test misclassification rates are at their minimum.

6 nodes, training set w classification results (rows: actual, columns: predicted; the row sums 24720 and 7841 are the true class counts)

                 Predicted <=50K   Predicted >50K
Actual <=50K               22037             2683
Actual >50K                 3389             4452

Misclassification rate: 0.186481

6 nodes, test set wt classification results (rows: actual, columns: predicted)

                 Predicted <=50K   Predicted >50K
Actual <=50K               11038             1397
Actual >50K                 1713             2133

Misclassification rate: 0.191020

2.7 Support vector machine analysis: results

Training set w classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           23299          3307
Predicted >50K             1421          4534

Misclassification rate: 0.145204

Test set wt classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           11714          1662
Predicted >50K              721          2184

Misclassification rate: 0.146367

Both the training and the test misclassification rates are lower than before, but the difference is small.

2.8 Association rule analysis: program

    rules = apriori(ww, parameter=list(support=0.01, confidence=0.6))
    inspect(rules[1:10])
    # sort() replaces the older SORT() used in the original slides
    x = subset(rules, subset = rhs %in% "class=>50K" & lift > 1.2)
    inspect(sort(x, by="lift")[1:5])
    x = subset(rules, subset = rhs %in% "class=<=50K" & lift > 1.2)
    inspect(sort(x, by="lift")[1:5])

2.8 Association rule analysis: results

      lhs                                 rhs               support    confidence lift
    1 {education=11th,
       hours.per.week=Part-time}       => {class=<=50K}     0.01142471 1          1.317193
    2 {age=Young,
       hours.per.week=Part-time}       => {class=<=50K}     0.05730782 1          1.317193
    3 {education=11th,
       education.num=low,
       hours.per.week=Part-time}       => {class=<=50K}     0.01142471 1          1.317193
    4 {education=11th,
       hours.per.week=Part-time,
       native.country=United-States}   => {class=<=50K}     0.01074906 1          1.317193
    5 {education=11th,
       capital.gain=None,
       hours.per.week=Part-time}       => {class=<=50K}     0.01114831 1          1.317193

Rule 1: people with an 11th-grade education who work part-time generally earn less than 50K a year.
Rule 2: young part-time workers generally earn less than 50K.
Rule 3: part-time workers with a low level of education generally earn less than 50K.
Rule 4: US natives with an 11th-grade education who work part-time generally earn less than 50K.
Rule 5: part-time workers with an 11th-grade education and no capital gain generally earn less than 50K.

Characteristics of lower earners: no stable job, a low level of education, and no capital gain.
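The lift reported for these rules can be cross-checked by hand: lift is the rule's confidence divided by the overall share of the right-hand-side class. The following is a sketch in base R, assuming the root node of the rpart output (n = 32561, loss = 7841) gives the class counts of the same training set. The function name `rule_lift` is hypothetical and is not part of arules.

```r
# Hypothetical helper: lift = confidence / P(rhs).
rule_lift <- function(confidence, rhs_share) {
  confidence / rhs_share
}

# Class shares taken from the rpart root node:
# 32561 observations, 7841 of which are not in the majority class <=50K.
n_total <- 32561
n_gt50k <- 7841
p_le50k <- (n_total - n_gt50k) / n_total  # share of class <=50K

# A rule with confidence 1 and rhs {class=<=50K}
lift_le50k <- rule_lift(1, p_le50k)
print(round(p_le50k, 8))     # 0.75919044
print(round(lift_le50k, 6))  # 1.317193
```

Because every listed rule has confidence 1, they all share the same lift; a lift above 1 only says the rule's left-hand side makes the class more likely than its baseline share.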

Part 2

babiesI data

2. Analysis of variance

    baby = read.table("e:/babiesI.txt", header=TRUE, sep=",")
    summary(baby)
    aov1 <- aov(bwt ~ as.factor(smoke), baby)
    summary(aov1)

                       Df Sum Sq Mean Sq F value    Pr(>F)
    as.factor(smoke)    2  23911 11955.5  38.109 < 2.2e-16 ***
    Residuals        1233 386811   313.7
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is far below 0.01, so the null hypothesis is rejected: the mother's smoking status is significantly associated with the baby's birth weight.

3. Regression analysis: output

    Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
    (Intercept)       123.0472     0.6502 189.237   <2e-16 ***
    as.factor(smoke)1  -8.9377     1.0349  -8.636   <2e-16 ***
    as.factor(smoke)9   3.6528     5.6386   0.648    0.517
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 17.71 on 1233 degrees of freedom
    Multiple R-squared: 0.05822, Adjusted R-squared: 0.05669
    F-statistic: 38.11 on 2 and 1233 DF, p-value: < 2.2e-16

Two dummy variables are introduced for smoke. R-squared is 0.05822, which is small: smoke explains only about 5.8% of the variation in birth weight. The F-statistic of 38.11 is large and its p-value is below 0.05, so the relationship between the two variables is statistically significant. The intercept is 123.0472 and is significant at the 0.1% level; the coefficient of as.factor(smoke)1 is -8.9377 and is significant at the 0.1% level; the coefficient of as.factor(smoke)9 is not significant.

4. Contingency table analysis: program

    # 88.18 ounces is about 2500 g; below 2500 g is low birth weight, normal birth weight is 2500-4000 g
    baby[["bwt"]] <- ordered(cut(baby[["bwt"]], c(0, 88.18, 141.10, 180)),
                             labels=c("low", "middle", "high"))
    baby[["smoke"]] <- ordered(cut(baby[["smoke"]], c(-1, 0, 1, 9)),
                               labels=c("no", "yes", "unknown"))
    y1 = xtabs(~ bwt + smoke, data=baby)
    library(MASS); biplot(corresp(y1, nf=2))   # correspondence analysis plot
    chisq.test(y1)

5. Regression analysis: program (smoke=9 observations removed)

smoke=9 means that the smoking status is unknown, so the 10 observations with smoke=9 are removed here in order to examine the effect of the mother's smoking on birth weight. 1226 observations remain.

    baby = read.table("e:/babiesI.txt", header=TRUE, sep=",")
    aa = baby[baby[,2] != 9, ]   # drop the smoke=9 observations; the new data frame is aa
    nrow(aa); summary(aa)
    lm2 <- lm(bwt ~ as.factor(smoke), aa); summary(lm2)
    hist(lm2$residuals); qqnorm(lm2$residuals); qqline(lm2$residuals)
    shapiro.test(lm2$residuals)

5. Regression analysis: residual diagnostics (smoke=9 observations removed)

    > shapiro.test(lm2$residuals)
            Shapiro-Wilk normality test
    data:  lm2$residuals
    W = 0.9944, p-value = 0.0001463

The residual histogram and the QQ plot suggest that the residuals are roughly normal, but the Shapiro-Wilk test rejects normality.

6. Contingency table analysis: results (smoke=9 observations removed)

    > chisq.test(y2)
            Pearson's Chi-squared test
    data:  y2
    X-squared = 21.8115, df = 2, p-value = 1.835e-05

The chi-squared test shows a significant association between the mother's smoking status and the baby's birth weight. The correspondence analysis plot shows that babies of smoking mothers tend to be lighter at birth than babies of non-smoking mothers.

1. Data overview

The data come from the 1994 census, filtered by the conditions age > 16, AGI > 100, AFNLWGT > 1 and hours worked per week > 0.
There are 48842 observations in total: 32561 in the training set and 16281 in the test set.
There are 15 variables: 6 continuous and 9 nominal.
Source: /ml/datasets/Adult
Task: predict whether a person's income exceeds 50K a year.

Data preview

      age workclass        fnlwgt education education.num marital.status
    1  39 State-gov         77516 Bachelors            13 Never-married
    2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
    3  38 Private          215646 HS-grad               9 Divorced
    4  53 Private          234721 11th                  7 Married-civ-spouse
    5  28 Private          338409 Bachelors            13 Married-civ-spouse
      occupation        relationship  race  sex    capital.gain capital.loss
    1 Adm-clerical      Not-in-family White Male           2174            0
    2 Exec-managerial   Husband       White Male              0            0
    3 Handlers-cleaners Not-in-family White Male              0            0
    4 Handlers-cleaners Husband       Black Male              0            0
    5 Prof-specialty    Wife          Black Female            0            0
      hours.per.week native.country class
    1             40 United-States  <=50K
    2             13 United-States  <=50K
    3             40 United-States  <=50K
    4             40 United-States  <=50K
    5             40 Cuba           <=50K

2.1 Classification tree (rpart) analysis: output

    n= 32561
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    1) root 32561 7841 <=50K (0.75919044 0.24080956)
      2) relationship=Not-in-family,Other-relative,Own-child,Unmarried 17800 1178 <=50K (0.93382022 0.06617978)
        4) capital.gain< 7073.5 17482 872 <=50K (0.95012012 0.04987988) *
        5) capital.gain>=7073.5 318 12 >50K (0.03773585 0.96226415) *
      3) relationship=Husband,Wife 14761 6663 <=50K (0.54860782 0.45139218)
        6) education=10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,Assoc-acdm,Assoc-voc,HS-grad,Preschool,Some-college 10329 3456 <=50K (0.66540807 0.33459193)
          12) capital.gain< 5095.5 9807 2944 <=50K (0.69980626 0.30019374) *
          13) capital.gain>=5095.5 522 10 >50K (0.01915709 0.98084291) *
        7) education=Bachelors,Doctorate,Masters,Prof-school 4432 1225 >50K (0.27639892 0.72360108) *

Node 2: relationship is unmarried, own child, not in family, or other relative.
Node 3: relationship is husband or wife.
Node 6: lower education. Node 7: higher education.
Nodes 4 and 5: capital gain below or above 7074.
Nodes 12 and 13: capital gain below or above 5096.

2.1 Classification tree (rpart) analysis: conclusion

Whether annual income exceeds 50K is related to a person's role in the family, education, and capital gain.
Husbands and wives tend to have higher incomes; the higher the education, the higher the income tends to be; the higher the capital gain, the higher the income tends to be.
Whether a person's annual income exceeds 50K can therefore be judged from three variables: relationship, education, and capital gain.

2.2 Classification tree (rpart) analysis: output (variable selection 1)

    n= 32561
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
    1) root 32561 7841 <=50K (0.75919044 0.24080956)
      2) marital.status=Divorced,Married-spouse-absent,Never-married,Separated,Widowed 17562 1139 <=50K (0.93514406 0.06485594)
        4) capital.gain< 7139.5 17252 840 <=50K (0.95130999 0.04869001) *
        5) capital.gain>=7139.5 310 11 >50K (0.03548387 0.96451613) *
      3) marital.status=Married-AF-spouse,Married-civ-spouse 14999 6702 <=50K (0.55317021 0.44682979)
        6) education.num< 12.5 10526 3484 <=50K (0.66901007 0.33098993)
          12) capital.gain< 5095.5 9998 2967 <=50K (0.70324065 0.29675935) *
          13) capital.gain>=5095.5 528 11 >50K (0.02083333 0.97916667) *
        7) education.num>=12.5 4473 1255 >50K (0.28057232 0.71942768) *

Node 2: marital status is divorced, spouse absent, widowed, etc.
Node 3: marital status is married with a spouse.
Node 6: lower education. Node 7: higher education.

Whether annual income exceeds 50K is related to marital status, education, and capital gain.

2.3 Classification tree (rpart) analysis: program (variable selection 2)

capital.gain and capital.loss are themselves closely tied to the income class. To bring out the relationship between the remaining variables and the income class, this analysis excludes capital.gain and capital.loss.

    (b2 = rpart(class ~ age + workclass + education.num + marital.status + occupation +
                  race + sex + hours.per.week + native.country, w)); b2
    plot(b2); text(b2, use.n=TRUE)
    table(predict(b2, w, type="class"), w[["class"]])
    table(predict(b2, wt, type="class"), wt[["class"]])

Training set w classification results

(rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           23465          4623
Predicted >50K             1255          3218

Misclassification rate: 0.180523

Test set wt classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           11811          2273
Predicted >50K              624          1573

Misclassification rate: 0.177937

2.3 Classification tree (rpart) analysis: results (variable selection 2)

Compared with the earlier analyses, both the training and the test misclassification rates are higher, because the capital gain and capital loss information is no longer used.

2.4 Ensemble methods: adaboost analysis (mfinal = 15)

    > b4$importance
               age      workclass         fnlwgt      education  education.num
         11.764706       0.000000       0.000000      15.294118       1.176471
    marital.status     occupation   relationship           race            sex
          7.058824      12.941176       9.411765       0.000000       0.000000
      capital.gain   capital.loss hours.per.week native.country
         24.705882       9.411765       8.235294       0.000000

The most important variables are capital.gain, education, occupation, and age.

2.4 Ensemble methods: adaboost analysis (mfinal = 25)

    > b6$importance
               age      workclass         fnlwgt      education  education.num
        10.6666667      0.0000000      0.0000000     12.6666667      2.0000000
    marital.status     occupation   relationship           race            sex
         6.6666667     12.6666667      9.3333333      0.0000000      0.0000000
      capital.gain   capital.loss hours.per.week native.country
        26.0000000     10.0000000      9.3333333      0.6666667

The most important variables are capital.gain, occupation, education, age, and capital.loss.

2.4 Ensemble methods: bagging analysis

    > b8$importance
               age      workclass         fnlwgt      education  education.num
          0.000000       0.000000       0.000000      18.518519       3.703704
    marital.status     occupation   relationship           race            sex
          0.000000       3.703704      23.148148       0.000000       0.000000
      capital.gain   capital.loss hours.per.week native.country
         49.074074       1.851852       0.000000       0.000000

The important variables differ somewhat from those found by adaboost: capital.gain, relationship, education, occupation, education.num, and capital.loss.

2.4 Ensemble methods: random forest analysis: program

The model can be fitted when one third of w is drawn as the training sample.

    m = nrow(w); m
    set.seed(1); samp = sample(1:m, floor(m/3))   # random third of the training set
    b11 = randomForest(class ~ age + workclass + fnlwgt + education + education.num +
                         marital.status + occupation + relationship + race,
                       data=w[samp,], importance=TRUE)
    table(predict(b11, w, type="class"), w[["class"]])
    table(predict(b11, wt, type="class"), wt[["class"]])

w classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           23161          2524
Predicted >50K             1559          5317

Misclassification rate: 0.125395

wt classification results (rows: predicted, columns: actual)

                   Actual <=50K   Actual >50K
Predicted <=50K           11338          1682
Predicted >50K             1097          2164

Misclassification rate: 0.170690

2.5 Nearest-neighbor analysis: program

Training set w has 32561 observations and test set wt has 16281. With this many observations R has difficulty fitting the model, so one tenth of each set is used.

    library(kknn)
    w  = read.table("e:/adult.txt", header=TRUE, sep=",")
    wt = read.table("e:/adulttest.txt", header=TRUE, sep=",")
    n  = nrow(w);  set.seed(1); test  = sample(1:n, n/10)     # 1/10 of the training set
    n1 = nrow(wt); set.seed(2); test1 = sample(1:n1, n1/10)   # 1/10 of the test set
    a = kknn(class ~ ., w[test,], wt[test1,])
    table(wt[test1,]$class, a$fit)                            # rows: actual, columns: predicted

2.6 Artificial neural network analysis: program

    library(nnet); library(mlbench)
    w  = read.table("e:/adult.txt", header=TRUE, sep=",")
    wt = read.table("e:/adulttest.txt", header=TRUE, sep=",")
    w.nn1 = nnet(class ~ ., data=w, size=2, rang=0.1, decay=5e-4, maxit=1000)
    table(w$class, predict(w.nn1, w, type="class"))     # rows: actual, columns: predicted
    table(wt$class, predict(w.nn1, wt, type="class"))

2.7 Support vector machine analysis: program

    library(mlbench); library(e1071)
    w  = read.table("e:/adult.txt", header=TRUE, sep=",")
    wt = read.table("e:/adulttest.txt", header=TRUE, sep=",")
    ww = rbind(w, wt); summary(ww)
    # The slides spell this argument "kernal". With that spelling svm() would most likely
    # ignore it and use its default kernel, so the reported results may reflect the default.
    model <- svm(class ~ ., data=ww[1:32561,], kernel="sigmoid")
    pred.train <- fitted(model)
    (r1 = table(pred.train, ww$class[1:32561]))
    pred.test <- predict(model, ww[32562:48842, -15])
    (r2 = table(pred.test, ww$class[32562:48842]))

2.8 Association rule analysis: program

    library(arules)
    w = read.table("e:/adult.txt", header=TRUE, sep=",")
    summary(w)
    w[["fnlwgt"]] <- NULL
    # convert age to a categorical variable
    w[["age"]] <- ordered(cut(w[["age"]], c(15, 25, 45, 65, 100)),
                          labels=c("Young", "Middle-aged", "Senior", "Old"))
    w[["education.num"]] <- ordered(cut(w[["education.num"]], c(0, 9, 13, 16)),
                                    labels=c("low", "Middle", "up"))
    w[["hours.per.week"]] <- ordered(cut(w[["hours.per.week"]], c(0, 25, 40, 60, 168)),
                                     labels=c("Part-time", "Full-time", "Over-time", "Workaholic"))
    w[["capital.gain"]] <- ordered(cut(w[["capital.gain"]],
                                       c(-Inf, 0, median(w[["capital.gain"]][w[["capital.gain"]] > 0]), Inf)),
                                   labels=c("None", "Low", "High"))
    w[["capital.loss"]] <- ordered(cut(w[["capital.loss"]],
                                       c(-Inf, 0, median(w[["capital.loss"]][w[["capital.loss"]] > 0]), Inf)),
                                   labels=c("none", "low", "high"))
    ww <- as(w, "transactions")   # convert to transaction data

2.8 Association rule analysis: results (rules with rhs class=>50K)

      lhs                                     rhs              support    confidence lift
    1 {age=Middle-aged,
       marital.status=Married-civ-spouse,
       capital.gain=High,
       native.country=United-States}       => {class=>50K}     0.01087190 1.0000000  4.152659
    2 {age=Middle-aged,
       sex=Male,
       capital.gain=High,
       native.country=United-States}       => {class=>50K}     0.01237677 0.9975248  4.142380
    3 {age=Middle-aged,
       marital.status=Married-civ-spouse,
       capital.gain=High}                  => {class=>50K}     0.01176254 0.9973958  4.141845
    4 {age=Middle-aged,
       marital.status=Married-civ-spouse,
       capital.gain=High,
       capital.loss=none}                  => {class=>50K}     0.01176254 0.9973958  4.141845
    5 {workclass=Private,
       relationship=Husband,
       race=White,
       capital.gain=High}                  => {class=>50K}     0.01170111 0.9973822  4.141788

Rule 1: middle-aged US natives who are married to a civilian spouse and have a high capital gain generally earn more than 50K a year.
Middle-aged, capital…
