Data Mining Experiment Report



For personal use only in study and research; not for commercial use.

USTC Data Mining Lab Report

Name: 樊濤聲
Class: Software Design Class 1
Student ID: SA15226248

Experiment 1: k-Nearest Neighbors (kNN)

I. Experiment content

Use the k-nearest neighbors algorithm to improve matching on a dating website. Helen uses a dating site to look for suitable partners, and the site recommends various candidates. She has sorted the people she dated into three types:

1) people she did not like
2) people of average charm
3) people of great charm

Although she has found these patterns, she still cannot place the candidates recommended by the site into the right category. The kNN algorithm is used to help her assign each match to the correct class.

II. Requirements

(1) Complete the kNN experiment independently and achieve a basic working prediction.
(2) Write the lab report.
(3) Open-ended: you may add data or modify the algorithm to obtain better classification.

III. Procedure

(1) Data source

The data file provided for the experiment is datingTestSet.txt. It has four columns, whose attributes are: percentage of time spent playing video games; frequent flier miles earned per year; liters of ice cream consumed per year; and your attitude towards this person. By analyzing the data and finding the patterns in it, Helen's attitude towards a person can be predicted from that person's first three attributes.

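(2) kNN algorithm principle

For each point in the data set whose class is unknown, perform the following steps in turn:

1. Compute the distance between the current point and every point in the labeled data set.
2. Sort the points by increasing distance.
3. Take the k points closest to the current point.
4. Count how often each class occurs among these k points.
5. Return the most frequent class among the k points as the predicted class of the current point.

(3) kNN implementation

1) Building the classifier in Python

First compute the Euclidean distances, then select the k nearest points. The code is as follows:

    import operator
    from numpy import array, shape, tile, zeros

    def classify(inMat, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        # The core of kNN is the Euclidean distance. The next three lines compute
        # the distance from the point to classify to every point in the training set.
        diffMat = tile(inMat, (dataSetSize, 1)) - dataSet
        sqDiffMat = diffMat ** 2
        distance = sqDiffMat.sum(axis=1) ** 0.5
        # Count the class votes of the k nearest points.
        sortedDistIndicies = distance.argsort()
        classCount = {}
        for i in range(k):
            labelName = labels[sortedDistIndicies[i]]
            classCount[labelName] = classCount.get(labelName, 0) + 1
        sortedClassCount = sorted(classCount.items(),
                                  key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

2) Parsing the data

Given a file name, convert the data in the file into a sample matrix for easier processing. The code is as follows:

    def file2Mat(testFileName, parammterNumber):
        fr = open(testFileName)
        lines = fr.readlines()
        lineNums = len(lines)
        resultMat = zeros((lineNums, parammterNumber))
        classLabelVector = []
        for i in range(lineNums):
            line = lines[i].strip()
            itemMat = line.split('\t')
            resultMat[i, :] = itemMat[0:parammterNumber]
            classLabelVector.append(itemMat[-1])
        fr.close()
        return resultMat, classLabelVector

The first three columns are written into the two-dimensional array resultMat, and the fourth column is written into classLabelVector as the labels.

3) Normalizing the data

Different indicators often have different dimensions and units, which distorts the result of the analysis. To remove the effect of scale, the data are normalized so that all indicators are of the same order of magnitude; each value x becomes (x - min) / (max - min) of its column. The procedure is as follows:

    def autoNorm(dataSet):
        minVals = dataSet.min(0)
        maxVals = dataSet.max(0)
        ranges = maxVals - minVals
        normMat = zeros(shape(dataSet))
        size = normMat.shape[0]
        normMat = dataSet - tile(minVals, (size, 1))
        normMat = normMat / tile(ranges, (size, 1))
        return normMat, minVals, ranges

4) Testing

Before kNN is used for prediction, usually only 90% of the available data is supplied as training samples, and the remaining 10% is used to test the classifier. Note that the 10% test data is chosen at random. The classifier's performance is measured by its error rate.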
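The random 90/10 hold-out described above is not shown in the report's own code, which reads the training and test sets from two separate files. A minimal sketch of such a split, using only the standard library, might look like the following. The names holdout_split and error_rate, the test_ratio parameter and the fixed seed are assumptions made for illustration, not part of the original experiment.

```python
import random

def holdout_split(samples, labels, test_ratio=0.1, seed=0):
    """Randomly split (sample, label) pairs into a training and a test part."""
    indices = list(range(len(samples)))
    # Shuffle the indices so the test samples are picked at random.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_ratio))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test

def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

if __name__ == "__main__":
    # Toy data standing in for the three attributes and the label.
    samples = [[i, 2 * i, 3 * i] for i in range(200)]
    labels = ["didntLike"] * 200
    train, test = holdout_split(samples, labels)
    print(len(train), len(test))
```

Fixing the seed keeps the split reproducible between runs, which makes error rates comparable when the classifier is changed.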

An error rate that is too high indicates a problem with the data source; in that case the suitability of the data source should be reconsidered. The test routine is as follows:

    def test(trainigSetFileName, testFileName):
        trianingMat, classLabel = file2Mat(trainigSetFileName, 3)
        trianingMat, minVals, ranges = autoNorm(trianingMat)
        testMat, testLabel = file2Mat(testFileName, 3)
        testSize = testMat.shape[0]
        errorCount = 0.0
        for i in range(testSize):
            # Normalize each test sample with the training set's min and range.
            result = classify((testMat[i] - minVals) / ranges,
                              trianingMat, classLabel, 3)
            if result != testLabel[i]:
                errorCount += 1.0
        errorRate = errorCount / float(len(testLabel))
        return errorRate

5) Predicting with kNN

If the error rate from step 4 is within an acceptable range, the data source can be used for prediction. After the first three attributes were entered, the class was predicted fairly accurately. The code is as follows:

    def classifyPerson():
        """Input a person, decide like or not, then update the DB."""
        resultlist = ['not at all', 'little doses', 'large doses']
        percentTats = float(input('input the person percentage of time playing video games:'))
        ffMiles = float(input('flier miles in a year:'))
        iceCream = float(input('amount of iceCream consumed per year:'))
        datingDataMat, datingLabels = file2Mat('datingTestSet.txt', 3)
        # autoNorm returns (normMat, minVals, ranges), in that order.
        normMat, minVals, ranges = autoNorm(datingDataMat)
        normPerson = (array([ffMiles, percentTats, iceCream]) - minVals) / ranges
        result = classify(normPerson, normMat, datingLabels, 3)
        print('you will probably like this guy in: ', result)  # resultlist[result - 1]
        # update the datingTestSet
        print('update dating DB')
        tmp = '\t'.join([repr(ffMiles), repr(percentTats),
                         repr(iceCream), repr(result)]) + '\n'
        with open('datingTestSet2.txt', 'a') as fr:
            fr.write(tmp)

IV. Results and analysis

Running the command python KNN.py in a terminal executes KNN.py and yields both the error rate on the test samples and the kNN prediction for the data entered.

The results show that 200 samples were tested in total, of which 16 predictions did not match the actual label, giving an error rate of 8%, which is within the acceptable range. Because the test set was chosen at random, this figure is fairly credible. For the input values 900, 40 and 80, the classification result is didntLike, which agrees with the classification of similar records in the data set.

[Screenshot: terminal output of KNN.py. For each test sample it prints a line of the form "the classifier came back with: smallDoses, the real answer is: smallDoses", followed by "the total error rate is: 8.00%" and a test count of 200. The interactive session then prompts for the three attributes and prints "you will probably like this guy in: didntLike" and "update dating DB".]

Experiment 2: Group Experiment

I. Experiment content

This experiment applies frequent-itemset mining to the co-authorship data of DBLP. DBLP indexes the vast majority of papers published by scholars at home and abroad, and stores them in the file dblp.xml, classified by publication type. Finding the frequent itemsets in this data reveals which authors often publish papers together.
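To make the notion of a frequent itemset of co-authors concrete, the sketch below counts, for a handful of made-up papers, in how many papers each pair of authors appears together, and keeps the pairs that reach a minimum support. This is plain brute-force pair counting for illustration only, not the FP-Tree method used later in the report; the function names and the toy author lists are assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_support(papers):
    """Count in how many papers each pair of authors appears together."""
    counts = Counter()
    for authors in papers:
        # sorted(set(...)) gives each pair one canonical order per paper.
        for pair in combinations(sorted(set(authors)), 2):
            counts[pair] += 1
    return counts

def frequent_pairs(papers, min_support):
    """Keep only the author pairs whose support reaches min_support."""
    return {pair: n for pair, n in pair_support(papers).items()
            if n >= min_support}

if __name__ == "__main__":
    # Hypothetical author lists, one list per paper.
    papers = [
        ["A", "B", "C"],
        ["A", "B"],
        ["B", "C"],
        ["A", "B", "D"],
    ]
    print(frequent_pairs(papers, min_support=2))
```

Counting every pair explicitly becomes expensive on a data set the size of DBLP and does not extend well to larger author groups, which is the motivation for a tree-based method such as FP-Tree.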

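II. Requirements

(1) Collect and preprocess the DBLP data set, extracting the names of authors and their co-authors.
(2) Mine the co-authors with a frequent-itemset mining algorithm.
(3) Write the lab report.

III. Procedure

(1) Extracting author information from the DBLP data set

First download the DBLP data set dblp.xml.gz from the official website and decompress it to obtain dblp.xml. Opening dblp.xml in vim shows that all author information is contained in the following elements: article, inproceedings, proceedings, book, incollection, phdthesis, mastersthesis, www. Python's built-in XML parser is used to parse the file. The code is as follows.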

The core idea is that the handler sets flag = 1 when the parser enters an author element, writes the element's text to the output file, and resets flag = 0 when it leaves the element. When one of the publication elements listed above is closed, the current line is ended. The result is the file authors.txt, with the authors of one publication per line.

getAuthor.py

    import codecs
    from xml.sax import handler, make_parser

    paper_tag = ('article', 'inproceedings', 'proceedings', 'book',
                 'incollection', 'phdthesis', 'mastersthesis', 'www')

    class mHandler(handler.ContentHandler):
        def __init__(self, result):
            handler.ContentHandler.__init__(self)
            self.result = result
            self.flag = 0

        def startDocument(self):
            print('Document Start')

        def endDocument(self):
            print('Document End')

        def startElement(self, name, attrs):
            if name == 'author':
                self.flag = 1

        def endElement(self, name):
            if name == 'author':
                self.result.write(',')
                self.flag = 0
            if name in paper_tag:
                self.result.write('\r\n')

        def characters(self, chrs):
            if self.flag:
                self.result.write(chrs)

    def parserDblpXml(source, result):
        handler = mHandler(result)
        parser = make_parser()
        parser.setContentHandler(handler)
        parser.parse(source)

    if __name__ == '__main__':
        source = codecs.open('dblp.xml', 'r', 'utf-8')
        result = codecs.open('authors.txt', 'w', 'utf-8')
        parserDblpXml(source, result)
        result.close()
        source.close()

(2) Building the author ID index

Read the file authors.txt produced in step 1, assign each distinct name an ID in order of first appearance, and store the names in authors_index.txt. At the same time, write the encoded co-author lists to authors_encoded.txt. The code is as follows:

encoded.py

    import codecs

    source = codecs.open('authors.txt', 'r', 'utf-8')
    result = codecs.open('authors_encoded.txt', 'w', 'utf-8')
    index = codecs.open('authors_index.txt', 'w', 'utf-8')
    index_dic = {}
    name_id = 0
    # build an index_dic, key -> authorName, value => [id, count]
    for line in source:
        name_list = line.split(',')
        for name in name_list:
            if not (name == '\r\n'):
                if name in index_dic:
                    index_dic[name][1] += 1
                else:
                    index_dic[name] = [name_id, 1]
                    index.write(name + u'\r\n')
                    name_id += 1
                result.write(str(index_dic[name][0]) + u',')
        result.write('\r\n')
    source.close()
    result.close()
    index.close()

(3) Building the FP-Tree and obtaining the frequent itemsets

The principle of the FP-Tree algorithm is not covered in detail here. Its core idea has two steps: first scan the database to build the FP-Tree, then recursively generate conditional pattern trees from the tree, tracing upward to find the frequent itemsets. The listing relies on a treeNode class and a findPrefixPath helper that are not reproduced in this report. The code is as follows:

    def createTree(dataSet, minSup=1):  # create FP-tree from dataset but don't mine
        freqDic = {}
        # go over dataSet twice
        for trans in dataSet:  # first pass counts frequency of occurrence
            for item in trans:
                freqDic[item] = freqDic.get(item, 0) + dataSet[trans]
        headerTable = {k: v for (k, v) in freqDic.items() if v >= minSup}
        if len(headerTable) == 0:
            return None, None  # if no items meet min support, get out
        for k in headerTable:
            headerTable[k] = [headerTable[k], None]  # reformat headerTable to use node link
        # print('headerTable: ', headerTable)
        retTree = treeNode('Null Set', 1, None)  # create tree
        for tranSet, count in dataSet.items():  # go through dataset 2nd time
            localD = {}
            for item in tranSet:  # put transaction items in order
                if headerTable.get(item, 0):
                    localD[item] = headerTable[item][0]
            if len(localD) > 0:
                orderedItems = [v[0] for v in sorted(localD.items(),
                                                     key=lambda p: p[1], reverse=True)]
                # populate tree with ordered freq itemset
                updateTree(orderedItems, retTree, headerTable, count)
        return retTree, headerTable  # return tree and header table

    def updateTree(items, inTree, headerTable, count):
        if items[0] in inTree.children:  # check if orderedItems[0] in retTree.children
            inTree.children[items[0]].inc(count)  # increment count
        else:  # add items[0] to inTree.children
            inTree.children[items[0]] = treeNode(items[0], count, inTree)
            if headerTable[items[0]][1] is None:  # update header table
                headerTable[items[0]][1] = inTree.children[items[0]]
            else:
                updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
        if len(items) > 1:  # call updateTree() with remaining ordered items
            updateTree(items[1:], inTree.children[items[0]], headerTable, count)

    def updateHeader(nodeToTest, targetNode):  # this version does not use recursion
        while nodeToTest.nodeLink is not None:  # do not use recursion to traverse a linked list!
            nodeToTest = nodeToTest.nodeLink
        nodeToTest.nodeLink = targetNode

    def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
        # sort header table by count, so mining starts from the bottom of the table
        bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1][0])]
        for basePat in bigL:
            newFreqSet = preFix.copy()
            newFreqSet.add(basePat)
            # print('finalFrequent Item: ', newFreqSet)  # append to set
            if len(newFreqSet) > 1:
                freqItemList[frozenset(newFreqSet)] = headerTable[basePat][0]
            condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
            myCondTree, myHead = createTree(condPattBases, minSup)
            # print('head from conditional tree: ', myHead)
            if myHead is not None:  # 3. mine cond. FP-tree
                # print('conditional tree for: ', newFreqSet)
                # myCondTree.disp(1)
                mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)
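The FP-Tree listing above calls a treeNode class and a findPrefixPath function that the report does not show. The sketch below is one plausible shape for them, inferred only from how they are used in the listing (the constructor arguments, children, nodeLink and inc). The attribute names name, count and parent and the helper ascendTree are assumptions modeled on common FP-growth implementations, not the author's code.

```python
class treeNode:
    """One node of an FP-tree (hypothetical reconstruction)."""

    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue      # item stored in this node
        self.count = numOccur      # number of transactions passing through
        self.nodeLink = None       # next node holding the same item
        self.parent = parentNode   # parent node, None for the root
        self.children = {}         # item -> child treeNode

    def inc(self, numOccur):
        self.count += numOccur

def ascendTree(leafNode, prefixPath):
    """Collect item names from leafNode up to, but excluding, the root."""
    while leafNode.parent is not None:
        prefixPath.append(leafNode.name)
        leafNode = leafNode.parent

def findPrefixPath(basePat, node):
    """Return the conditional pattern base of an item as {frozenset: count}.

    basePat is kept only to match the call in mineTree; the walk itself
    starts from node, the first tree node of that item, and follows nodeLink.
    """
    condPats = {}
    while node is not None:
        prefixPath = []
        ascendTree(node, prefixPath)
        if len(prefixPath) > 1:
            # prefixPath[0] is the item itself; the rest is its prefix.
            condPats[frozenset(prefixPath[1:])] = node.count
        node = node.nodeLink
    return condPats

if __name__ == "__main__":
    # Tiny hand-built tree: root -> a(3) -> b(2), and root -> b(1).
    root = treeNode('Null Set', 1, None)
    a = treeNode('a', 3, root)
    root.children['a'] = a
    b1 = treeNode('b', 2, a)
    a.children['b'] = b1
    b2 = treeNode('b', 1, root)
    root.children['b'] = b2
    b1.nodeLink = b2
    print(findPrefixPath('b', b1))
```

The dictionary returned by findPrefixPath has the same shape as the dataSet argument of createTree (itemset mapped to count), which is what allows mineTree to build conditional trees recursively.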

IV. Results and analysis

With a support threshold of 40 the result set turned out to be very large, more than 2,000 itemsets in total. To simplify the analysis, the threshold was raised to 100, which left 111 records. These were sorted by the joint support of the co-authors.

[Screenshot: part of the table of frequent co-author sets with their support values; the contents are not recoverable from the extracted text.]

[...]

The following listing resumes inside the k-means routine; helpers such as cal, getmasslist and caldistancefromarray are not reproduced in this report.

                [...] ssetemp):
                    ssetempold = ssetemp
                    numgroupold = numgroup
                print('-', end=' ')
            # Write to the file
            count = 0
            for groupmember in numgroupold:
                count += 1
                fresult.write('\n')
                fresult.write('-' * 500)
                fresult.write('\nGroup-%d(%d)' % (count, len(groupmember)))
                for num in groupmember:
                    fresult.write(' %s' % str(num))
                fresult.write('\n')
                fresult.write('-' * 500)
            fresult.write('\n')
            fresult.write('\n')
            ssey.append(ssetempold)
            # Calculate silhouette coefficient
            s = []  # silhouette values of all points, over all clusters
            for i in range(0, len(numgroupold)):
                for j in range(0, len(numgroupold[i])):
                    temp = []
                    for k in range(0, len(numgroupold)):
                        temp.append(caldistancefromarray(numgroupold[i][j], numgroupold[k]))
                    a = temp[i] / len(numgroupold[i])
                    sorted = tile(temp, 1).argsort()
                    if sorted[0] == i:
                        d = 1
                    else:
                        d = 0
                    while d < len(sorted):
                        if [...] > 0:
                            b = temp[sorted[d]] / len(numgroupold[sorted[d]])
                            break
                        else:
                            d += 1
                    if d == len(sorted):
                        s.append(0)
                        continue
                    c = (b - a) / max(a, b)
                    # print('size1 = %d i = %d j = %d sorted0 = %d sorted1 = %d a = %s b = %s c = %s'
                    #       % (len(numgroup[i]), i, j, sorted[0], sorted[1], str(a), str(b), str(c)))
                    s.append(c)
            scy.append(tile(s, 1).sum(axis=0) / len(s))
        fresult.close()
        plt.subplot(211)
        plt.plot(x, ssey)
        plt.ylabel('SSE')
        # plt.show()
        plt.subplot(212)
        plt.ylabel('SC')
        plt.xlabel('k')
        plt.plot(x, scy)
        plt.show()

(3) k-means++ algorithm principle

The basic idea behind the way k-means++ chooses its initial seeds is that the initial cluster centers should be as far apart from one another as possible. The steps are:

1. Randomly choose one point from the input data points as the first cluster center.
2. For each point x in the data set, compute its distance D(x) to the nearest cluster center (among the centers already chosen).
3. Choose a new data point as the next cluster center, such that a point with a larger D(x) has a higher probability of being chosen.
4. Repeat steps 2 and 3 until k cluster centers have been chosen.
5. Run the standard k-means algorithm with these k initial cluster centers.

(4) k-means++ implementation

The code is as follows (the listing is cut off in the source):

    def kmeansplus(data, mincluster, maxcluster):
        x = []
        ssey = []
        scy = []
        fresult = open(resultfile2, 'w')
        for i in range(mincluster, maxcluster + 1):
            print('\nk = %d' % i)
            ssetempold = 0
            ssetemp = 0
            numgroupold = []
            x.append(i)
            fresult.write(str(i) + ':\n')
            group = []
            numgroup = []
            masslist = getmasslist(data, i)
            for j in range(0, i):
                g = []
                group.append(g)
                g = []
                numgroup.append(g)
            ssetemp = cal(data, masslist, group, numgroup)
            numgroupold = numgroup
            count = 0
            for groupmember in numgroupold:
                count += 1
                fresult.write('\n')
                fresult.write('-' * 500)
                fresult.write('\nGroup-%d(%d)' % (count, len(groupmember)))
                for num in groupmember:
                    fresult.write(' %s' % str(num))
                fresult.write('\n')
                fresult.write('-' * 500)
            fresult.write('\n')
            fresult.write('\n')
            ssey.append(ssetemp)
            # Calculate silhouette coefficient
            s = []  # silhouette values of all points, over all clusters
            for i in range(0, len(numgroupold)):
                for j in range(0, len(numgroupold[i])):
                    temp = []
                    for k in range(0, len(numgroupold)):
                        temp.append(caldistancefromarray(numgroupold[i][j], numgroupold[k]))
                    a = temp[i] / len(numgroupold[i])
                    sorted = tile(temp, 1).argsort()
                    if sorted[0] == i:
                        d = 1
                    else:
                        d = 0
                    while d < len(sorted):
                        [...]
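The seeding steps listed under the k-means++ principle can be sketched as follows. The report's own seeding routine, getmasslist, is not shown, so this is not that routine: the function names, the seed parameter and the use of squared distance D(x)^2 as the sampling weight (the usual choice in k-means++) are assumptions made for illustration.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, seed=0):
    """Pick k initial cluster centers from points in the k-means++ manner."""
    rng = random.Random(seed)
    # Step 1: the first center is chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Step 2: D(x)^2, the squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a center; fall back to a uniform pick.
            centers.append(rng.choice(points))
            continue
        # Step 3: sample the next center with probability proportional to D(x)^2.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (10.2, 9.7), (20.0, 0.0)]
    print(kmeanspp_seeds(pts, 3))
```

A point that is already a center has weight zero and therefore is not drawn again while other points remain, which is what pushes the seeds apart before the standard k-means iterations start.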
