
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by Zhao Li, Spring 2009

Outline
- Introduction to Clustering
- Main Techniques in Clustering
- Hybrid Algorithm: BIRCH
- Example of the BIRCH Algorithm
- Experimental Results
- Conclusions

Clustering Introduction
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space.
Main methods:
- Partitioning: K-Means, ...
- Hierarchical: BIRCH, ROCK, ...
- Density-based: DBSCAN, ...
A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity.

Main Techniques (1): Partitioning Clustering (K-Means)
[Figure: a K-Means example over three slides with k = 3 centers.]
- Step 1: pick k initial centers.
- Step 2: assign each point to its closest center and compute the new centers after the 1st iteration.
- Step 3: compute the new centers after the 2nd iteration; repeat until the centers stop moving.
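A minimal sketch of these steps in pure NumPy (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate point assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: initial centers
    for _ in range(n_iter):
        # step 2: assign each point to its closest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # stop once the centers no longer move
            break
        centers = new_centers
    return centers, labels
```

A bad choice of initial centers can leave this loop in a poor local optimum, which is the initialization sensitivity noted in the Remarks below.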

Main Techniques (2): Hierarchical Clustering
Multilevel clustering: level 1 has n clusters and level n has one cluster, or upside down.
- Agglomerative HC: starts with singleton clusters and merges them (bottom-up).
- Divisive HC: starts with a single all-inclusive cluster and splits it (top-down).
- The merge sequence is recorded as a dendrogram.

Agglomerative HC Example (Nearest Neighbor)
[Figure: a sequence of slides merging the closest pair of clusters at each level.]
- Level 2: k = 7 clusters
- Level 3: k = 6 clusters

- Level 4: k = 5 clusters
- Level 5: k = 4 clusters
- Level 6: k = 3 clusters
- Level 7: k = 2 clusters
- Level 8: k = 1 cluster
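The nearest-neighbor merging above is single-linkage agglomerative clustering; a minimal sketch with SciPy (the data and the chosen cut are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(8, 2))  # 8 sample points

# Single linkage merges the two clusters whose closest members are nearest,
# exactly the nearest-neighbor rule used level by level above.
Z = linkage(X, method='single')

# Cut the dendrogram to obtain, e.g., k = 4 clusters (level 5 above).
labels = fcluster(Z, t=4, criterion='maxclust')
print(labels)
```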

Remarks

                  Partitioning Clustering            Hierarchical Clustering
Time complexity   O(n)                               O(n^2 log n)
Pros              Easy to use and relatively         Outputs a dendrogram, which is
                  efficient.                         desired in many applications.
Cons              Sensitive to initialization (a     Higher time complexity; needs
                  bad start may give bad results);   to store all data in memory.
                  needs to store all data in
                  memory.

Introduction to BIRCH
- Designed for very large data sets where time and memory are limited.
- Incremental and dynamic clustering of incoming objects.
- Only one scan of the data is necessary; the whole data set is not needed in advance.
- Two key phases:
  1. Scan the database to build an in-memory CF tree.
  2. Apply a clustering algorithm to cluster the leaf nodes.
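This two-phase design is what scikit-learn's Birch implements; a minimal usage sketch, with parameter values chosen for illustration rather than taken from the slides:

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(1000, 2))

# Phase 1 builds the in-memory CF tree; phase 2 clusters the leaf entries
# (here with agglomerative clustering into 3 final clusters).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

# Incremental use: feed the data one chunk at a time, matching the
# single-scan, incoming-objects setting described above.
online = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
for chunk in np.array_split(X, 10):
    online.partial_fit(chunk)
```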

Similarity Metric (1)
Given a cluster of N instances $\{X_i\}$, we define:
- Centroid: $X_0 = \frac{1}{N}\sum_{i=1}^{N} X_i$
- Radius (average distance from member points to the centroid): $R = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - X_0)^2}$
- Diameter (average pairwise distance within the cluster): $D = \sqrt{\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N}(X_i - X_j)^2}$

Similarity Metric (2)
Distances between two clusters with centroids $X_{01}$ and $X_{02}$:
- D0, centroid Euclidean distance: $D0 = \sqrt{(X_{01} - X_{02})^2}$
- D1, centroid Manhattan distance: $D1 = \sum_{d} |X_{01}^{(d)} - X_{02}^{(d)}|$
- D2, average inter-cluster distance between the two clusters
- D3, average intra-cluster distance of the merged cluster
- D4, variance increase distance caused by merging
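These definitions transcribe directly into NumPy; a small sketch (the data and the second cluster are illustrative):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(6, 2))  # a cluster of N = 6 points
N = len(X)

x0 = X.mean(axis=0)                               # centroid
R = np.sqrt(((X - x0) ** 2).sum(axis=1).mean())   # radius

# Diameter: average of squared distances over all N(N-1) ordered pairs.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
D = np.sqrt(sq.sum() / (N * (N - 1)))

# D0 between two clusters: Euclidean distance between their centroids.
Y = np.random.default_rng(1).normal(loc=3.0, size=(5, 2))
d0 = np.linalg.norm(x0 - Y.mean(axis=0))
```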

Clustering Feature
The BIRCH algorithm builds a dendrogram called a clustering feature tree (CF tree) while scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster and
$LS = \sum_{i=1}^{N} P_i, \qquad SS = \sum_{i=1}^{N} P_i^2$

Properties of Clustering Feature
- A CF entry is compact: it stores significantly less than all of the data points in the sub-cluster.
- A CF entry has sufficient information to calculate D0-D4.
- The additivity theorem, $CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2)$, allows us to merge sub-clusters incrementally and consistently.
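A minimal sketch of a CF entry, representing SS as the scalar sum of squared norms (class and method names are illustrative):

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) of a sub-cluster."""

    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def of_point(cls, x):
        x = np.asarray(x, dtype=float)
        return cls(1, x.copy(), float(x @ x))

    def __add__(self, other):
        # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n                    # X0 = LS / N

    def radius(self):
        # Expanding the radius definition gives R^2 = SS/N - ||X0||^2,
        # so the statistics can be computed without the raw points.
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))
```

Merging two sub-clusters is then just `cf = cf1 + cf2`, with no need to revisit their points.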

CF-Tree
- Each non-leaf node has at most B entries.
- Each leaf node has at most L CF entries, each of which satisfies threshold T.
- Node size is determined by the dimensionality of the data space and the input parameter P (page size).

CF-Tree Insertion
1. Recurse down from the root to find the appropriate leaf, following the closest-CF path with respect to one of D0-D4.
2. Modify the leaf: if the closest CF entry cannot absorb the new point without violating the threshold, make a new CF entry; if there is no room for the new entry, split the leaf node.
3. Traverse back up, updating the CFs on the path or splitting nodes; if a split leaves no room in the parent, split the parent node as well.
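A sketch of the leaf-modification step, with entries as (N, LS, SS) tuples; the threshold test here applies T to the radius of the tentatively merged entry, which is one common choice (the diameter is another):

```python
import numpy as np

def try_absorb(leaf_entries, x, T):
    """Insert point x into a leaf: merge into the closest CF entry (by D0,
    the centroid Euclidean distance) if the result still satisfies T,
    otherwise start a new entry."""
    x = np.asarray(x, dtype=float)
    if leaf_entries:
        d0 = [np.linalg.norm(ls / n - x) for n, ls, ss in leaf_entries]
        i = int(np.argmin(d0))
        n, ls, ss = leaf_entries[i]
        n2, ls2, ss2 = n + 1, ls + x, ss + float(x @ x)  # additivity theorem
        c = ls2 / n2
        radius = np.sqrt(max(ss2 / n2 - c @ c, 0.0))     # radius of merged entry
        if radius <= T:
            leaf_entries[i] = (n2, ls2, ss2)
            return
    leaf_entries.append((1, x.copy(), float(x @ x)))     # new CF entry
```

A full implementation would also split the leaf once it holds more than L entries and propagate updates to the ancestors, as in step 3.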

CF-Tree Rebuilding
- If we run out of memory, increase threshold T: with a larger threshold, CF entries absorb more data.
- Rebuilding pushes the existing CF entries into a new, smaller tree, since the larger T allows different CF entries to group together.
- Reducibility theorem: increasing T results in a CF tree smaller than the original, and rebuilding needs at most h extra pages of memory (h being the height of the tree).
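A simplified sketch of the rebuilding idea over a flat list of leaf entries (the real algorithm rebuilds the tree path by path and reuses its pages; this only shows how a larger T lets entries group together):

```python
import numpy as np

def rebuild(entries, new_T):
    """Greedily re-insert CF entries (N, LS, SS) under a larger threshold;
    nearby entries that now fit within new_T merge, shrinking the tree."""
    rebuilt = []
    for n, ls, ss in entries:
        placed = False
        for i, (n2, ls2, ss2) in enumerate(rebuilt):
            m, mls, mss = n + n2, ls + ls2, ss + ss2   # tentative merge
            c = mls / m
            r = np.sqrt(max(mss / m - c @ c, 0.0))     # radius of merged entry
            if r <= new_T:
                rebuilt[i] = (m, mls, mss)
                placed = True
                break
        if not placed:
            rebuilt.append((n, ls, ss))
    return rebuilt
```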

Example of BIRCH
[Figure: a CF tree with root entries LN1, LN2, LN3 over leaf sub-clusters sc1-sc7; a new sub-cluster sc8 arrives closest to LN1.]

Insertion Operation in BIRCH
If the branching factor of a leaf node cannot exceed 3, inserting sc8 forces LN1 to split.
[Figure: LN1 splits into LN1' and LN1''; the root now holds four entries.]

If the branching factor of a non-leaf node cannot exceed 3, the root in turn overflows; it is split and the height of the CF tree increases by one.
[Figure: the root entries are redistributed into non-leaf nodes NLN1 and NLN2 under a new root.]
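The split itself follows the rule from the BIRCH paper: take the farthest pair of entries as seeds and redistribute the rest by proximity. A minimal sketch over (N, LS, SS) entries (the function name and tuple layout are illustrative):

```python
import numpy as np

def split_node(entries):
    """Split an overfull node: take the two entries whose centroids are
    farthest apart as seeds, then send every other entry to the closer seed."""
    cents = [ls / n for n, ls, ss in entries]
    pairs = [(a, b) for a in range(len(cents)) for b in range(a + 1, len(cents))]
    i, j = max(pairs, key=lambda p: np.linalg.norm(cents[p[0]] - cents[p[1]]))
    left, right = [entries[i]], [entries[j]]
    for k, e in enumerate(entries):
        if k in (i, j):
            continue
        d_left = np.linalg.norm(cents[k] - cents[i])
        d_right = np.linalg.norm(cents[k] - cents[j])
        (left if d_left <= d_right else right).append(e)
    return left, right
```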

BIRCH Overview
[Figure: overview of the phases of the BIRCH algorithm.]

Experimental Results
Input parameters:
- Memory (M): 5% of the data set
- Disk space (R): 20% of M
- Distance equation: D2
- Quality equation: weighted average diameter (D)
- Initial threshold (T): 0.0
- Page size (P): 1024 bytes

Experimental Results

KMEANS clustering:
DS   Time   D      # Scan      DS   Time   D      # Scan
1    43.9   2.09   289         1o   33.8   1.97   197
2    13.2   4.43   51          2o   12.7   4.20   29
3    32.9   3.66   187         3o   36.0   4.35   241

BIRCH clustering:
DS   Time   D      # Scan      DS   Time   D      # Scan
1    11.5   1.87   2           1o   13.6   1.87   2
2    10.7   1.99   2           2o   12.1   1.99   2
3    11.4   3.95   2           3o   12.2   3.99   2

Conclusions
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
- Given a limited amount of main memory, BIRCH can minimize the time required for I/O.
- BIRCH is a scalable clustering algorithm with respect to the number of objects, and it produces good-quality clusterings of the data.

Exam Questions
What is the main limitation of BIRCH? Sinc
