




BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
Tian Zhang, Raghu Ramakrishnan, Miron Livny
Presented by Zhao Li, Spring 2009
March 6, 2022

Outline
- Introduction to clustering
- Main techniques in clustering
- Hybrid algorithm: BIRCH
- Example of the BIRCH algorithm
- Experimental results
- Conclusions

Clustering Introduction
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space. Main methods:
- Partitioning: K-Means, ...
- Hierarchical: BIRCH, ROCK, ...
- Density-based: DBSCAN, ...
A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity.

Main Techniques (1): Partitioning Clustering (K-Means)
- Step 1: pick initial centers.
- Step 2: assign each point to its closest center; compute the new centers after the 1st iteration.
- Step 3: repeat; the centers settle after the 2nd iteration.

Main Techniques (2): Hierarchical Clustering
Multilevel clustering: level 1 has n clusters, level n has one cluster (or upside down).
- Agglomerative HC: starts with singletons and merges clusters (bottom-up).
- Divisive HC: starts with one cluster and splits it (top-down).
The output is a dendrogram.

Agglomerative HC Example (Nearest Neighbor)
One pair of clusters is merged per level: level 2 gives k = 7 clusters, level 3 gives k = 6, level 4 gives k = 5, level 5 gives k = 4, level 6 gives k = 3, level 7 gives k = 2, and level 8 gives k = 1 cluster.

Remarks
Partitioning clustering
- Time complexity: O(n)
- Pros: easy to use and relatively efficient
- Cons: sensitive to initialization (a bad initialization may lead to bad results); needs to store all data in memory
Hierarchical clustering
- Time complexity: O(n^2 log n)
- Pros: outputs a dendrogram, which is desired in many applications
- Cons: higher time complexity; needs to store all data in memory

Introduction to BIRCH
- Designed for very large data sets, where time and memory are limited
- Incremental and dynamic clustering of incoming objects
- Only one scan of the data is necessary; the whole data set is not needed in advance
Two key phases:
1. Scan the database to build an in-memory CF tree.
2. Apply a clustering algorithm to cluster the leaf nodes.

Similarity Metric (1)
Given a cluster of N instances {P_1, ..., P_N}, define:
- Centroid: C = (Σ_i P_i) / N
- Radius (average distance from member points to the centroid): R = sqrt(Σ_i (P_i − C)^2 / N)
- Diameter (average pairwise distance within a cluster): D = sqrt(Σ_i Σ_j (P_i − P_j)^2 / (N(N − 1)))

Similarity Metric (2)
Distances between two clusters:
- D0: centroid Euclidean distance
- D1: centroid Manhattan distance
- D2: average inter-cluster distance
- D3: average intra-cluster distance
- D4: variance-increase distance

Clustering Feature
The BIRCH algorithm builds a dendrogram called a clustering feature tree (CF tree) while scanning the data set. Each entry in the CF tree represents a cluster of objects and is characterized by a 3-tuple (N, LS, SS), where N is the number of objects in the cluster, LS = Σ_{i=1..N} P_i is the linear sum of the points, and SS = Σ_{i=1..N} P_i^2 is their square sum.

Properties of Clustering Feature
- A CF entry is compact: it stores significantly less than all of the data points in the sub-cluster.
- A CF entry has sufficient information to calculate D0–D4.
- The additivity theorem, CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2), allows us to merge sub-clusters incrementally and consistently.

CF-Tree
- Each non-leaf node has at most B entries.
- Each leaf node has at most L CF entries, each of which satisfies the threshold T.
- Node size is determined by the dimensionality of the data space and the input parameter P (page size).

CF-Tree Insertion
1. Recurse down from the root to find the appropriate leaf, following the closest-CF path w.r.t. the chosen distance (D0, ..., D4).
2. Modify the leaf: if the closest CF entry cannot absorb the point, make a new CF entry; if there is no room for the new entry, split the leaf (and, if necessary, the parent node).
3. Traverse back up, updating the CFs on the path or splitting nodes.

CF-Tree Rebuilding
- If we run out of memory, increase the threshold T: a larger T lets CF entries absorb more data and allows different CF entries to group together, so rebuilding pushes CF entries together into a smaller tree.
- Reducibility theorem: increasing T results in a CF tree smaller than the original, and rebuilding needs at most h extra pages of memory (h = height of the tree).

Example of BIRCH
A new subcluster sc8 arrives at a tree whose root points to leaf nodes LN1, LN2, LN3 holding subclusters sc1–sc7; sc8 is routed to LN1.

Insertion Operation in BIRCH
- If the branching factor of a leaf node cannot exceed 3, LN1 is split (into LN1' and LN1'').
- If the branching factor of a non-leaf node cannot exceed 3, the root is split (into NLN1 and NLN2) and the height of the CF tree increases by one.

BIRCH Overview
Phases: (1) load the data into an in-memory CF tree; (2) condense it into a smaller CF tree (optional); (3) global clustering of the leaf entries; (4) cluster refining (optional).

Experimental Results
Input parameters:
- Memory (M): 5% of data set
- Disk space (R): 20% of M
- Distance equation: D2
- Quality equation: weighted average diameter (D)
- Initial threshold (T): 0.0
- Page size (P): 1024 bytes
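The CF triple (N, LS, SS) and the additivity theorem described above can be sketched in pure Python for 1-D points (a minimal illustration, not the authors' implementation; helper names such as `cf_of` are invented here). Note that the radius is computable from the triple alone, since R = sqrt(SS/N − (LS/N)^2):

```python
import math

def cf_of(points):
    """Build the clustering feature (N, LS, SS) of a set of 1-D points."""
    n = len(points)
    ls = sum(points)                   # linear sum of the points
    ss = sum(p * p for p in points)    # square sum of the points
    return (n, ls, ss)

def cf_merge(cf1, cf2):
    """Additivity theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)."""
    return (cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2])

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    """Average distance from member points to the centroid,
    rewritten in CF terms: R = sqrt(SS/N - (LS/N)^2)."""
    n, ls, ss = cf
    return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

cf_a = cf_of([1.0, 2.0, 3.0])    # (3, 6.0, 14.0)
cf_b = cf_of([4.0, 5.0])         # (2, 9.0, 41.0)
merged = cf_merge(cf_a, cf_b)    # equals cf_of([1.0, 2.0, 3.0, 4.0, 5.0])
```

Merging the two triples gives exactly the CF of the union of the points, which is why sub-clusters can be combined without revisiting the raw data.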
Experimental Results

KMEANS clustering:
DS  Time  D     #Scan  |  DS  Time  D     #Scan
1   43.9  2.09  289    |  1o  33.8  1.97  197
2   13.2  4.43  51     |  2o  12.7  4.20  29
3   32.9  3.66  187    |  3o  36.0  4.35  241

BIRCH clustering:
DS  Time  D     #Scan  |  DS  Time  D     #Scan
1   11.5  1.87  2      |  1o  13.6  1.87  2
2   10.7  1.99  2      |  2o  12.1  1.99  2
3   11.4  3.95  2      |  3o  12.2  3.99  2

Conclusions
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
- Given a limited amount of main memory, BIRCH can minimize the time required for I/O.
- BIRCH is a scalable clustering algorithm with respect to the number of objects, and it produces clusterings of good quality.

Exam Questions
What is the main limitation of BIRCH?
Since BIRCH uses the radius/diameter to control the boundary of a cluster, it performs well only when the natural clusters are roughly spherical; it also handles only numeric data and is sensitive to the order in which the data is inserted.
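The leaf-level step of CF-tree insertion — absorb a point into the closest CF entry unless the entry's radius would exceed the threshold T, otherwise open a new entry — can be sketched for a single leaf (a simplified 1-D sketch with invented names; the real algorithm descends a tree of nodes and also splits leaves that exceed L entries):

```python
import math

def radius_of(n, ls, ss):
    # R = sqrt(SS/N - (LS/N)^2), derived from the CF triple alone
    return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

def leaf_insert(entries, point, threshold):
    """Insert a 1-D point into a list of CF entries [(N, LS, SS), ...].
    Absorb it into the closest entry if that entry's radius stays
    within the threshold; otherwise start a new CF entry."""
    if entries:
        # closest entry by centroid distance (metric D0, in 1-D)
        i = min(range(len(entries)),
                key=lambda k: abs(entries[k][1] / entries[k][0] - point))
        n, ls, ss = entries[i]
        cand = (n + 1, ls + point, ss + point * point)
        if radius_of(*cand) <= threshold:
            entries[i] = cand                      # absorbed
            return entries
    entries.append((1, point, point * point))      # new CF entry
    return entries

leaf = []
for p in [1.0, 1.2, 5.0, 5.1, 0.9]:
    leaf_insert(leaf, p, threshold=0.5)
# two CF entries emerge: one for the points near 1.0, one near 5.0
```

With T = 0.5 the five points settle into two sub-clusters without ever storing the raw points, which is the memory saving the CF tree is built around.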