




Data Mining: Concepts and Techniques — Chapter 7
Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign (/~hanj)
© 2006 Jiawei Han and Micheline Kamber. All rights reserved. (12 January 2023)

Chapter 7. Cluster Analysis — Outline
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Methods
- Clustering High-Dimensional Data
- Constraint-Based Clustering
- Outlier Analysis
- Summary

What is Cluster Analysis?
- Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
- Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- Typical applications: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms

Clustering: Rich Applications and Multidisciplinary Efforts
- Pattern recognition
- Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters for other spatial mining tasks
- Image processing
- Economic science (especially market research)
- WWW: document classification; cluster Web log data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location

Quality: What Is Good Clustering?
- A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
- There is a separate "quality" function that measures the "goodness" of a cluster
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
- Weights should be associated with different variables based on applications and data semantics
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Data Structures
- Data matrix (two modes): n objects by p variables
- Dissimilarity matrix (one mode): an n-by-n table of pairwise distances d(i, j)

Type of Data in Clustering Analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

Interval-Valued Variables
- Standardize the data: calculate the mean absolute deviation s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|), where m_f = (1/n)(x_1f + x_2f + ... + x_nf)
- Calculate the standardized measurement (z-score): z_if = (x_if - m_f) / s_f
- Using the mean absolute deviation is more robust than using the standard deviation

Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- A popular one is the Minkowski distance: d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q), where i = (x_i1, ..., x_ip) and j = (x_j1, ..., x_jp) are two p-dimensional data objects and q is a positive integer
- q = 1: Manhattan distance (L1 distance, 1-norm of a vector), a.k.a. city-block distance
- q = 2: Euclidean distance (L2 distance, 2-norm of a vector)
- Properties: d(i, j) >= 0; d(i, i) = 0; d(i, j) = d(j, i); d(i, j) <= d(i, k) + d(k, j)
- One can also use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures

Binary Variables
- A 2x2 contingency table for binary data: a = # of variables where both objects i and j are 1; b = # where i is 1 and j is 0; c = # where i is 0 and j is 1; d = # where both are 0
- Distance measure for symmetric binary variables: d(i, j) = (b + c) / (a + b + c + d)
- Distance measure for asymmetric binary variables: d(i, j) = (b + c) / (a + b + c)
- Jaccard coefficient (a similarity measure for asymmetric binary variables): sim(i, j) = a / (a + b + c)

Dissimilarity Between Binary Variables
- Example where the attributes are asymmetric binary [worked table not preserved in this copy]

Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: simple matching — d(i, j) = (p - m) / p, where m is the # of matches and p the total # of variables
- Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states

Ordinal Variables
- An ordinal variable can be discrete or continuous; order is important, e.g., rank
- Can be treated like interval-scaled: replace x_if by its rank r_if in {1, ..., M_f}; map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if - 1) / (M_f - 1); then compute the dissimilarity using methods for interval-scaled variables

Ratio-Scaled Variables
- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately exponential, such as Ae^Bt or Ae^-Bt
- Methods: treat them like interval-scaled variables — not a good choice! (why? — the scale can be distorted); apply a logarithmic transformation, y_if = log(x_if); or treat them as continuous ordinal data and treat their rank as interval-scaled

Variables of Mixed Types
- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
- One may use a weighted formula to combine their effects: d(i, j) = Σ_f δ_ij(f) d_ij(f) / Σ_f δ_ij(f)
- If f is binary or nominal: d_ij(f) = 0 if x_if = x_jf, and d_ij(f) = 1 otherwise
- If f is interval-based: use the normalized distance
- If f is ordinal or ratio-scaled: compute ranks r_if, set z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled

Vector Objects
- Vector objects: keywords in documents, gene features in micro-arrays, etc.
- Broad applications: information retrieval, biological taxonomy, etc.
- Cosine measure: s(x, y) = (x . y) / (|x| |y|)
- A variant: the Tanimoto coefficient, s(x, y) = (x . y) / (|x|^2 + |y|^2 - x . y)

Major Clustering Approaches (I)
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors. Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue

Major Clustering Approaches (II)
- Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE
- Model-based approach: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model. Typical methods:
EM, SOM, COBWEB
- Frequent-pattern-based: based on the analysis of frequent patterns. Typical method: pCluster
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints. Typical methods: COD (obstacles), constrained clustering

Typical Alternatives to Calculate the Distance between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min(t_ip, t_jq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max(t_ip, t_jq)
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg(t_ip, t_jq)
- Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j)
- Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j), where a medoid is one chosen, centrally located object in the cluster

Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the "middle" of a cluster
- Radius: square root of the average distance from any point of the cluster to its centroid
- Diameter: square root of the average mean squared distance between all pairs of points in the cluster

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
- Global optimum: exhaustively enumerate all partitions
- Heuristic methods: the k-means and k-medoids algorithms
- k-means (MacQueen '67): each cluster is represented by the center of the cluster
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in four steps:
  1. Partition the objects into k nonempty subsets
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2; stop when no more new assignments occur
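The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's code; the 2-D points and the choice of initial centers (k sampled objects, as in the example slide) are made up for demonstration.

```python
import random

def kmeans(points, k, seed=0):
    """Minimal k-means: assign each point to the nearest centroid,
    recompute centroids as cluster means, repeat until assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # arbitrarily choose k objects as initial centers
    assignment = None
    while True:
        # Step 3: assign each object to the cluster with the nearest seed point
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:         # Step 4: stop when no new assignments
            return assignment, centroids
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean point of its cluster
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
labels, centers = kmeans(pts, k=2)
```

With two well-separated groups like these, the loop converges in a couple of iterations to the obvious two-cluster split regardless of which objects are drawn as initial centers.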
Example (K = 2): [Figure: four panels on a 0-10 grid — arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update the means again until assignments stabilize]

Comments on the K-Means Method
Strengths:
- Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
- Compare: PAM is O(k(n-k)^2) per iteration; CLARA is O(ks^2 + k(n-k))
- Comment: often terminates at a local optimum

Weaknesses:
- Applicable only when a mean is defined — what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
- A few variants of k-means differ in: selection of the initial k means; dissimilarity calculations; strategies to calculate cluster means
- Handling categorical data: k-modes (Huang '98) — replacing means of clusters with modes, using new dissimilarity measures to deal with categorical objects, and using a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: the k-prototype method

What Is the Problem of the K-Means Method?
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
- K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used — the medoid is the most centrally located object in a cluster
[Figure: two 0-10 grids contrasting a mean pulled toward an outlier with a medoid that stays inside the cluster]

The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987): starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
- PAM works effectively for small data sets but does not scale well to large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling

A Typical K-Medoids Algorithm (PAM)
[Figure: K = 2 on a 0-10 grid — arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping; swap O and O_random if quality is improved; loop until no change (total costs of 20 and 26 shown for the two configurations)]

PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987); built into S-Plus
- Uses a real object to represent each cluster:
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. For each pair of i and h, if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
  4. Repeat steps 2-3 until there is no change

PAM Clustering: Total Swapping Cost
- TC_ih = Σ_j C_jih
[Figure: cases illustrating the reassignment cost C_jih for a point j when medoid i is swapped with non-medoid h]

What Is the Problem with PAM?
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well to large data sets: O(k(n-k)^2) for each iteration, where n is the # of data points and k is the # of clusters
- A sampling-based method:
CLARA (Clustering LARge Applications)

CLARA (Clustering LARge Applications) (1990)
- CLARA (Kaufmann and Rousseeuw, 1990); built into statistical analysis packages such as S+
- Draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weakness: efficiency depends on the sample size; a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARANS ("Randomized" CLARA) (1994)
- CLARANS (A Clustering Algorithm based on RANdomized Search) (Ng and Han '94)
- Draws a sample of neighbors dynamically
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
- If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
- More efficient and scalable than both PAM and CLARA
- Focusing techniques and spatial access structures may further improve its performance (Ester et al. '95)

Hierarchical Clustering
- Uses a distance matrix as the clustering criterion
- Does not require the number of clusters k as an input, but needs a termination condition
[Figure: agglomerative (AGNES) merging objects a, b, c, d, e bottom-up over steps 0-4, and divisive (DIANA) splitting top-down in the reverse order]

AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S-Plus
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Goes on in a non-descending fashion
- Eventually all nodes belong to the same cluster

Dendrogram: Shows How the Clusters are Merged
- Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster

DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S-Plus
- The inverse order of AGNES
- Eventually each node forms a cluster on its own

Recent Hierarchical Clustering Methods
- Major weaknesses of agglomerative clustering methods: they do not scale well (time complexity of at least O(n^2), where n is the number of total objects), and they can never undo what was done previously
- Integration of hierarchical with distance-based clustering: BIRCH (1996) uses a CF-tree and incrementally adjusts the quality of sub-clusters; ROCK (1999) clusters categorical data by neighbor and link analysis; CHAMELEON (1999) does hierarchical clustering using dynamic modeling

BIRCH (1996)
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD '96)
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
- Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, and is sensitive to the order of the data records

Clustering Feature Vector in BIRCH
Clustering Feature:
CF = (N, LS, SS)
- N: number of data points
- LS: the linear sum of the N points, Σ_{i=1..N} X_i
- SS: the square sum of the N points, Σ_{i=1..N} X_i^2
- Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))

CF-Tree in BIRCH
- A clustering feature is a summary of the statistics for a given sub-cluster: the 0th, 1st, and 2nd moments of the sub-cluster from the statistical point of view
- It registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering: a non-leaf node in the tree has descendants ("children"), and the non-leaf nodes store sums of the CFs of their children
- A CF-tree has two parameters: the branching factor (the maximum number of children) and the threshold (the maximum diameter of sub-clusters stored at the leaf nodes)

The CF-Tree Structure
[Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6 — a root of CF entries pointing to non-leaf nodes, which point to leaf nodes chained by prev/next pointers]

Clustering Categorical Data: The ROCK Algorithm
- ROCK: RObust Clustering using linKs (S. Guha, R. Rastogi & K. Shim, ICDE '99)
- Major ideas: use links to measure similarity/proximity; not distance-based
- Computational complexity: O(n^2 + n m_m m_a + n^2 log n)
- Algorithm: sampling-based clustering — draw a random sample, cluster with links, label the data on disk

Similarity Measure in ROCK
- Traditional measures for categorical data may not work well, e.g., the Jaccard coefficient
- Example: two groups (clusters) of transactions — C1 (over <a,b,c,d,e>): {a,b,c}, {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {a,d,e}, {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}; C2 (over <a,b,f,g>): {a,b,f}, {a,b,g}, {a,f,g}, {b,f,g}
- The Jaccard coefficient may lead to a wrong clustering result: within C1 it ranges from 0.2 ({a,b,c} vs. {b,d,e}) to 0.5 ({a,b,c} vs. {a,b,d}), while across C1 and C2 it can be as high as 0.5 ({a,b,c} vs. {a,b,f})
- Jaccard-coefficient-based similarity function: sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|. Ex.: let T1 = {a,b,c} and T2 = {c,d,e}; then sim(T1, T2) = 1/5 = 0.2

Link Measure in ROCK
- Links: the number of common neighbors
- C1 (over <a,b,c,d,e>): {a,b,c}, {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {a,d,e}, {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}; C2 (over <a,b,f,g>): {a,b,f}, {a,b,g}, {a,f,g}, {b,f,g}
- Let T1 = {a,b,c}, T2 = {c,d,e}, T3 = {a,b,f}, with θ = 0.5
- link(T1, T2) = 4, since they have 4 common neighbors: {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}
- link(T1, T3) = 3, since they have 3 common neighbors: {a,b,d}, {a,b,e}, {a,b,g}
- Thus link is a better measure than the Jaccard coefficient

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
- CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar '99
- Measures similarity based on a dynamic model: two clusters are merged only if the interconnectivity
and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
- CURE ignores information about the interconnectivity of the objects; ROCK ignores information about the closeness of two clusters
- A two-phase algorithm: (1) use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters; (2) use an agglomerative hierarchical clustering algorithm to find the actual clusters by repeatedly combining these sub-clusters

Overall Framework of CHAMELEON
[Figure: data set -> construct sparse graph -> partition the graph -> merge partitions -> final clusters]

CHAMELEON (Clustering Complex Objects)
[Figure: CHAMELEON clustering complex 2-D shapes]

Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features: discovers clusters of arbitrary shape; handles noise; one scan; needs density parameters as a termination condition
- Several interesting studies: DBSCAN (Ester et al., KDD '96); OPTICS (Ankerst et al., SIGMOD '99); DENCLUE (Hinneburg & Keim, KDD '98); CLIQUE (Agrawal et al., SIGMOD '98) (more grid-based)

Density-Based Clustering: Basic Concepts
- Two parameters: Eps, the maximum radius of the neighbourhood, and MinPts, the minimum number of points in an Eps-neighbourhood of a point
- N_Eps(p) = {q in D | dist(p, q) <= Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if p belongs to N_Eps(q) and the core point condition holds: |N_Eps(q)| >= MinPts
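These two definitions translate directly into code. The sketch below is a plain-Python illustration (not from the book): it uses Euclidean distance via `math.dist`, and the point set and parameter values are made up.

```python
import math

def eps_neighborhood(points, p, eps):
    """N_Eps(p): all points of the data set within distance eps of p (p itself included)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """Core point condition: |N_Eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.4, 0.4), (5.0, 5.0)]
core = is_core(pts, (0.0, 0.0), eps=1.0, min_pts=4)        # point in the dense corner
noise_core = is_core(pts, (5.0, 5.0), eps=1.0, min_pts=4)  # isolated point
```

Here `(0.0, 0.0)` has four points (including itself) within `eps`, so it satisfies the core point condition, while the isolated point `(5.0, 5.0)` does not.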
[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]

Density-Reachable and Density-Connected
- Density-reachable: a point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that each p_{i+1} is directly density-reachable from p_i
- Density-connected: a point p is density-connected to a point q w.r.t. Eps and MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: a chain q = p_1, ..., p illustrating density-reachability, and points p and q both density-reachable from o illustrating density-connectedness]

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]

DBSCAN: The Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
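The algorithm above, together with the density-reachability definitions from the preceding slides, can be sketched end-to-end in plain Python. This is an illustrative sketch rather than the original pseudocode; the data set and parameters are made up, and a linear scan stands in for a spatial index.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or -1 for noise."""
    labels = {p: None for p in points}
    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbors = [q for q in points if math.dist(p, q) <= eps]
        if len(neighbors) < min_pts:           # not a core point: tentatively noise
            labels[p] = -1
            continue
        cluster_id += 1                        # p is a core point: a cluster is formed
        labels[p] = cluster_id
        queue = [q for q in neighbors if q != p]
        while queue:                           # collect everything density-reachable from p
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id         # border point reclaimed from noise
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = [r for r in points if math.dist(q, r) <= eps]
            if len(q_neighbors) >= min_pts:    # q is core too: keep expanding the chain
                queue.extend(q_neighbors)
    return labels

pts = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2),
       (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (5.2, 5.2),
       (9.0, 0.0)]                             # lone point: ends up as noise
labels = dbscan(pts, eps=0.5, min_pts=3)
```

On this data the two tight groups come out as two clusters of arbitrary shape, and the isolated point is labeled -1 (noise), matching the core/border/outlier picture above.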