




Data Mining: Concepts and Techniques
Slides for Textbook
Chapter 3

© Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-C/~hanj
12/23/2022

Chapter 3: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“”
  - noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42”, Birthday=“03/07/2019”
    - e.g., was rating “1, 2, 3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data comes from
  - n/a data value when collected
  - different considerations between the time when the data was collected and when it is analyzed
  - human/hardware/software problems
- Noisy data comes from the process of data
  - collection
  - entry
  - transmission
- Inconsistent data comes from
  - different data sources
  - functional dependency violation

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse. (Bill Inmon)

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing
[Figure: forms of data preprocessing]
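
As a rough illustration of the three kinds of dirty data listed under "Why Data Preprocessing?", the sketch below checks one record for an empty attribute, an out-of-range value, and an age that disagrees with the birthday. The function name `find_dirty_fields`, the record layout, the reading of “03/07/2019” as March 7, and the reference date are all assumptions made for this example, not part of the slides.

```python
from datetime import date


def find_dirty_fields(record, as_of):
    """Return (kind, field) pairs for problems found in one record.

    The field names and rules are hypothetical; they mirror the slide's
    examples (occupation="", Salary="-10", Age vs. Birthday).
    """
    problems = []

    # Incomplete: the attribute is missing or recorded as an empty value.
    if not record.get("occupation"):
        problems.append(("incomplete", "occupation"))

    # Noisy: the value lies outside the plausible domain (negative salary).
    salary = record.get("salary")
    if salary is not None and salary < 0:
        problems.append(("noisy", "salary"))

    # Inconsistent: the stored age disagrees with the age derived from birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if age is not None and birthday is not None:
        had_birthday = (as_of.month, as_of.day) >= (birthday.month, birthday.day)
        derived_age = as_of.year - birthday.year - (0 if had_birthday else 1)
        if derived_age != age:
            problems.append(("inconsistent", "age"))

    return problems


# The slide's example values, checked against an arbitrary reference date.
record = {"occupation": "", "salary": -10, "age": 42, "birthday": date(2019, 3, 7)}
problems = find_dirty_fields(record, as_of=date(2022, 12, 23))
print(problems)
```

Real cleaning pipelines would derive such rules from domain knowledge rather than hard-code them per field.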

Data Cleaning
- Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and often infeasible
- Fill it in automatically with
  - a global constant, e.g., “unknown” (which may act as a new class)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method:
  - first sort data and partition into (equi-depth) bins
  - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and check by human (e.g., deal with possible outliers)
- Regression
  - smooth by fitting the data into regression functions

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of intervals will be W = (B - A)/N
  - The most straightforward, but outliers may dominate presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Cluster Analysis
[Figure: cluster analysis]

Regression
[Figure: regression; axes x and y, fitted line y = x + 1, with labels X1, Y1, Y1’]

Data Integration
- Data integration:
  - combines data from multiple sources into a coherent store
- Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#
- Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
- Redundant data occur often when multiple databases are integrated
  - The same attribute may have different names in different databases
  - One attribute may be a “derived” attribute in another table, e.g., annual revenue
- Redundant data may be detected by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - New attributes constructed from the given ones

Data Transformation: Normalization
- min-max normalization:
    v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
    v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

Data Reduction Strategies
- A data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction: remove unimportant attributes
  - Data compression
  - Numerosity reduction: fit data into models
  - Discretization and concept hierarchy generation

Data Cube Aggregation
- The lowest level of a data cube
  - the aggregated data for an individual entity of interest
  - e.g., a customer in a phone calling data warehouse
- Multiple levels of aggregation in data cubes
  - Further reduce the size of data to deal with
- Reference appropriate levels
  - Use the smallest representation which is enough to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible

Dimensionality Reduction
- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
- Heuristic methods (due to the exponential number of choices):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction

Example of Decision Tree Induction
- Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree with tests A4?, A1?, and A6?, and leaves Class 1 and Class 2]
- Reduced attribute set: {A1, A4, A6}

Heuristic Feature Selection Methods
- There are 2^d possible feature subsets of d features
- Several heuristic feature selection methods:
  - Best single features under the feature independence assumption: choose by significance tests
  - Best step-wise feature selection:
    - The best single feature is picked first
    - Then the next best feature conditioned on the first, ...
  - Step-wise feature elimination:
    - Repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound:
    - Use feature elimination and backtracking

Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequence is not audio
  - Typically short and varies slowly with time

Data Compression
[Figure: original data is compressed and recovered exactly (lossless), or recovered only as an approximation of the original data (lossy)]

Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multiresolution analysis
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
- Method:
  - Length, L, must be an integer power of 2 (padding with 0s, when necessary)
  - Each transform has 2 functions: smoothing and difference
  - Applies to pairs of data, resulting in two sets of data of length L/2
  - Applies the two functions recursively, until it reaches the desired length
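
The method steps above can be sketched with the Haar wavelet, the simplest case. This is a minimal illustration, not the slides' own implementation. It assumes unnormalized pairwise averages as the smoothing function and pairwise half-differences as the difference function. The names `haar_dwt`, `haar_idwt`, and `keep_strongest` are hypothetical. A production version would usually normalize the coefficients before ranking them by strength.

```python
def haar_dwt(values):
    """Haar decomposition: [overall average, coarsest detail, ..., finest details]."""
    data = [float(v) for v in values]
    # Pad with zeros so the length L is an integer power of 2.
    length = 1
    while length < len(data):
        length *= 2
    data += [0.0] * (length - len(data))

    details = []
    while len(data) > 1:
        # Smoothing and difference functions applied to pairs of data,
        # giving two sequences of length L/2.
        smooth = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        detail = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = detail + details  # coarser details go in front
        data = smooth               # recurse on the smoothed half
    return data + details


def haar_idwt(coeffs):
    """Invert haar_dwt by rebuilding each level from averages and details."""
    data = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(data)]
        rebuilt = []
        for s, d in zip(data, detail):
            rebuilt.extend([s + d, s - d])
        pos += len(data)
        data = rebuilt
    return data


def keep_strongest(coeffs, k):
    """Zero all but the k coefficients with the largest absolute value."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_dwt(signal)
approx = haar_idwt(keep_strongest(coeffs, 4))  # lossy reconstruction
print(coeffs)
print(approx)
```

Keeping every coefficient reproduces the signal exactly. Keeping only the strongest few gives the compressed approximation described on the slide.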
[Figure: Haar-2 and Daubechies-4 wavelets]

DWT for Image Compression
[Figure: an image passed through successive low-pass and high-pass filtering stages]

Principal Component Analysis
- Given N data vectors from k dimensions, find c <= k orthogonal vectors that can be best used to represent the data
  - The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large

Principal Component Analysis
[Figure: original axes X1 and X2 with principal component axes Y1 and Y2]

Numerosity Reduction
- Parametric methods
  - Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling

Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
- Linear regression: Y = α + β X
  - Two parameters, α and β, specify the line and are to be estimated by using the data at hand
  - using the least-squares criterion on the known values of Y1, Y2, …, X1, X2, …
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd

Histograms
- A popular data reduction technique
- Divide data into buckets and store the average (sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems

Clustering
- Partition the data set into clusters, and one can store the cluster representation only
- Can be very effective if data is clustered but not if data is “smeared”
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

Sampling
- Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
- Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (page at a time)

Sampling
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]

Sampling
[Figure: raw data and the corresponding cluster/stratified sample]

Hierarchical Reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters”
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation
  - An index tree hierarchically divides a data set into partitions by value range of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram

Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization:
  - divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization and Concept Hierarchy
- Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (see sections before)
- Histogram analysis (see sections before)
- Clustering analysis (see sections before)
- Entropy-based discretization
- Segmentation by natural partitioning

Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., when the reduction in entropy, Ent(S) - E(S, T), falls below a small threshold δ
- Experiments show that it may reduce data size and improve classification accuracy

Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals
  - If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Example of 3-4-5 Rule
- Step 1: profit values: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
- Step 2: msd = 1,000, Low = -$1,000, High = $2,000
- Step 3: (-$1,000 to $2,000) is partitioned into
  - (-$1,000 to 0)
  - (0 to $1,000)
  - ($1,000 to $2,000)
- Step 4: (-$400 to $5,000) is partitioned into
  - (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)
  - (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)
  - ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)
  - ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)

Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
  - street < city < state < country
- Specification of a portion of a hierarchy by explicit data grouping
  - {Urbana, Champaign, Chicago} < Illinois
- Specification of a set of attributes
  - The system automatically generates the partial ordering by analysis of the number of distinct values
  - E.g., street < city < state < country
- Specification of only a partial set of attributes
  - E.g., only street < city, not others

Automatic Concept Hierarchy Generation
- Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Note: exceptions include weekday, month, quarter, year
- Example hierarchy, from the highest level to the lowest:
  - country: 15 distinct values
  - province_or_state: 65 distinct values
  - city: 3567 distinct values
  - street: 674,339 distinct values

Summary
- Data preparation is a big issue for both warehousing and mining
- Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but this is still an active area of research
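
The distinct-value heuristic from the Automatic Concept Hierarchy Generation slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `order_by_distinct_values` and the toy address rows are hypothetical and are not data from the slides. Only the rule itself (most distinct values at the lowest level) comes from the text.

```python
def order_by_distinct_values(rows, attributes):
    """Order attributes from the lowest hierarchy level to the highest.

    The attribute with the most distinct values is placed lowest, as the
    slide describes. Ties are left in their input order.
    """
    counts = {attr: len({row[attr] for row in rows}) for attr in attributes}
    ordered = sorted(attributes, key=lambda attr: counts[attr], reverse=True)
    return ordered, counts


# Hypothetical toy data: (street, city, province_or_state, country).
raw = [
    ("Green St", "Urbana", "Illinois", "USA"),
    ("Main St", "Urbana", "Illinois", "USA"),
    ("Neil St", "Champaign", "Illinois", "USA"),
    ("State St", "Chicago", "Illinois", "USA"),
    ("King St", "Toronto", "Ontario", "Canada"),
    ("Bay St", "Toronto", "Ontario", "Canada"),
    ("Rue Peel", "Montreal", "Quebec", "Canada"),
]
names = ["street", "city", "province_or_state", "country"]
rows = [dict(zip(names, values)) for values in raw]

# Attributes are passed in a scrambled order to show the ordering is derived.
hierarchy, counts = order_by_distinct_values(
    rows, ["country", "street", "province_or_state", "city"]
)
print(" < ".join(hierarchy))
```

As the slide notes, the heuristic has exceptions: a time dimension such as weekday, month, quarter, year would be ordered wrongly by distinct counts alone, so a user or expert may need to override the result.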