Chapter 20: Data Analysis

Outline:
- Decision Support Systems
- Data Warehousing
- Data Mining
- Classification
- Association Rules
- Clustering

Decision Support Systems
- Decision-support systems are used to make business decisions, often based on data collected by online transaction-processing systems.
- Examples of business decisions: What items to stock? What insurance premium to charge? To whom to send advertisements?
- Examples of data used for making decisions: retail sales transaction details; customer profiles (income, age, gender, etc.).

Decision-Support Systems: Overview
- Data analysis tasks are simplified by specialized tools and SQL extensions.
- Example tasks: for each product category and each region, what were the total sales in the last quarter, and how do they compare with the same quarter last year? As above, but for each product category and each customer category.
- Statistical analysis packages (e.g., S+) can be interfaced with databases. Statistical analysis is a large field, but it is not covered here.
- Data mining seeks to discover knowledge automatically, in the form of statistical rules and patterns, from large databases.
- A data warehouse archives information gathered from multiple sources and stores it under a unified schema, at a single site. This is important for large businesses that generate data from multiple divisions, possibly at multiple sites. Data may also be purchased externally.

Data Warehousing
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
- A warehouse greatly simplifies querying and permits the study of historical trends.
- It also shifts the decision-support query load away from transaction-processing systems.

Data Warehousing: Design Issues
- When and how to gather data:
  - Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
  - Destination-driven architecture: the warehouse periodically requests new information from the data sources.
  - Keeping the warehouse exactly synchronized with the data sources (e.g., using two-phase commit) is too expensive. It is usually acceptable for the warehouse to have slightly out-of-date data; data and updates are periodically downloaded from online transaction processing (OLTP) systems.
- What schema to use: schema integration.

More Warehouse Design Issues
- Data cleansing: e.g., correct mistakes in addresses (misspellings, zip code errors); merge address lists from different sources and purge duplicates.
- How to propagate updates: the warehouse schema may be a (materialized) view of the schemas of the data sources.
- What data to summarize: raw data may be too large to store online, and aggregate values (totals and subtotals) often suffice; queries on raw data can often be transformed by the query optimizer to use aggregate values instead.

Warehouse Schemas
- Dimension values are usually encoded using small integers and mapped to full values via dimension tables.
- The resulting schema, with a central fact table referencing dimension tables, is called a star schema. More complicated schema structures exist: the snowflake schema (multiple levels of dimension tables) and the constellation schema (multiple fact tables).
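As a concrete illustration, a minimal star schema can be built and queried in SQLite. The table and column names and the sample rows below are invented for illustration, not taken from the slides:

```python
import sqlite3

# In-memory database with a tiny star schema: a central fact table (sales_fact)
# whose small-integer keys reference two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_dim   (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
CREATE TABLE store_dim  (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE sales_fact (item_id INTEGER, store_id INTEGER, quantity INTEGER, price REAL);
""")
conn.executemany("INSERT INTO item_dim VALUES (?,?,?)",
                 [(1, "bread", "food"), (2, "drill", "tools")])
conn.executemany("INSERT INTO store_dim VALUES (?,?,?)",
                 [(1, "Boston", "east"), (2, "Reno", "west")])
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?)",
                 [(1, 1, 10, 2.0), (1, 2, 5, 2.0), (2, 1, 1, 99.0)])

# A typical decision-support query: total sales per product category and region.
rows = conn.execute("""
    SELECT i.category, s.region, SUM(f.quantity * f.price) AS total
    FROM sales_fact f
    JOIN item_dim i  ON f.item_id  = i.item_id
    JOIN store_dim s ON f.store_id = s.store_id
    GROUP BY i.category, s.region
    ORDER BY i.category, s.region
""").fetchall()
```

The small-integer keys in the fact table are exactly the dimension encoding described above; the GROUP BY query is the kind of aggregate a warehouse would often precompute.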
Data Mining
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Prediction based on past history:
  - Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  - Predict whether a pattern of phone calling-card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification: given a new item whose class is unknown, predict to which class it belongs.
  - Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value.

Data Mining (Cont.)
- Descriptive patterns:
  - Associations: find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too. Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer.
  - Clusters: e.g., typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.:
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be shown compactly as a decision tree.

Construction of Decision Trees
- Training set: a data sample in which the classification is already known.
- Decision trees are generated greedily, top down:
  - Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
  - A leaf node is reached when all (or most) of the items at the node belong to the same class, or when all attributes have been considered and no further partitioning is possible.

Best Splits
- Pick the best attributes and conditions on which to partition.
- The purity of a set S of training instances can be measured quantitatively in several ways. Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
- The Gini measure of purity is defined as Gini(S) = 1 − Σ_{i=1..k} p_i².
- When all instances are in a single class, the Gini value is 0. It reaches its maximum, 1 − 1/k, when each class has the same number of instances.

Best Splits (Cont.)
- Another measure of purity is the entropy measure, defined as entropy(S) = − Σ_{i=1..k} p_i log2 p_i.
- When a set S is split into multiple sets S_i, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as purity(S_1, S_2, ..., S_r) = Σ_{i=1..r} (|S_i| / |S|) purity(S_i).
- The information gain due to a particular split of S into S_i, i = 1, 2, ..., r, is Information-gain(S, {S_1, S_2, ..., S_r}) = purity(S) − purity(S_1, S_2, ..., S_r).

Best Splits (Cont.)
- Measure of the "cost" of a split: Information-content(S, {S_1, S_2, ..., S_r}) = − Σ_{i=1..r} (|S_i| / |S|) log2 (|S_i| / |S|).
- Information-gain ratio = Information-gain(S, {S_1, S_2, ..., S_r}) / Information-content(S, {S_1, S_2, ..., S_r}).
- The best split is the one that gives the maximum information-gain ratio.

Finding Best Splits
- Categorical attributes (with no meaningful order): either a multi-way split, with one child for each value, or a binary split, trying all possible breakups of the values into two sets and picking the best.
- Continuous-valued attributes (which can be sorted in a meaningful order): binary split, i.e., sort the values and try each as a split point (e.g., if the values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, and ≤ 15), then pick the value that gives the best split. A multi-way split is also possible, but a series of binary splits on the same attribute has a roughly equivalent effect.

Decision-Tree Construction Algorithm

procedure GrowTree(S)
    Partition(S);

procedure Partition(S)
    if (purity(S) > δ_p or |S| < δ_s) then return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition S into S_1, S_2, ..., S_r;
    for i = 1, 2, ..., r
        Partition(S_i);
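The purity measures and the greedy split selection can be sketched in runnable form. This is a minimal illustration, not the book's implementation; the helper names and the toy income/credit data are invented:

```python
from math import log2

def gini(classes):
    """Gini(S) = 1 - sum_i p_i^2, where p_i is the fraction in class i."""
    n = len(classes)
    return 1.0 - sum((classes.count(c) / n) ** 2 for c in set(classes))

def entropy(classes):
    """entropy(S) = -sum_i p_i log2 p_i."""
    n = len(classes)
    return -sum((classes.count(c) / n) * log2(classes.count(c) / n)
                for c in set(classes))

def split_purity(parts, measure=gini):
    """purity(S1..Sr) = sum_i (|Si|/|S|) * purity(Si)."""
    total = sum(len(p) for p in parts)
    return sum(len(p) / total * measure(p) for p in parts)

def information_gain_ratio(parent, parts, measure=gini):
    """Information gain divided by the information content of the split."""
    total = len(parent)
    gain = measure(parent) - split_purity(parts, measure)
    content = -sum(len(p) / total * log2(len(p) / total) for p in parts)
    return gain / content if content else 0.0

def best_binary_split(values, classes):
    """Sort the values of a continuous attribute, try each as a split point."""
    best_point, best_ratio = None, -1.0
    for v in sorted(set(values))[:-1]:
        left = [c for x, c in zip(values, classes) if x <= v]
        right = [c for x, c in zip(values, classes) if x > v]
        ratio = information_gain_ratio(classes, [left, right])
        if ratio > best_ratio:
            best_point, best_ratio = v, ratio
    return best_point, best_ratio

# Toy training data: income (continuous attribute) vs. credit class.
income = [20, 30, 80, 90]
credit = ["good", "good", "excellent", "excellent"]
point, ratio = best_binary_split(income, credit)   # splits at income <= 30
```

Here the split at income ≤ 30 separates the two classes perfectly, so the weighted purity of the parts is 0 and the gain ratio is maximal.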
Other Types of Classifiers
- Neural-net classifiers are studied in artificial intelligence and are not covered here.
- Bayesian classifiers use Bayes' theorem, which says

  p(c_j | d) = p(d | c_j) p(c_j) / p(d)

  where p(c_j | d) is the probability of instance d being in class c_j, p(d | c_j) is the probability of generating instance d given class c_j, p(c_j) is the probability of occurrence of class c_j, and p(d) is the probability of instance d occurring.

Naïve Bayesian Classifiers
- Bayesian classifiers require computation of p(d | c_j) and precomputation of p(c_j); p(d) can be ignored, since it is the same for all classes.
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate p(d | c_j) = p(d_1 | c_j) · p(d_2 | c_j) · ... · p(d_n | c_j).
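The naïve Bayesian estimate can be sketched for categorical attributes using per-class value counts, the simplest form of per-attribute histograms. The training data and function names below are invented for illustration:

```python
from collections import Counter, defaultdict

def train(instances, labels):
    """Precompute class frequencies and per-attribute value counts per class."""
    class_counts = Counter(labels)
    # value_counts[cls][i] counts values of attribute i among class-cls instances
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, cls in zip(instances, labels):
        for i, v in enumerate(attrs):
            value_counts[cls][i][v] += 1
    return class_counts, value_counts

def classify(d, class_counts, value_counts):
    """Pick the class maximizing p(c_j) * prod_i p(d_i | c_j); p(d) is ignored."""
    total = sum(class_counts.values())
    best_cls, best_score = None, -1.0
    for cls, n in class_counts.items():
        score = n / total                         # p(c_j)
        for i, v in enumerate(d):
            score *= value_counts[cls][i][v] / n  # histogram estimate of p(d_i | c_j)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# Invented training data: (degree, income level) -> credit class.
X = [("masters", "high"), ("masters", "high"),
     ("bachelors", "low"), ("bachelors", "high")]
y = ["excellent", "excellent", "good", "good"]
cc, vc = train(X, y)
pred = classify(("masters", "high"), cc, vc)
```

Note that an attribute value never seen for a class gets probability zero here; practical implementations smooth the histogram counts (e.g., Laplace smoothing).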
- Each of the p(d_i | c_j) can be estimated from a histogram on d_i values for each class c_j; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.

Regression
- Regression deals with the prediction of a value, rather than a class. Given values for a set of variables X_1, X_2, ..., X_n, we wish to predict the value of a variable Y.
- One way is to infer coefficients a_0, a_1, ..., a_n such that Y = a_0 + a_1 X_1 + a_2 X_2 + ... + a_n X_n.
- Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is called curve fitting.
- The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
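For a single predictor variable, the coefficients of Y = a_0 + a_1 X have a simple closed form under the usual least-squares criterion (the slides do not fix a "best fit" criterion; squared error is assumed here). A minimal sketch:

```python
def linear_fit(xs, ys):
    """Least-squares coefficients (a0, a1) for Y = a0 + a1 * X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a0 = my - a1 * mx
    return a0, a1

# These points lie exactly on Y = 1 + 2*X, so the fit is exact here.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a0, a1 = linear_fit(xs, ys)
```

With noisy data the same formula gives the approximate best fit; for n variables one solves the analogous normal equations (or uses a library routine such as numpy.linalg.lstsq).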
- Regression aims to find coefficients that give the best possible fit.

Association Rules
- Retail shops are often interested in associations between the different items that people buy: someone who buys bread is quite likely also to buy milk, and a person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Association information can be used in several ways: e.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks. The left-hand side of a rule is its antecedent, the right-hand side its consequent.
- An association rule must have an associated population: the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.

Association Rules (Cont.)
- Rules have an associated support, as well as an associated confidence.
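Both measures can be computed directly from a list of transactions; a minimal sketch, with invented purchases:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Invented population: each transaction is the set of items in one sale.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "screwdriver"},
    {"milk"},
]
s = support(transactions, {"bread", "milk"})       # 2 of the 4 transactions
c = confidence(transactions, {"bread"}, {"milk"})  # 2 of the 3 bread sales
```

Here the rule bread ⇒ milk has support 0.5 and confidence 2/3.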
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., if only 0.001 percent of all purchases include both milk and screwdrivers, the support for the rule milk ⇒ screwdrivers is low.
- Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.

Finding Association Rules
- We are generally interested only in association rules with reasonably high support (e.g., support of 2% or greater).
- Naive algorithm: consider all possible sets of relevant items; for each set, find its support (i.e., count how many transactions purchase all the items in the set). Sets with sufficiently high support are called large itemsets.
- Use large itemsets to generate association rules: from an itemset A, generate the rule A − {b} ⇒ b for each b ∈ A. The support of the rule is support(A); the confidence of the rule is support(A) / support(A − {b}).

Finding Support
- Determine the support of itemsets via a single pass over the set of transactions; large itemsets are those with a high count at the end of the pass.
- If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
- Optimization: once an itemset has been eliminated because its count (support) is too small, none of its supersets needs to be considered.
- The a priori technique for finding large itemsets:
  - Pass 1: count the support of all sets with just one item; eliminate the items with low support.
  - Pass i: the candidates are every set of i items such that all of its (i − 1)-item subsets are large; count the support of all candidates.
  - Stop when there are no candidates.

Other Types of Associations
- Basic association rules have several limitations. Deviations from the expected probability are more interesting: e.g., if many people purchase bread and many people purchase cereal, quite a few would be expected to purchase both anyway. We are therefore interested in positive as well as negative correlations between sets of items: positive correlation, where co-occurrence is higher than predicted, and negative correlation, where co-occurrence is lower than predicted.
- Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down within two days.
- Deviations from temporal patterns: e.g., deviation from steady growth. Sales of winter wear going down in summer is not surprising, since it is part of a known pattern; instead, look for deviations from the value predicted using past patterns.

Clustering
- Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
- This can be formalized using distance metrics in several ways: group the points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized, where the centroid is the point defined by taking the average of the coordinates in each dimension. Another metric: minimize the average distance between every pair of points in a cluster.
- Clustering has been studied extensively in statistics, but on small data sets. Data-mining systems aim at clustering techniques that can handle very large data sets, e.g., the Birch clustering algorithm (more on it shortly).

Hierarchical Clustering
- Example from biological classification (the word "classification" here does not mean a prediction mechanism): chordata divides into mammalia (leopards, humans) and reptilia (snakes, crocodiles).
- Other examples: Internet directory systems (e.g., Yahoo; more on this later).
- Agglomerative clustering algorithms build small clusters, then cluster the small clusters into bigger clusters, and so on.
- Divisive clustering algorithms start with all items in a single cluster and repeatedly refine (break) clusters into smaller ones.

Clustering Algorithms
- Clustering algorithms have been designed to handle very large data sets, e.g., the Birch algorithm. Its main idea: use an in-memory R-tree to store the points being clustered, inserting points one at a time into the R-tree.
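The centroid-based objective above is the one the classic k-means (Lloyd's) algorithm greedily minimizes; k-means is not named in the slides, but it makes the objective concrete. A small 2-D sketch with invented points:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Point obtained by averaging the coordinates in each dimension."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def kmeans(points, k, iters=100, seed=0):
    """Group points into k sets, reducing average distance to group centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # arbitrary initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:               # assign each point to its nearest centroid
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        new = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:           # assignments stable: converged
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, clusters = kmeans(points, k=2)
```

Like the statistics-oriented methods mentioned above, this simple loop scans all points on every iteration; algorithms such as Birch avoid that by summarizing points in an in-memory tree structure.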