關(guān)聯(lián)規(guī)則挖掘_第1頁
關(guān)聯(lián)規(guī)則挖掘_第2頁
關(guān)聯(lián)規(guī)則挖掘_第3頁
關(guān)聯(lián)規(guī)則挖掘_第4頁
關(guān)聯(lián)規(guī)則挖掘_第5頁
已閱讀5頁,還剩96頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

1、數(shù)據(jù)挖掘與商務(wù)智能Data Mining & Business Intelligence第三章第三章 關(guān)聯(lián)規(guī)則挖掘關(guān)聯(lián)規(guī)則挖掘西安電子科技大學(xué)軟件學(xué)院主講人:黃健斌內(nèi)容提綱n關(guān)聯(lián)規(guī)則挖掘簡介n關(guān)聯(lián)規(guī)則基本模型n關(guān)聯(lián)規(guī)則挖掘的經(jīng)典算法Apriori FP-Growthn參考文獻(xiàn). 關(guān)聯(lián)規(guī)則簡介n關(guān)聯(lián)規(guī)則反映一個(gè)事物與其它事物之間的相互依存性和關(guān)聯(lián)性。如果兩個(gè)或者多個(gè)事物之間存在一定的關(guān)聯(lián)關(guān)系,那么,其中一個(gè)事物發(fā)生就能夠預(yù)測(cè)與它相關(guān)聯(lián)的其它事物的發(fā)生。n關(guān)聯(lián)規(guī)則挖掘的經(jīng)典范例:購物籃(Market Basket)分析。通過發(fā)現(xiàn)顧客放入購物籃中商品之間的同現(xiàn)關(guān)系來分析顧客的購買習(xí)慣,從而實(shí)

2、現(xiàn)商品的交叉銷售和推薦。 事務(wù)與規(guī)則nGiven a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction購物籃事物集合:購物籃事物集合:關(guān)聯(lián)規(guī)則示例:關(guān)聯(lián)規(guī)則示例:Diaper Beer,Milk, Bread Eggs,Coke,Beer, Bread Milk,Implication means co-occurrence, not causality!蘊(yùn)含意味著同現(xiàn)而非因果

3、關(guān)系!事務(wù)集合:所有事務(wù)集合事務(wù)集合:所有事務(wù)集合T=t1,t2, tN項(xiàng)集:所有項(xiàng)的集合項(xiàng)集:所有項(xiàng)的集合I=i1, i2, id事務(wù)事務(wù): 若干項(xiàng)組成的集合若干項(xiàng)組成的集合tj=ij1, ij2, , ijkFrequent Itemset 頻繁項(xiàng)集nItemset 項(xiàng)項(xiàng)A collection of one or more itemsnExample: Milk, Bread, Diaperk-itemset k-項(xiàng)nAn itemset that contains k itemsnSupport count ( ) 支持度計(jì)數(shù)支持度計(jì)數(shù)Frequency of occurrence o

4、f an itemsetE.g. (Milk, Bread,Diaper) = 2 nFrequent Itemset 頻繁項(xiàng)集頻繁項(xiàng)集An itemset whose support is greater than or equal to a minsup threshold 最小支持度計(jì)數(shù)(),iiiXt Xt tTAssociation Rule 關(guān)聯(lián)規(guī)則()()XYs XYNnAssociation RuleAn implication expression of the form X Y, where X and Y are itemsetsExample: Milk, Diaper

5、 Beer nRule Evaluation MetricsSupport (s) 支持度nFraction of transactions that contain both X and YConfidence (c) 置信度nMeasures how often items in Y appear in transactions thatcontain X()()()XYc XYXDefinition: Association RuleExample:BeerDiaper,Milk4 . 052|T|)BeerDiaper,Milk(s67. 032)Diaper,Milk()BeerDi

6、aper,Milk,(cnAssociation RuleAn implication expression of the form X Y, where X and Y are itemsetsExample: Milk, Diaper Beer nRule Evaluation MetricsSupport (s)nFraction of transactions that contain both X and YConfidence (c)nMeasures how often items in Y appear in transactions thatcontain X規(guī)則度量:支持度

7、與置信度信度n規(guī)則 X Y 的支持度和可信度支持度支持度 s:一次交易中同時(shí)包含X 、 Y 的可能性置信度置信度 c :包含項(xiàng)X 的交易中同時(shí)也包含Y的條件概率交易ID購買的商品2000A,B,C1000A,C4000A,D5000B,E,F設(shè)最小支持度為0.5, 最小可信度為 0.5, 則可得到關(guān)聯(lián)規(guī)則A C (0.5, 0.67)C A (0.5, 1)買尿布的客戶買尿布的客戶二者都買二者都買的客戶的客戶買啤酒的客戶買啤酒的客戶什么是關(guān)聯(lián)規(guī)則挖掘n關(guān)聯(lián)規(guī)則挖掘 首先被IBM公司Almaden研究中心的R. Agrawal, Imielinski and Swami在1993年的SIGMOD

8、會(huì)議上提出在事務(wù)、關(guān)系數(shù)據(jù)中發(fā)現(xiàn)頻繁項(xiàng)集和關(guān)聯(lián)規(guī)則頻繁項(xiàng)集頻繁項(xiàng)集: 事務(wù)數(shù)據(jù)中支持度大于最小支持度閾值minsup的所有項(xiàng)集關(guān)聯(lián)規(guī)則關(guān)聯(lián)規(guī)則:事務(wù)數(shù)據(jù)中支持度大于最小支持度閾值minsup且置信度大于最小置信度閾值minconf的所有規(guī)則什么是關(guān)聯(lián)規(guī)則挖掘n關(guān)聯(lián)規(guī)則挖掘的意義: 發(fā)現(xiàn)數(shù)據(jù)中的規(guī)律超市數(shù)據(jù)中的什么產(chǎn)品會(huì)一起購買? 啤酒和尿布在得知某用戶買了一臺(tái)PC之后,預(yù)測(cè)他同時(shí)還會(huì)購買什么商品?關(guān)聯(lián)規(guī)則挖掘nGiven a set of transactions T, the goal of association rule mining is to find all rules havin

9、g support minsup thresholdconfidence minconf thresholdnBrute-force approach 蠻力方法:List all possible association rulesCompute the support and confidence for each rulePrune rules that fail the minsup and minconf thresholds 計(jì)算開銷極大,無法應(yīng)用于大規(guī)模數(shù)據(jù)集!關(guān)聯(lián)規(guī)則挖掘Example of Rules:Milk,Diaper Beer (s=0.4, c=0.67)Milk,B

10、eer Diaper (s=0.4, c=1.0)Diaper,Beer Milk (s=0.4, c=0.67)Beer Milk,Diaper (s=0.4, c=0.67) Diaper Milk,Beer (s=0.4, c=0.5) Milk Diaper,Beer (s=0.4, c=0.5)Observations 觀察到的現(xiàn)象: All the above rules are binary partitions of the same itemset: Milk, Diaper, Beer Rules originating from the same itemset have

11、 identical support but can have different confidence Thus, we may decouple the support and confidence requirements關(guān)聯(lián)規(guī)則挖掘nTwo-step approach: Frequent Itemset Generation 產(chǎn)生頻繁項(xiàng)集Generate all itemsets whose support minsupRule Generation 產(chǎn)生規(guī)則Generate high confidence rules from each frequent itemset, where

12、 each rule is a binary partitioning of a frequent itemsetnFrequent itemset generation is still computationally expensive 產(chǎn)生頻繁項(xiàng)集的計(jì)算代價(jià)依然很高產(chǎn)生頻繁項(xiàng)集nullABACADAEBCBDBECDCEDEABCDEABCABDABEACDACEADEBCDBCEBDECDEABCDABCEABDEACDEBCDEABCDEGiven d items, there are 2d possible candidate itemsets產(chǎn)生頻繁項(xiàng)集nBrute-force

13、approach 蠻力法: Each itemset in the lattice is a candidate frequent itemsetCount the support of each candidate by scanning the databaseMatch each transaction against every candidateComplexity O(NMw) = Expensive since M = 2d !計(jì)算復(fù)雜度nGiven d unique items:Total number of itemsets = 2dTotal number of possi

14、ble association rules: 1231111dddkkdjjkdkdRIf d=6, R = 602 rules頻繁項(xiàng)集產(chǎn)生算法1:ApriorinApriori算法命名源于算法使用了頻繁項(xiàng)集性質(zhì)的先驗(yàn)(Prior)知識(shí)。 nApriori算法將發(fā)現(xiàn)關(guān)聯(lián)規(guī)則的過程分為兩個(gè)步驟:檢索出事務(wù)數(shù)據(jù)庫中的所有頻繁項(xiàng)集,即支持度不低于用戶設(shè)定最小支持度計(jì)數(shù)閾值的項(xiàng)集;利用頻繁項(xiàng)集構(gòu)造出滿足用戶最小信任度的規(guī)則。n挖掘并產(chǎn)生所有頻繁項(xiàng)集是該算法的核心,占整個(gè)計(jì)算量的大部分。 頻繁項(xiàng)集的性質(zhì): n性質(zhì)1:頻繁項(xiàng)集的子集必為頻繁項(xiàng)集。 n性質(zhì)2:非頻繁項(xiàng)集的超集一定是非頻繁的。 nAprior

15、i算法運(yùn)用性質(zhì)1,通過已知的頻繁項(xiàng)集構(gòu)成長度更大的項(xiàng)集,并將其稱為潛在頻繁項(xiàng)集。潛在頻繁k項(xiàng)集的集合Ck 是指由有可能成為頻繁k項(xiàng)集的項(xiàng)集組成的集合。以后只需計(jì)算潛在頻繁項(xiàng)集的支持度,而不必計(jì)算所有不同項(xiàng)集的支持度,因此在一定程度上減少了計(jì)算量。 Found to be Infrequent頻繁項(xiàng)集的性質(zhì):Pruned supersetsApriori算法n(1) L1=頻繁1項(xiàng)集; n(2) for(k=2;Lk-1;k+) do begin n(3) Ck=apriori_gen(Lk-1); /新的潛在頻繁項(xiàng)集 n(4) for all transactions tD do begin

16、n(5) Ct=subset(Ck,t); /t中包含的潛在頻繁項(xiàng)集 n(6) for all candidates cCt do n(7) c.count+; n(8) end; n(9) Lk=cCk|c.countminsup n(10) end; n(11) Answer= kkL實(shí)例Database TDB1st scanC1L1L2C2C22nd scanC3L33rd scanTidItems10A, C, D20B, C, E30A, B, C, E40B, EItemsetsupA2B3C3D1E3ItemsetsupA2B3C3E3ItemsetA, BA, CA, EB,

17、 CB, EC, EItemsetsupA, B1A, C2A, E1B, C2B, E3C, E2ItemsetsupA, C2B, C2B, E3C, E2ItemsetB, C, EItemsetsupB, C, E2Visualization of Association Rules: Pane GraphVisualization of Association Rules: Rule Graph提高Apriori算法性能的方法nHash-based itemset counting(散列項(xiàng)集計(jì)數(shù))nTransaction reduction(事務(wù)壓縮)nPartitioning(劃分

18、)nSampling(采樣)n用Frequent-Pattern tree (FP-tree) 結(jié)構(gòu)壓縮數(shù)據(jù)庫, 高度濃縮,同時(shí)對(duì)頻繁集的挖掘是完備的避免代價(jià)較高的數(shù)據(jù)庫掃描n開發(fā)一種高效的基于FP-tree的頻繁集挖掘算法采用分而治之的方法學(xué):分解數(shù)據(jù)挖掘任務(wù)為小任務(wù)避免生成關(guān)聯(lián)規(guī)則: 只使用部分?jǐn)?shù)據(jù)庫!頻繁項(xiàng)集產(chǎn)生算法2:FP-growth構(gòu)建FP-treeTIDItems1A,B2B,C,D3A,C,D,E4A,D,E5A,B,C6A,B,C,D7B,C8A,B,C9A,B,D10B,C,EnullA:1B:1nullA:1B:1B:1C:1D:1After reading TID=1:

19、After reading TID=2:構(gòu)建FP-TreenullA:7B:5B:3C:3D:1C:1D:1C:3D:1D:1E:1E:1TIDItems1A,B2B,C,D3A,C,D,E4A,D,E5A,B,C6A,B,C,D7B,C8A,B,C9A,B,D10B,C,EPointers are used to assist frequent itemset generationD:1E:1Transaction DatabaseItemPointerABCDEHeader tablef:4c:1b:1p:1b:1c:3a:3b:1m:2p:2m:1頭表頭表Item frequency h

20、ead f4c4a3b3m3p3最小支持度最小支持度 = 0.5TIDItems bought (ordered) frequent items100f, a, c, d, g, i, m, pf, c, a, m, p200a, b, c, f, l, m, of, c, a, b, m300 b, f, h, j, of, b400 b, c, k, s, pc, b, p500 a, f, c, e, l, p, m, nf, c, a, m, p步驟:掃描數(shù)據(jù)庫一次,得到頻繁1-項(xiàng)集把項(xiàng)按支持度遞減排序1.再一次掃描數(shù)據(jù)庫,建立FP-tree建立簡化的FP-tree樹:示例n基本思想

21、(分治)用FP-tree遞歸增長頻繁集n方法 對(duì)每個(gè)項(xiàng),生成它的 條件模式庫, 然后是它的 條件 FP-tree對(duì)每個(gè)新生成的條件FP-tree,重復(fù)這個(gè)步驟直到結(jié)果FP-tree為空, 或只含唯一的一個(gè)路徑 (此路徑的每個(gè)子路徑對(duì)應(yīng)的項(xiàng)集都是頻繁集)用FP-tree挖掘頻繁集n從FP-tree的頭表開始n按照每個(gè)頻繁項(xiàng)的連接遍歷 FP-treen列出能夠到達(dá)此項(xiàng)的所有前綴路徑,得到條件模式庫條件模式庫條件模式庫itemcond. pattern basecf:3afc:3bfca:1, f:1, c:1mfca:2, fcab:1pfcam:2, cb:1f:4c:1b:1p:1b:1c:3

22、a:3b:1m:2p:2m:1頭表頭表Item frequency head f4c4a3b3m3p3步驟1: 從 FP-tree 到條件模式庫n對(duì)每個(gè)模式庫計(jì)算庫中每個(gè)項(xiàng)的支持度用模式庫中的頻繁項(xiàng)建立FP-treem-條件模式庫條件模式庫:fca:2, fcab:1f:3c:3a:3m-conditional FP-treeAll frequent patterns concerning mm, fm, cm, am, fcm, fam, cam, fcamf:4c:1b:1p:1b:1c:3a:3b:1m:2p:2m:1頭表頭表Item frequency head f4c4a3b3m3p

23、3步驟2: 建立條件 FP-treen完備: 不會(huì)打破交易中的任何模式包含了頻繁模式挖掘所需的全部信息n緊密去除不相關(guān)信息不包含非頻繁項(xiàng)支持度降序排列: 支持度高的項(xiàng)在FP-tree中共享的機(jī)會(huì)也高決不會(huì)比原數(shù)據(jù)庫大(如果不計(jì)算樹節(jié)點(diǎn)的額外開銷)FP-tree 結(jié)構(gòu)的優(yōu)點(diǎn)33The Frequent Pattern Growth Mining MethodnIdea: Frequent pattern growth 頻繁模式增長Recursively grow frequent patterns by pattern and database partitionnMethod For each

24、 frequent item, construct its conditional pattern-base, and then its conditional FP-treeRepeat the process on each newly created conditional FP-tree Until the resulting FP-tree is empty, or it contains only one pathsingle path will generate all the combinations of its sub-paths, each of which is a f

25、requent pattern34FP-Growth vs. Apriori: Scalability With the Support Threshold010203040506070809010000.511.522.53Support threshold(%)Run time(sec.)D1 FP-grow th runtimeD1 Apriori runtimeData set T25I20D10K2022年6月9日星期四Data Mining: Concepts and Techniques35FP-Growth vs. Tree-Projection: Scalability wi

26、th the Support Threshold02040608010012014000.511.52Support threshold (%)Runtime (sec.)D2 FP-growthD2 TreeProjectionData set T25I20D100K產(chǎn)生關(guān)聯(lián)規(guī)則n任務(wù)描述:給定頻繁項(xiàng)集Y, 查找Y的所有非空真子集X Y,使得 X Y X 的置信度超過最小置信度閾值minconf例子:If A,B,C is a frequent itemset, 候選規(guī)則如下: AB C, AC B, BC AA BC,B AC,C ABn如果 |Y| = k, 那么會(huì)有 2k 2 個(gè)候選關(guān)

27、聯(lián)規(guī)則 (不包括 Y and Y)產(chǎn)生關(guān)聯(lián)規(guī)則nHow to efficiently generate rules from frequent itemsets?通常,置信度不滿足反單調(diào)性(anti-monotone property ),例如:c(ABC D) 可能大于也可能小于 c(AB D)但是,針對(duì)同一個(gè)頻繁項(xiàng)集的關(guān)聯(lián)規(guī)則,如果規(guī)則的后件滿足子集關(guān)系,那么這些規(guī)則的置信度間滿足反單調(diào)性e.g., Y= A,B,C,D: c(ABC D) c(AB CD) c(A BCD)Rule Generation for Apriori AlgorithmLattice of rulesPrune

28、d RulesLow Confidence Rule針對(duì)Apriori 算法的規(guī)則產(chǎn)生方法nCandidate rule is generated by merging two rules that share the same prefixin the rule consequentnjoin(CD=AB,BD=AC)would produce the candidaterule D = ABCnPrune rule D=ABC if itssubset AD=BC does not havehigh confidenceBD=ACCD=ABD=ABC參考文獻(xiàn)nAgrawal R, Imie

29、linski T, and Swami A. Mining association rules between sets of items in large databases. SIGMOD, 207-216, 1993.nAgrawal R, and Srikant R. Fast algorithms for mining association rules in large databases. VLDB, 478-499, 1994. nHan J W, Pei J, Yin Y W. Mining frequent patterns without candidate genera

30、tion. SIGMOD, 1-12, 2000.nHan J W, Pei J, Yin Y W, and Mao R Y. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Mining and Knowledge Discovery. 8, 53-87, 2004課程安排n通過垂直數(shù)據(jù)格式挖掘關(guān)聯(lián)規(guī)則nECLATn挖掘最大、閉合關(guān)聯(lián)規(guī)則模nMaxMinernCLOSETnCLOSETnCHARM第一部分通過垂直數(shù)據(jù)格式挖掘關(guān)聯(lián)規(guī)則ECLAT基本概念n水

31、平數(shù)據(jù)格式:TID:itemsetn垂直數(shù)據(jù)格式:item:TID_setTID商品IDT100I1 I2 T200I2 I4T300I2 I3T400I1 I2 I3項(xiàng)TID集I1T100,T400I2T100,T200,T300,T400I3T300,T400I4T200ECLAT:使用垂直數(shù)據(jù)格式進(jìn)行挖掘nECLAT:生成-檢測(cè)模型n剪枝策略:子集不頻繁剪枝n算法步驟:1、掃描事務(wù)數(shù)據(jù)庫TDB,將水平數(shù)據(jù)格式轉(zhuǎn)化成垂直數(shù)據(jù)格式垂直數(shù)據(jù)格式。2、通過對(duì)集合操作,得到滿足最小支持度域值min_sup的頻繁頻繁1項(xiàng)集項(xiàng)集。3、通過頻繁k項(xiàng)集產(chǎn)生候選k+1項(xiàng)集,通過集合求集合求交交來產(chǎn)生頻繁K+

32、1項(xiàng)集。4、直到不再生成候選項(xiàng)集或不再產(chǎn)生頻繁項(xiàng)集,程序結(jié)束。算法實(shí)例例:TIDList of item IDsT100I1,I2,I5T200I2,I4T300I2,I3T400I1,I2,I4T500I1,I3T600I2,I3T700I1,I3T800I1,I2,I3,I5T900I1,I2,I3構(gòu)造垂直數(shù)據(jù)格式:構(gòu)造垂直數(shù)據(jù)格式:itemsetTID_setI1T100,T400,T500,T700,T800,T900I2T100,T200,T300,T400,T600,T800,T900I3T300,T500,T600,T700,T800,T900I4T200,T400I5T100,

33、T800ECLAT構(gòu)造頻繁1項(xiàng)集(min_sup=2):itemsetTID_setI1T100,T400,T500,T700,T800,T900I2T100,T200,T300,T400,T600,T800,T900I3T300,T500,T600,T700,T800,T900I4T200,T400I5T100,T800ECLATitemsetTID_setI1,I2T100,T400,T800,T900I1,I3T500,T700,T800,T900I1,I4T400I1,I5T100,T800I2,I3T300,T600,T800,T900I2,I4T200,T400I2,I5T100,

34、T800I3,I5T800構(gòu)造頻繁2項(xiàng)集(min_sup=2):ECLATECLAT直到?jīng)]有頻繁項(xiàng)集或候選項(xiàng)集產(chǎn)生,算法結(jié)束課程安排第二部分挖掘最大、閉合關(guān)聯(lián)規(guī)則模式 MaxMiner、CLOSET、CLOSET+、CHARMnDB = , min_sup=2;問題:頻繁項(xiàng)集有哪些呢250-1 !?基本概念n最大頻繁項(xiàng)集最大頻繁項(xiàng)集是這樣的頻繁項(xiàng)集,它的直接超集都是不頻繁的。最大頻繁最大頻繁項(xiàng)集項(xiàng)集基本概念n閉項(xiàng)集項(xiàng)集X是閉的,如果它的直接超集都不具有和它相同的支持度計(jì)數(shù)。n頻繁閉項(xiàng)集如果項(xiàng)集X是閉的,并且它的支持度計(jì)數(shù)大于或等于最小支持度閾值min_sup,則項(xiàng)集X是頻繁閉項(xiàng)集。最大項(xiàng)集VS

35、閉頻繁項(xiàng)集nullABACADAEBCBDBECDCEDEABCDEABCABDABEACDACEADEBCDBCEBDECDEABCDABCEABDEACDEBCDEABCDE12412312342453451212424412323243445122244423424閉頻繁項(xiàng)集閉頻繁項(xiàng)集且最大頻繁項(xiàng)集TID項(xiàng)1abc2abcd3bce4acde5deminsup=2TDB頻繁項(xiàng)集、最大頻繁項(xiàng)集和頻繁閉項(xiàng)集之間的關(guān)系MaxMiner:挖掘最大模式n完全集合枚舉樹A (BCD)B (CD)C (D)D ()AB (CD)AC (D) AD ()BC (D) BD ()CD ()ABC (C)A

36、BCD ()ABD () ACD ()BCD () (ABCD)設(shè)TDB中包含有項(xiàng)ABCDMaxMinern剪枝原理子集不頻繁剪枝:任何非頻繁項(xiàng)集的超集都是非頻繁項(xiàng)集;超集頻繁剪枝:超集頻繁剪枝:任何頻繁項(xiàng)集的子集都是頻繁項(xiàng)集;n搜索策略廣度優(yōu)先搜索完全集合枚舉樹廣度優(yōu)先搜索完全集合枚舉樹MaxMiner算法實(shí)例(1/4)TidItems10A,B,C,D,E20B,C,D,E,30A,C,D,F (ABCDEF)ItemsFrequencyABCDEF0A2B2C3D3E2F1Min_sup=2Max patterns:A (BCDE)B (CDE) C (DE)E ()D (E)MaxMi

37、ner算法實(shí)例(2/4)TidItems10A,B,C,D,E20B,C,D,E,30A,C,D,F (ABCDEF)ItemsFrequencyABCDE1AB1AC2AD2AE1Min_sup=2A (BCDE)B (CDE) C (DE)E ()D (E)AC (D) AD ()Max patterns:Node AMaxMiner算法實(shí)例(3/4)TidItems10A,B,C,D,E20B,C,D,E,30A,C,D,F (ABCDEF)ItemsFrequencyBCDE2BCBDBEMin_sup=2A (BCDE)B (CDE) C (DE)E ()D (E)AC (D) AD

38、 ()Max patterns:BCDENode BMaxMiner算法實(shí)例(4/4)TidItems10A,B,C,D,E20B,C,D,E,30A,C,D,F (ABCDEF)ItemsFrequencyACD2Min_sup=2A (BCDE)B (CDE) C (DE)E ()D (E)AC (D) AD ()Max patterns:BCDEACDNode ACCLOSET :高效的挖掘頻繁閉項(xiàng)集算法n頻繁項(xiàng)集列表(f_list)給定事務(wù)數(shù)據(jù)庫TDB和最小支持度計(jì)數(shù)閾值min_sup,將所有的頻繁項(xiàng)按照支持度計(jì)數(shù)遞減順序構(gòu)成的表。n條件數(shù)據(jù)庫(conditional database

39、)給定一個(gè)事務(wù)數(shù)據(jù)庫TDB,若項(xiàng)i是一個(gè)頻繁項(xiàng)。 項(xiàng)i的條件數(shù)據(jù)庫(用TDB|i表示),是TDB中包含項(xiàng)i的交易構(gòu)成的子集,并且所有出現(xiàn)的非頻繁項(xiàng)、項(xiàng)i和f_list中項(xiàng)i后面的項(xiàng)都刪除。構(gòu)造f_listTIDItems10a,c,d,e,f20a,b,e30c,e,f40a,c,d,f50c,e,f則由f_list定義得到:f_list:設(shè)有如下事物數(shù)據(jù)庫TDB(min_sup=2):首先計(jì)算出每個(gè)項(xiàng)的支持度計(jì)數(shù):a:3,b:1,c:4,d:2,e:4,f:4, 條件數(shù)據(jù)庫的構(gòu)造TDBcefadeacefcfadceff_list:d-cond DB(d:2)cefacfaa-cond DB

40、(a:3)cefecff-cond DB(f:4)ce:3ce-cond DB(e:4)cCLOSET算法n引理1挖掘所有頻繁閉項(xiàng)集的問題可以被分解成n個(gè)子問題;第j個(gè)問題是找到包含項(xiàng)in+1-j但是不包含ik (n+1-jkn)的完全頻繁閉項(xiàng)集。n引理2如果項(xiàng)集X是一個(gè)頻繁閉項(xiàng)集,則在X條件數(shù)據(jù)庫中不存在項(xiàng)i,使得項(xiàng)i出現(xiàn)在所有的事務(wù)中。n引理3在項(xiàng)集X的條件數(shù)據(jù)庫中,Y是每一個(gè)事務(wù)都出現(xiàn)的項(xiàng)構(gòu)成的最大的集合,如果X沒有以相同的支持度計(jì)數(shù)被已發(fā)現(xiàn)的頻繁閉項(xiàng)集融合,則X是一個(gè)頻繁閉項(xiàng)集。n引理 項(xiàng)融合(item merging)項(xiàng)集X是一個(gè)頻繁項(xiàng)集,如果每個(gè)包含項(xiàng)集X的事務(wù)中都包含有項(xiàng)集,而不

41、是項(xiàng)集的任何一個(gè)直接超集,則XY構(gòu)成頻繁閉項(xiàng)集并且沒有必要再搜索任何包含項(xiàng)集而不含項(xiàng)集的項(xiàng)集。CLOSE算法實(shí)例(1/3)選擇的事務(wù)數(shù)據(jù)庫TIDTIDItems10a,c,d,e,f20a,b,e30c,e,f40a,c,d,f50c,e,f最小支持度計(jì)數(shù)min_sup=2CLOSE算法實(shí)例(2/3)n1、構(gòu)造f_list計(jì)算TDB中每個(gè)項(xiàng)的支持度計(jì)數(shù),得到:a:3 b:1 c:4 d:2 e:4 f:4 min_sup=2f_list:CLOSE算法實(shí)例(3/3)n2、構(gòu)造條件數(shù)據(jù)庫TDBcefadeacefcfadceff_list:a-cond DB(a:3)cefecff-cond D

42、B(f:4)ce:3ce-cond DB(e:4)cd-cond DB(d:2)cefacfaOutput: cfad:2Output: a:3Output: cf:4;cef:3Output: e:4ea-cond DB(ea:2)cOutput: ea:2 CLOSET+ :高效的挖掘頻繁閉項(xiàng)集方法n混合樹投影技術(shù)(hybrid tree-projection method)1、自底向上物理的樹投影(bottom-up physical tree-projection)-高密度的數(shù)據(jù)集2、自頂向下偽樹投影(top-down pseudo tree-projection)-稀疏數(shù)據(jù)集Bott

43、om-up physical tree-projectionf4c4a3b3m 3p3rootf:4c:3c:1m:2b:1b:1b:1p:1a:3p:2m:1(a)Global FP-treec3f2a2m 2f:2c:3m:2a:2root(b)Projected FP-tree with prefix pfcam:2cb:1Top-down pseudo tree-projectionrootf:4c:3c:1m:2b:1b:1b:1p:1a:3p:2m:1fcabmp443333(a)Pseudo tree for prefix ffcabmp443333rootf:4c:3c:1m:

44、2b:1b:1b:1p:1a:3p:2m:1(b)Pseudo tree for prefix cHHTop-down pseudo tree-projection for f:4cabmp32232rootf:4c:3c:1m:2b:1b:1b:1p:1a:3p:2m:1Hf:4amp332rootf:4c:3c:1m:2b:1b:1b:1p:1a:3p:2m:1Hfc:3=Hfcam:3(a)(b)Top-down pseudo tree-projection for f:4amp332rootf:4c:3c:1m:2b:1b:1b:1p:1a:3p:2m:1Hfc:3=Hfcam:3am

45、p332rootf:4c:3c:1m:2b:1b:1b:1p:1a:3p:2m:1Hfc:3=Hfcam:3(c)(d)CLOSET+算法n引理1:項(xiàng)的跳躍(Item skipping)若局部頻繁項(xiàng)i,在不同層次的一些頭表(header tables)中有相同的支持度計(jì)數(shù),則可以安全地將項(xiàng)i從較高的層次的頭表(header tables)中移除。n定理1:子集檢測(cè)(Subset checking)在分治框架下,使用項(xiàng)集融合剪枝方法,通過CLOSET+算法得到的頻繁項(xiàng)集,如果它不能被其它的任何已找到的頻繁閉項(xiàng)集融合,則該項(xiàng)集是頻繁項(xiàng)集。CLOSET+算法n加速子集檢測(cè)方法:(1)兩層哈希索引結(jié)果

46、樹(2)基于向上檢測(cè)的偽投影CLOSET+算法nCLOSET+算法步驟:1、掃描事務(wù)數(shù)據(jù)庫,構(gòu)造f_list。2、使用f_list,通過掃描事務(wù)數(shù)據(jù)庫建立FP-tree。3、使用分治策略分治策略和深度優(yōu)先搜索深度優(yōu)先搜索方法,(1)當(dāng)數(shù)據(jù)集是稀疏稀疏時(shí),使用自頂向下方法挖掘FP-tree來尋找頻繁閉項(xiàng)集。(2)當(dāng)數(shù)據(jù)集是密集密集時(shí),使用自底向上的方法挖掘FP-tree來尋找頻繁閉項(xiàng)集。4、當(dāng)全局頭表上的所有項(xiàng)都被挖掘了,算法結(jié)束。CHARM算法n定義 對(duì)于項(xiàng)集X,定義X對(duì)應(yīng)的事務(wù)id集合(tidset)為t(X):t(X)=xXt(x)對(duì)于事務(wù)id集合,定義對(duì)應(yīng)的項(xiàng)集(itemset)為i(Y

47、):i(Y)= yYi(y)CHARM:使用垂直數(shù)據(jù)格式挖掘閉項(xiàng)集nIT-Tree:Itemset-Tidset Search TreeTIDitems1ACTW2CDW3ACTW4ACDW5ACDTW6CDT設(shè)事務(wù)數(shù)據(jù)庫TDB為:IT-Tree則對(duì)應(yīng)的IT-Tree為:n等價(jià)類(Equivalence Classes)等價(jià)類P=l1,l2,ln,其中P是父節(jié)點(diǎn)(前驅(qū)),且每個(gè)li表示單獨(dú)的項(xiàng)。例如:前面IT-Tree中根節(jié)點(diǎn)的等價(jià)類為: =A,C,D,T,Wn項(xiàng)集的閉包項(xiàng)集X的閉包是包含項(xiàng)集X的最小的閉集。c(X) = i t(X) = i(t(X).n結(jié)論:項(xiàng)集X是閉的,當(dāng)且僅當(dāng)X=c(X

48、)。CHARM算法n定理1設(shè)Xit(Xi)和Xjt(Xj)是類P的任意兩個(gè)成員,如果滿足XifXj,其中f表示一個(gè)全序關(guān)系(如,按字典順序或者基于支持度的順序)。存在如下四條性質(zhì):1.如果t(Xi)=t(Xj),則c(Xi)=c(Xj)=c(XiXj)2.如果t(Xi) t(Xj),則c(Xi) c(Xj),但是滿足c(Xi)=c(XiXj)3.如果t(Xi)?t(Xj),則c(Xi) c(Xj),但是滿足c(Xj)=c(XiXj)4.如果t(Xi) t(xj),則c(Xi) c(Xj)c(Xi Xj)n定理產(chǎn)生的啟發(fā)信息:1、由于c(Xi) = c(Xj) = c(Xi Xj),所以可以用Xi Xj替換掉Xi,并且將Xj從等價(jià)類中移除。2、由于c(Xi Xj) = c(Xi) c(Xj ),所以可以用Xi Xj替換掉Xi;但是由于c(Xi) c(Xj ),所以不能將元素Xj從等價(jià)類中移除。3、與2相似。4、由于c(Xi)c(Xj)c(XiXj),所以Xi和Xj都能各自產(chǎn)生不同的

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論