一种改进的Apriori关联规则挖掘算法_英文_.doc_第1页
一种改进的Apriori关联规则挖掘算法_英文_.doc_第2页
一种改进的Apriori关联规则挖掘算法_英文_.doc_第3页
一种改进的Apriori关联规则挖掘算法_英文_.doc_第4页
一种改进的Apriori关联规则挖掘算法_英文_.doc_第5页
免费预览已结束,剩余1页可下载查看

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

一种改进的 Apriori 关联规则挖掘算法张广路1 ,雷景生2 ,吴兴惠1(1 . 海南师范大学 数学与统计学院 ,海南 海口 571158 ;2 . 南京邮电大学 信息与技术学院 ,江苏 南京 211815)摘 要 :关联规则挖掘是数据挖掘中的一个重要研究内容。为了高效、快速地从事务数据库中挖掘出频繁项集,针对数据挖掘的经典关联规则 Apriori 算法的瓶颈问题提出了改进的方法。算法将事物数据库映射到布尔型数组中,然后所有的操 作都针对数组元素值展开。这样大大减少了数据库的扫描次数。算法利用数组的随机访问特性及布尔型数据的简单 “与”操作 ,直接产生频繁项集 ,而不产生大量的候选项集。经理论分析和实验结果显示该算法在效率上明显优于 Apriori 算法。关键词 :数据挖掘 ;关联规则 ;Apriori 算法 ;频繁项集中图分类号 : TP311文献标识码 :A文章编号 :1673 - 629 X( 2010) 06 - 0084 - 05An Improved Apriori Algorithm f or Min ing Assoc iat ion RulesZHAN G Guang2lu1 ,L E I J ing2sheng2 ,WU Xing2hui1(1 . School of Mat hematics and Statistics , Hainan Normal U niversity , Hai kou 571158 ,China ;2 . School of Informatio n Science and Technology ,Nanjing U niversity of Post s andTeleco mmunicatio ns ,Nanjing 211815 ,China)Abstract :Associatio n rule mining is an impo rtant part of research co ntent in data mining. In o rder to efficiently and quickly mine all f re2quent itemset f ro m t he t ransactio n database ,an imp roved algo rit hm of mining associatio n rules is p resented fo r t he bot tleneck p ro blem oft he classic Ap rio ri algo rit hm. The t ransactio n database is mapped to Bool array , t hen all t he operatio ns are carried o ut based o n array ele2 ment s value , t hereby reducing t he database scanning f requency. Then use bit wise“AND”operatio n and rando m access characteristics of array , a direct co nsequence of f requent itemset s , rat her t han have a large number of candidate set s , t hereby imp roving t he efficiency oft he algo rit hm.Key words :data mining ;associatio n rules ;Ap rio ri algo rit hm ;f requent itemsetp redeter mined minimum support count , min - sup . Sec2 o nd , generating interesting associatio n rules f ro m t he f re2 quent itemset s , i . e . , t hese rules must satisfy minimum support and minimum co nfidence . Because t he seco nd step is much less costly t han t he first step , t he overall perfor2 mance of mining associatio n rules is deter mined by t he first step , so mining associatio n rules is usually co nverted to mining f requent itemset s. Nowadays , t here have manyworks1 5 focus o n f requent itemset s mining. Frequentpat tern mining of ten generates a very large number of pat2terns and rules , which reduces not o nly t he efficiency but0IntroductionMining Associatio n Rules is a research field by p ro2posed earlier , as an important part of data mining , t hathas experienced t he develop ment of a lo nger period of time. Associatio n rule mining aims to find rules in t he t ransactio ns database D wit h user given minimum support and minimum co nfidence , t hat can be seen as t wo p ro2 cesses : First , finding all f requent itemset s , i . e . , each oft hese itemset s will be occurred at least as f requently as a收稿日期 :2009 - 10 - 26 ;修回日期 :2010 - 01 - 22基金项目 :海南省自然科学基金资助项目 ( 808155 ,109002) ; 海南师 范大学青年教师科研启动资助项目 ( QN0923) ; 海南师范大学教改 项目( HSJ G0924)作者简介 : 张广路 ( 1978 - ) , 女 , 山东泰安人 , 讲师 , 硕士 , 研究方向 为数据挖掘技术、数据库系统。6 ,7also t he effectiveness of mining. To efficiently solve8t he p roblem , t raditio nal algorit hms like Ap rioriand it svariant s9 11 focus o n reducing t he number of database scans as well as cut ting down t he enumeratio n space ofcandidate itemset s ( CI s ) . By t he downward closurep ropert y of F Is - allsubset s of a f requent itemset must al2so be f requent - t hey can iteratively enumerate ,p rune ,and test 1 - extensio n superset s of existing F Is and end up wit h much reduced number of CIs. The major p roblemwit h t hese algorit hms is t hat t heir repeated search of p rof use itemset s against t he huge amount of t ransactio ns t urns out to be very time - co nsuming and wastef ul . For densedatabases wit h enor mous lo ng F Is , t he breadt h - firstsearch of t he itemset lat tice and subset testing to p rune CIs by existing F Is are bot h space and time co nsuming. Anot her innovative algorit hm targeting dense databases isFP - growt h12 . It co mpact s t he repetitive t ransactio ns into so me co ncise FP - t ree . Itemset s are organized in t hat f requency - ordered p refix t ree so t hat t hey share co mmo n p refix part as much as possible . This app roach can cut down t he database sizes and reduce repeated co mp utatio n for dense databases. In t his paper , an effective mining al2 gorit hm is p roposed based o n t wo - dimensio nal Bool ar2ray. Algorit hm uses bit wise“and”operator and a rando maccess characteristics of array , mine f requent k - itemset sL k , but does not have a generatio n of candidate itemset s ,which more efficient generate f requent k - itemset s.database D o nce , to map t he database to t he t wo - di2mensio nal Bool array , t hen mine allf ro m t he array.f requent itemset sApriori Algo22Description of ImprovedrithmAlgorit hm uses t he vertical exp ress for mat ofdatabases , t hereby reducing t he time and space cost . Database , t here are t wo exp ress for mat : t he horizo ntal data for mat and vertical data for mat ( P. Shenoy5 and ot hers have verified t he vertical data for mat t hat is bet ter t han t he horizo ntal data for mat) . Table 1 and table 2 givet wo exp ress for mat s. Based o n t he vertical data for mat al2gorit hm is p roposed in t his paper : Each item is exp ressedwit h a binary array element s. Let Tdarray mn ist he Bool t wo - dimensio nal array t hat exp ress t he entiredatabase , t hat is :I u T vI u |T v, m , v = 0 , 1 , 2 ,10TDarray uv =In which u = 0 , 1 , 2 , n .The Applicatio n of Bool array for mat aims to savetime of calculating t he support count , and reduces t he number of scaning array element s. Because t he Bool array have a rando m access feat ures and can perfor m a simply1Problem DescriptionLet D ,be a set of t ransactio n database ;let I = I0 ,bit wise “and ”operator . Therefore get bet ter timespace efficiency. Table 1 Ho rizo nt al dat a fo r mat and, I m be a set of items , Ii is t he first i item ,LetI1 , I2 ,T = T0 , T1 , T2 , T n be a set of t ransactio ns in D ,Ti is t he first i t h t ranscatio n ,which by I t he n - co mpo2T IDItemsT 0T 1I 0 , I 1 , I 2 , I 4 , I 5I 1 , I 3 , I 4nent s t hat can be exp ressed as T i = t i d , x 1 , x 2 , x 3 ,x n ,of which x i I ( 1 i n) , n is t he size of t ransac2tio ns , t i d is a t ransactio n identifier . A set of items is re2 ferred to be as itemset . An itemset t hat co ntains k items is a k - itemset . The occurrence f requency of an itemset ist he number of t ransactio nt s t hat co ntain t he itemset ,t hat is also known , simply , as t he f requency , support count of t he itemset ,if t he support count of an itemset I satisfies ap respecified minimum support t hreshold , t hen I is a f re2quency itemset , t he set of f requency k - itemset is co m2TI , I20 2T 3T 4T 5T 6I 1 , I 2 , I 4I 0 , I 1 , I 2I 1I 3 , I 4 , I 5 T 7 I 1 , I 2 , I 3 , I 4 Table 2Vertical data for matItemTidsetI 0I 1I 2I 3I 4I 5T 0 , T 2 , T 4T 0 , T 1 , T 3 , T 4 , T 5 , T 7T 0 , T 2 , T 3 , T 4 , T 7T 1 , T 6 , T 7T 0 , T 1 , T 2 , T 3 , T 6 , T 7T 0 , T 7mo nly denoted as L k . e . g. X = I1 , I2 , I3 , Ik , ofwhich sup ( X ) exp ressed t he t ransactio n number of X , if sup ( X) min - sup ,t hen itemset X is f requency k - itemset which be known L k . In t his paper , t he task is to de2sign an algorit hm in a given user - defined minimum sup2port and minimum co nditio ns , o nly by scanning ofAlgorithms - Relevant Funda men22 . 1Description oftal PropertyIn order to bet ter explain algorit hms , now relates toalgorit hms - relevant f undamental p roperties and charac2teristics described as follows :Propert y 1 :All no nempt y subset s of a f requent item2 set must also be f requent , and all supereset s of a no nf re2 quentcy itemset must also be not f requent .Propert y 2 : if t he item l i and l j of t he f requent ( k -1 - ) itemset s are not satisfied wit h t he joining co nditio n2 al ,t hen t he item l i and all items af ter item l j are not satis2 fied wit h t he joining co nditio nal .This p ropert y is a kind of optimizatio n of t he join co nditio ns for Ap riori algorit hm ,because all items of item2set about Ap riori algorit hm are sorted in lexicograp hic or2The database is mapped to t he t wo - dimensio nalBool array by scanning t he database o nce . At t he same time ,define a o ne - dimensio nal array which will stat t he number of t ransactio n co ntains o ne item in database . Then in accordance wit h t he result of t he stating to gener2 ate t he f requent 1 - itemset L 1 , and let candidate f re2 quent1 - itemset C1 = L 1 . The join step .The join , Ck - 1 Ck - 1 ( k f ro m 2 to start ) is per2for med ,generating t he set of f requent k - itemset s wit h2out t he Candidate k - itemset , where members ofCk - 1are joinable if t heir first ( k - 2) items are in co mmo n andt he p ropert y 2 is satisfied by t he first k - 1 items. At t he same time ,calculate t he support count of itemset c gener2, l k - 1 , each l i L k - 1der . Now let L k - 1 = l 1 , l 2 , l 3 ,and each l j L k - 1 ( 1 i j k - 1) ,if t here is not havet he equatio n of l i 1 = l j 1 l i 2 = l j 2 l i k- 2 = l j k - 2 l i k - 1 l j k - 1 ,t hen t he equa2tio n of l i 1 = l j + 1 1 l i 2 = l j + 1 2 l i k -2 = l j + 1 k - 2 l k - 1 l j + 1 k - 1 is not satis2 fied. Therefore ,it is not need to deter mine t he possibilit y of joining of t he item l i and all items af ter t he item of l j .So make use of itemset s characteristics bet ween t he order2 ly operatio n to reduce t he number of co nnectio ns in order to imp rove algorit hm efficiency.Propert y 3 : If let L k - 1 is a set of f requent k - 1 itemset where each itemset ci is a set of item and Ij cisuch t hat | L k - 1 ( Ij ) | k - 1 (of which | L k - 1 ( Ij ) | ex2 p ress t he number of t he occurrence in f requent itemset s L k - 1 ) ,t hen all candidate k itemset s generated by All ele2 ment s of no n - co nnect to c in L k - 1 are not f requent itemset s.Use t he p ropert y to reduce t he size of f requent ( k -1) - itemset s so as to reduce t he number of co mpariso n of generating f requent k - itemset s. So before generating L k , L k - 1 can be perfor med t he p runing t reat ment . Be2 cause if t he items Ij in t he collectio n appear in L k - 1 is lesst han t he number of K - 1 , t hen af ter t he join , L k - 1 L k - 1 ,support count s of all itemset s which include item Ij are less t hen k , so itemset s which include item Ij will be removed f ro m itemset s L k - 1 , Thereby imp roving t he effi2ciency of algorit hm., Ik ,support ( c)ated. For examble let c = I0 , I1 , I2 ,TDarray 1TDarray 2= sum ( TDarray0TDarray k ) = TDarray 0 0TDarray k 0 + TDarraTDarray 1 0y 0 1TDarray 1 1TDarrayn TDar2k 1 + TDarray 0n TDarray kn . If su pport ( c) ray 1min - sup ,t hen L k = L k c is perfor med. The p rune step .Based o n t he p ropert y 3 , counting t he number t hatI u ( I u I)t he array ofoccurrence in L k and recording t he result inA k u ( here A k u refers to t he numbert hat I u ( I u I) occurrence in L k ) . If A k u k , t henCk = L k - c is perfor med ,of which c L k and I u c .Repeat t his p rocess until t he candidate f requently generate k - itemset s denoted Ck . The p urpose of perfor ming p rune step is to reduce t he size of k - itemset s and to imp rove t he efficiency of algorit hms. And t hen k + + be perfor med and back to step s 2 until Ck 2 .Algorit hm design : Imp roved Ap riori Algorit hmInp ut :3 D : a database of t ransactio ns ;3 min - sup :t he minimum support count t hreshold. Outp ut : L ,f requent item- set s in D .Met hod :( 1) for (i = 0 ;i = m ;i + + ) / / initializatio n for (j = 0 ;j = n ;j + + )2 . 2Algorithm Implementation Steps and AlgorithmDesignThe imp roved Ap riori algorit hm of mining f requent( 2) if ( Ii Tj) TDarrayij = 1 ;( 3 ) B i + + ; / / count t he t ransactio n numberwhich include item Ii( 4) else TDarrayiitemset s will be divided into t hree perfor ming step s : The initializatio n step .j = 0 ;(5) for (j = 0 ;j = min - sup) L 1 = L1 Ij ;(7) else delete TDarrayi ;/ / remove t he row value of unf ruitf ul item , packed array( 8) f ree (B) ; / / saving space( 9) C1 = L1 ;( 10) for ( k = 2 ;| Ck| 2 ; k + + ) ( 11) L k = find - f requent - k - itemset s ( Ck - 1 , TDarray( 5) rut urn Ck ;3Algorithm Analysis3 . 1Exa mples AnalysisIn order to bet ter explain t he algorit hm implementa2 tio n p rocess , now takes 8 records f ro m t he database to show . Transactio n database as shown in table 2 . The p re2 for ming p rocess is as follows :The t ransactio ns in D are scaned in order to ini2tializatio n array TDarray4 ,10 , t hat correspo nds to t he value of t he array element s such as shown in table 3 , att he same time count s t he number of occurrences indatabase of each item and records t he result in t he array ofm n ,min- sup) ;/ / generate f requent k - itemset( 12) for each itemset c L k( 13) for each item Ij c(14) Ak j = I j . count + + ; / /Statistics t he timest hat Ij ( Ij I) appear in L k and records t he result in t he ar2ray of Ak uB 4in order to generates t he f requent 1 - itemset .(15 ) Ck =TDarray - gen ( L k , Ak m)/ /p runeTable 3Array element value listsetp :call porcedure TDarray - genT 0T 1T 2T 3T 4T 5T 6T 7( 16)( 17) ( 18)If ( | Ck| 2)Ret urn L k - 1 ;I 0I 1I 2I 3I 4I 5111011010110101000011010111000010000000110011111p rocedure find - f requent - k - itemset s ( Ck - 1 , TDar2- sup)raym n ,min( 1) for each itemset li Ck - 1( 2) for each itemset lj Ck - 1Suppose t hat t he minimum support count t hresh2old required 2 , t hat is , min - sup = 2 ( Here t he corre2spo nding relative support is 2/ 8 = 25 %) . The set of f re2quent 1 - itemset s is denoted L 1 , t hat can be deter mined( 3) if (li 1 = lj 1li 2 = lj 2j k -li 3 = lj 3lili k - 1 lk - 2 = lj k - 2p ropert y 21)/ / based o n 4 ( 4) c = li lj/ / join setp :( 5) for ( u = 0 ; u = n ; u + + ) (6) if ( Iu c) C n = TDarravalue of t he first item to array Cn( 7) break ;( 8) for (j = u + 1 ;j = n ;j + + ) ( 9) if ( Ij c)by scaning array B. L 1 = I0 , I1 , I2 ,C1 = L 1 . f ree B 4 .I3 , I4 , I5 . Lety u ; / / copy t he The join , C1 C1 is perfor med in order togenerate t he set off requent 2 - itemset s which is denoted L 2 . L 2 = I0 , I1 , I0 , I2 , I1 , I2 , I1 , I3 , I1 , I4 , I2 , I4 , I3 , I4 , I4 , I5 Based o n p ropert y 3 , p runes L 2 , t hen generetest he set of candidate f requent 2 - itemset s t hat is denoted( 10)( 11)Vmat rix j ;Cn = Cnd = support ( Cn ) ;A2 5C2 ,because min - sup)( 13) L k = c L k ; / / generate f requent k - itemsetinclude item I2 will be removed f ro m L 2 . So C2 = I0 ,I1 , I0 , I2 , I1 , I2 , I1 , I3 , I1 , I4 , I2 , I4 , I3 , I4 The join , C2 C2 is perfor med in order togenerate t he set of f requent 3 - itemset s which is denotedL 3 .Before perfor ming t he C2 C2 , t he items in t he set of C2 are checked in l 1 , l 2 , l 3 , l 4 , l 5 , l 6 , l 7 , we find l 1 and l 3 not satisfied t he co nnectio n co nditio ns of algorit hm( 14)( 15) ( 16)else break ; / / stop Co mpariso n and joinp rocedure TDarray - gen ( Ck ,L k ,Ak m( 1) for (j = 0 ;j Ak . legt h ;j + + )( 2) If ( Ak j k)( 3) for each c L k and Ij c( 4) Ck = L k - c ;Fig 1Generatio n of f requent itemset and candidate itemset ,where min- sup is 2find- f requent - k - itemset s ( ) ,t hen t he join operatio n of l 1and l 4 , l 5 , l 6 , l 7 not be perfor med ,too . So we can break in it ,perfor m l 2 l 3 and ot hers. In t he end L 3 = I0 ,I1 , I2 , I1 , I2 , I4 , I1 , I3 , I4 .By setp ,can see t he set of candidate f requent 3- itemset s is empt y ,so t he p rocess is stop .We use figure 1 to illust rate t he imp roved Ap riori al2gorit hm for finding f requent itemset s in D .Fro m t he algorit hm design and algorit hm perfor m - ing p rocess analysis , imp roved Ap riori algorit hm ( be de2 noted A TMC1) , mainly f ro m t he following point s bet tert han t he classic Ap riori algorit hm ( be denoted A TMC2) :t he number of scanning database ; t he number of generat2 ing a set of candidate itemset s ; t he number of join and co mparing and space utilizatio n . In t his paper , t he imple2 mentatio n of t he example described in table

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论