版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1
Chapter5MiningAssociationRulesinLargeDatabases第五章
挖掘大型数据库中的关联规则2Chapter5:MiningAssociationRulesinLargeDatabases3AssociationRuleGivenasetoftransactions,findrulesthatwillpredicttheoccurrenceofanitembasedontheoccurrencesofotheritemsinthetransactionMarket-BaskettransactionsExampleofAssociationRules{Milk}
{Bread},
{Beer}
{Diaper},Implicationmeansco-occurrence,notcausality!4AssociationRuleMiningbuys(x,“computer”)
buys(x,“software”)[5%,70%]associationrulesareinterestingifsatisfyingbothaminimumsupportthresholdandaminimumconfidencethresholdstrongassociationrules(强关联规则)Ruleform:A
B[support,confidence]support(A
B)=P(A
B),confidence(A
B)=P(B|A)CustomerbuyssoftwareCustomerbuysbothCustomerbuyscomputer5RuleMeasures:SupportandConfidenceFindalltherulesX&YZwithminimumconfidenceandsupportsupport,s,probabilitythatatransactioncontains{X,Y,Z}confidence,c,
conditionalprobabilitythatatransactionhaving{X,Y}alsocontainsZLetminimumsupport50%,andminimumconfidence50%,wehaveAC(50%,66.6%)CA(50%,100%)6AssociationRuleMinedfromTransactionalDatabases(事务数据库)Given:I={i1,i2,…,im}asetofitems(项)D(task-relevantdata):asetofDBtransactions(事务集合)EachtransactionTIEachtransactionisassociatedwithanidentifierTIDA,B:asetofitems,
A
I;B
IAnassociationruleisanimplicationoftheformAB(whereAB=)whichholdsinthetransactionsetDwithsupportandconfidence7Definition:FrequentItemsetItemsetAcollectionofoneormoreitemsExample:{Milk,Bread,Diaper}k-itemsetAnitemsetthatcontainskitemsSupportcount()FrequencyofoccurrenceofanitemsetE.g.({Milk,Bread,Diaper})=2SupportFractionoftransactionsthatcontainanitemsetE.g.s({Milk,Bread,Diaper})=2/5FrequentItemsetAnitemsetwhosesupportisgreaterthanorequaltoaminsupthreshold(阈值)8Definition:AssociationRuleAssociationRuleAnimplicationexpressionoftheformX
Y,whereXandYareitemsetsExample:
{Milk,Diaper}
{Beer}
RuleEvaluationMetricsSupport(s)FractionoftransactionsthatcontainbothXandYConfidence(c)MeasureshowoftenitemsinY
appearintransactionsthat
containXExample:9AssociationRuleMining
TaskGivenasetoftransactionsT,thegoalofassociationruleminingistofindallruleshavingsupport≥minsupthresholdconfidence≥minconfthresholdTwo-stepprocess
ofassociationrulesmining:
1.Findallfrequentitemsets2.Generatestrongassociationrulesfromthefrequentitemsets10AssociationRuleMining:ARoadMapAssociationrulescanbeclassifiedinvariouswaysasfollows:Booleanvs.quantitativeassociations(Basedonthetypesofvalueshandled)buys(x,“computer”)
buys(x,“software”)[5%,70%]age(x,“30..39”)^income(x,“42..48K”)
buys(x,“PC”)[1%,75%]Singledimensionvs.multipledimensionalassociations(seeex.Above)Singlelevelvs.multiple-levelanalysisage(x,“30..39”)
buys(x,“laptopcomputer”)age(x,“30..39”)
buys(x,“computer”)11Application:WebUsageMiningWebUsageMining:miningclick-stream(点击流)dataAnalyseaccess(browsing)patternsofeachuseratatimeWebsiterestructuresitselfautomaticallybylearningfromuseraccesspatternsExample:WEBPAGES={A,B,C,D,E}Session1={U1,<A,B,C>};Session2={U2,<A,C>};Session3={U3,<B,C,E>};Session4={U1,<A,C,D,C,E>};=>Frequentitemset?12WebServerLog13Application:SequentialPatternsCustomerShoppingSequences(creditcards)Firstbuycomputer,thenCD-ROM,andthendigitalcameraPersonalRecommendationSystemSequenceofinitiatingeventscaused<{cloggedresin}{outletvalveclosure}{lossoffeedwater}{condenserpolisheroutletvalveshut}{boosterpumpstrip}{mainwaterpumptrips}{mainturbinetrips}{reactorpressureincreases}>14Chapter5:MiningAssociationRulesinLargeDatabases15MiningAssociationRulesTwo-stepapproach:FrequentItemsetGenerationGenerateallitemsetswhosesupport
minsupRuleGenerationGeneratehighconfidencerulesfromeachfrequentitemset,whereeachruleisabinarypartitioningofafrequentitemsetFrequentitemsetgenerationisstillcomputationallyexpensive16FrequentItemsetGenerationGivenditems,thereare2dpossiblecandidateitemsets17FrequentItemsetGenerationApproach:EachitemsetinthelatticeisacandidatefrequentitemsetCountthesupportofeachcandidatebyscanningthedatabaseMatcheachtransactionagainsteverycandidateComplexity~O(NMW)=>ExpensivesinceM=2d
!!!18MiningAssociationRules—AnExampleForruleA
C:support=50%confidence=66.6%Apriori:aninfluentialalgorithmforfindingfrequentitemsetsbyusingcandidategenerationProperty:
AnysubsetofafrequentitemsetmustbefrequentMin.support50%Min.confidence50%19MiningFrequentItemsetsApriori:aniterativeapproach(迭代方法)knownaslevel-wisesearch(逐层搜索)Asubsetofafrequentitemsetmustalsobeafrequentitemseti.e.,if{AB}is
afrequentitemset,both{A}and{B}shouldbeafrequentitemsetIterativelyfindfrequentitemsetswithcardinalityfrom1tok(k-itemsets)i.e.,k-itemsetsareusedtoexplore(k+1)-itemsets(Joinoperation)Usethefrequentitemsetstogenerateassociationrules.20FoundtobeInfrequentIllustratingAprioriPrinciplePrunedsupersets21TheAprioriAlgorithmJoinStep(连接):CkisgeneratedbyjoiningLk-1withitselfPruneStep(剪枝):CkisasupersetofLk.Itsmembermayormaynotbefrequent.Any(k-1)-itemsetthatisnotfrequentcannotbeasubsetofafrequentk-itemsetApriorialgorithm:Ck:Candidateitemsetofsizek(inorder)Lk:frequentitemsetofsizek(inorder)D:transactionsetinDB(task-relevantdata)
L1={frequentitems};for(k=1;Lk!=
;k++)dobegin//iterative(迭代)
Ck+1=candidatesgeneratedfromLk;
foreachtransactiontinDdo//scanDincrementthecountofallcandidatesinCk+1thatarecontainedint
Lk+1=candidatesinCk+1withmin_support
endreturn
k
Lk;22TheAprioriAlgorithm—ExampleDatabaseDScanDC1L1L2C2C2ScanDC3L3ScanD23HowtoGenerateCandidates?SupposetheitemsinLk-1arelistedinanorder(次序)Step1:self-joiningLk-1(k>2)insertintoCkselectp.item1,p.item2,…,p.itemk-1,q.itemk-1fromLk-1p,Lk-1qwherep.item1=q.item1,…,p.itemk-2=q.itemk-2,p.itemk-1<q.itemk-1Step2:pruningforallitemsetscinCk
doforall(k-1)-subsetssofcdoif(sisnotinLk-1)thendeletecfromCk24ExampleofGeneratingCandidatesL3={abc,abd,acd,ace,bcd}Self-joining:L3*L3abcdfromabcandabdacdefromacdandacePruning:acdeisremovedbecauseadeisnotinL3L4
={abcd}25MaximalFrequentItemsetBorderInfrequentItemsetsMaximalItemsetsAnitemsetismaximalfrequentifnoneofitsimmediatesupersetsisfrequent26ClosedFrequentItemsetAnfrequentitemsetisclosedifnoneofitsimmediatesupersetshasthesamesupportastheitemset27MaximalvsClosedFrequentItemsetsMinimumsupport=2#Closed=9#Maximal=4ClosedandmaximalClosedbutnotmaximal28MaximalvsClosedItemsets29IsAprioriFastEnough?—PerformanceBottlenecksThecoreoftheApriorialgorithm:Usefrequent(k–1)-itemsetstogeneratecandidatefrequentk-itemsetsUsedatabasescanandpatternmatchingtocollectcountsforthecandidateitemsetsThebottleneckofApriori:candidategenerationHugecandidatesetsMultiplescansofdatabase(transactionset)Question:Canwedesignamethodthatminesthefrequentitemsetswithoutcandidategeneration?30MiningFrequentPatternswithoutCandidateGeneration?Aninterestingmethod:frequent-patterngrowth(频繁模式增长法)Compressalargedatabaseintoacompact,Frequent-Patterntree
(FP-tree)structurehighlycondensed,butcompleteforfrequentpatternminingavoidcostlydatabasescansDevelopanefficient,FP-tree-basedfrequentpatternmining31ConstructFP-treefromaTransactionDB{}f:4c:1b:1p:1b:1c:3a:3b:1m:2p:2m:1HeaderTableItemfrequencyheadf 4c 4a 3b 3m 3p 3min_support=0.5TID Itemsbought (ordered)frequentitems100 {f,a,c,d,g,i,m,p}
{f,c,a,m,p}200 {a,b,c,f,l,m,o}
{f,c,a,b,m}300
{b,f,h,j,o}
{f,b}400
{b,c,k,s,p}
{c,b,p}500
{a,f,c,e,l,p,m,n}
{f,c,a,m,p}Steps:ScanDBonce,findfrequent1-itemset(singleitempattern)OrderfrequentitemsinfrequencydescendingorderScanDBagain,constructFP-tree32MiningFrequentPatternsUsingFP-treeMethodForeachitemoftheheadertable,constructitsconditionalpattern-base(条件模式基),andthenitsconditionalFP-tree(条件FP树)
MajorstepstomineFP-tree(FP树挖掘)ConstructconditionalpatternbaseintheFP-treeConstructconditionalFP-treefromconditionalpattern-baseGeneratethefrequentpatternsfromconditionalFP-trees33Step1:FromFP-treetoConditionalPatternBaseConditionalpatternbasesitem cond.patternbasec f:3a fc:3b fca:1,f:1,c:1m fca:2,fcab:1p fcam:2,cb:1{}f:4c:1b:1p:1b:1c:3a:3b:1m:2p:2m:1HeaderTableItemfrequencyheadf 4c 4a 3b 3m 3p 334Step2:ConstructConditionalFP-tree
Foreachpattern-baseAccumulatethecountforeachiteminthebaseConstructtheFP-treeforthefrequentitemsofthepatternbasem-conditionalpatternbase:fca:2,fcab:1{}f:3c:3a:3m-conditionalFP-treeAllfrequentpatternsconcerningmfm,cm,am,fcm,fam,cam,fcam
{}f:4c:1b:1p:1b:1c:3a:3b:1m:2p:2m:1HeaderTableItemfrequencyheadf 4c 4a 3b 3m 3p 335MiningFrequentPatternsbyCreatingConditionalPattern-BasesEmptyEmptyf{(f:3)}|c{(f:3)}c{(f:3,c:3)}|a{(fc:3)}aEmpty{(fca:1),(f:1),(c:1)}b{(f:3,c:3,a:3)}|m{(fca:2),(fcab:1)}m{(c:3)}|p{(fcam:2),(cb:1)}pConditionalFP-treeConditionalpattern-baseItem36Step3:GeneratingtheFrequentPatternsSupposeanFP-treeThasasinglepathPThecompletesetoffrequentpatternofTcanbegeneratedbyenumerationofallthecombinationsofthesub-pathsofP{}f:3c:3a:3m-conditionalFP-treeAllfrequentpatternsconcerningmfm,cm,am,fcm,fam,cam,fcam
37PrinciplesofFrequentPatternGrowthPatterngrowthpropertyLetbeafrequentitemsetinDB,Bbe'sconditionalpatternbase,andbeanitemsetinB.ThenisafrequentitemsetinDBiffisfrequentinB.“abcdef
”isafrequentpattern,ifandonlyif“abcde”isafrequentpattern,and“f
”isfrequentinthesetoftransactionscontaining“abcde
”38Chapter5:MiningAssociationRulesinLargeDatabases39Multiple-LevelAssociationRulesItemsoftenformhierarchy.Itemsatthelowerlevelareexpectedtohavelowersupport.Strongassociationrulesdiscoveredathighconceptslevelsmayrepresentcommonsenseknowledge.--low-levelassociationrules:
IBMdesktopcomputer
Sonyb/wprinter[6%,50%].--high-level“strong”rules:IBMdesktopcomputer
b/wprinter[10%,60%].Computer
b/wprinter[20%,70%].TIPItems
T1IBMdesktopcomputer,Sonyb/wprinterT2Microsofteducationalsoftware,MicrosoftfinancialsoftwareT3IBMdesktopcomputer,MicrosoftfinancialsoftwareT4IBMdesktopcomputer……40MiningMulti-LevelAssociationsRulesregardingitemsetsatappropriatelevelscouldbequiteuseful.TransactionDBcanbeencodedbasedonlevelsWecanexploresharedmulti-levelminingallSoftwareComputerPrinterDesktopLaptopeducationFinanceColorb/wIBM………Microsoft………HPSony……Level0Level1Level2Level3……41Multi-LevelMining:ProgressiveDeepeningAtop-down,progressivedeepeningapproach:Firstminehigh-levelfrequentitems:Computer(15%),printer(10%)Thenminetheirlower-levelfrequentitems:Desktopcomputer(5%),b/wprinter(4%)Foreachlevel:usingApriori-likealgorithmsfindingfrequentitemsetsthesamemin_supportacrossmulti-levels(uniformsupport一致支持度)reducedmin_supportatlowerlevels(reducedsupport递减支持度)42Multi-levelAssociation:UniformSupportUniformSupport:thesameminimumsupportforalllevels+Oneminimumsupportthreshold(simple).
Noneedtoexamineitemsetscontaininganyitemwhoseancestorsdonothaveminimumsupport.–Lowerlevelitemsdonotoccurasfrequently.IfsupportthresholdtoohighmisslowlevelassociationstoolowgeneratetoomanyhighlevelassociationsComputer[support=10%]LaptopComputer[support=6%]DesktopComputer[support=4%]Level2min_sup=5%Level1min_sup=5%43Multi-levelAssociation:ReducedSupportReducedSupport:reducedminimumsupportatlowerlevelsThereare4searchstrategies:Level-by-levelindependent(逐层独立)Level-crossfilteringbysingleitem(层交叉单项过滤)Level-crossfilteringbyk-itemset(层交叉k-项过滤)Controlledlevel-crossfilteringbysingleitem(受控的层交叉单项过滤)Level1min_sup=5%Level2min_sup=3%Computer[support=10%]LaptopComputer[support=6%]DesktopComputer[support=4%]44Multi-levelAssociation:ReducedSupportLevel-by-levelindependent(逐层独立)Strategy--full-breadthsearch(完全宽度搜索),nobackgroundknowledgeoffrequentitemsetsisusedforpruningLevel-crossfilteringbysingleitem(层交叉单项过滤)--Anitemattheithlevelisexaminediffitsparentnodeatthe(i+1)thlevelisfrequentComputer[support=10%]Level1min_sup=12%Level2min_sup=3%
laptop(notexamined)desktopcomputer(notexamined)45Multi-levelAssociation:ReducedSupportLevel-crossfilteringbyk-itemset(层交叉k-项集过滤)--Ank-itemsetattheithlevelisexaminediffitscorrespondingparentk-
itemsetatthe(i+1)thlevelisfrequentComputerandprinter[support=7%]Laptopcomputerandb/wprinter[support=1%]Laptopcomputerandcolorprinter[support=2%]Desktopcomputerandb/wprinter[support=1%]Desktopcomputerandcolorprinter[support=2%]Level1min_sup=5%Level2min_sup=2%46Multi-levelAssociation:ReducedSupportControlledlevel-crossfilteringbysingleitem(受控的层交叉单项过滤)--levelpassagethreshold(层传递阈值):thisstrategyallowsthechildrenofitemsthatdonotsatisfythemin_supthresholdtobeexaminediftheseitemssatisfythelevelpassagethresholdComputer[support=10%]LaptopComputer[support=6%]DesktopComputer[support=4%]Level1min_sup=12%Level_passage_sup=8%Level2min_sup=3%47Multi-levelAssociation:RedundancyFilteringSomerulesmayberedundantdueto“ancestor”relationships(祖先联系)betweenitems.Exampledesktopcomputerb/wprinter[support=8%,confidence=70%]IBMdesktopcomputerb/wprinter[support=2%,confidence=72%]Wesaythefirstruleisanancestorofthesecondrule.Aruleisredundantifitssupportisclosetothe“expected”value,basedontherule’sancestor.48Chapter5:MiningAssociationRulesinLargeDatabases49Multi-DimensionalAssociation:ConceptsSingle-dimensionalrules(intra-dimension维内关联规则):
miningrulesintransactionaldatabases
buys(X,“IBMdesktopcomputer”)buys(X,“Sonyb/wprinter”)Multi-dimensionalrules:
2dimensionsorpredicatesInter-dimensionassociationrules(维间关联规则)
miningrulesinrelationaldatabasesordatawarehousesage(X,”20-25”)
occupation(X,“student”)buys(X,“laptop”)hybrid-dimensionassociationrules(repeatedpredicates)age(X,”20-25”)
buys(X,“laptop”)buys(X,“b/wprinter”)CategoricalAttributes(分类属性)finitenumberofpossiblevalues,noorderingamongvaluesQuantitativeAttributes(量化属性)numeric,implicitorderingamongvalues50TechniquesforMiningMDAssociationsSearchforfrequentk-predicateset(频繁k-谓词集):Example:{age,occupation,buys}isa3-predicateset.Techniquescanbecategorizedasthreeapproachesregardingthetreatmentofquantitativeattributes(e.g.,howagearetreated)。1.Usingstaticdiscretization(静态离散化)ofquantitativeattributesQuantitativeattributesarestaticallydiscretizedbyusingpredefinedconcepthierarchies.2.Quantitativeassociationrules(量化关联规则)Quantitativeattributesaredynamicallydiscretizedintobasedonthedistributionofthedata.3.Distance-basedassociationrulesThisisadynamicdiscretizationprocessthatconsidersthedistancebetweendatapoints.51StaticDiscretizationofQuantitativeAttributesDiscretizedpriortominingusingconcepthierarchy.Numericvaluesarereplacedbyranges.Inrelationaldatabase,findingallfrequentk-predicatesetsbythestrategysimilartotheApriorialgorithm.Datacubeiswellsuitedformining,Thecellsofann-dimensional
cuboidareusedtostorethecountsofthecorrespondingpredicatesets.Miningfromdatacubes
canbemuchfaster.(income)(age
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025《谏太宗十思疏》内容结构课件
- 2025《祝福》知识分子的旁观课件
- 第6章变量之间的关系 基础测试卷(含解析) 2025-2026学年七年级下册数学北师大版
- 炼焦安全规程培训
- 初中英语必背核心词大全
- 检修部电气二班班长安全责任制培训课件
- 2026年广东省汕头市单招职业倾向性测试题库附答案详解(轻巧夺冠)
- 2026年山西省吕梁市单招职业倾向性考试题库附答案详解(满分必刷)
- 2026年广东金融学院单招职业适应性测试题库带答案详解ab卷
- 2026年广东科贸职业学院单招职业适应性测试题库附参考答案详解(综合卷)
- 《光伏电站项目全过程管理手册》(第一分册前言、开发、测算、审批、配储)
- 护理实践中的慢性病管理和康复服务
- 个人信用的重要性
- DZ/T 0221-2006崩塌、滑坡、泥石流监测规范
- T/CCMA 0133-2022高尔夫球车
- DB31/T 634-2020电动乘用车运行安全和维护保障技术规范
- 肾错构瘤破裂出血护理查房
- 消化道出血的业务学习课件
- 加盟店管理制度
- 职业学院教学管理制度汇编
- 《自动化生产线安装与调试》课件-项目二 供料单元安装与调试
评论
0/150
提交评论