




已阅读5页,还剩135页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1,关联规则AssociationRules,2,内容提要,引言Apriori算法Frequent-patterntree和FP-growth算法多维关联规则挖掘相关规则基于约束的关联规则挖掘总结,3,关联规则,关联规则表示了项之间的关系示例:cereal,milkfruit“买谷类食品和牛奶的人也会买水果.”商店可以把牛奶和谷类食品作特价品以使人们买更多的水果.,4,市场购物篮分析,分析事务数据库表我们是否可假定?Chips=SalsaLettuce=Spinach,5,基本概念,通常,数据包含:,事务ID,项的子集,6,关联规则挖掘,在事务数据库,关系数据库和其它信息库中的项或对象的集合之间,发现频繁模式,关联,相关,或因果关系的结构.频繁模式:数据库中出现频繁的模式(项集,序列,等等),7,基本概念,项集事务关联规则事务数据集(例如右图)事务标识TID:每一个事务关联着一个标识,8,度量有趣的关联规则,支持度sD中包含A和B的事务数与总的事务数的比值规则AB在数据集D中的支持度为s,其中s表示D中包含AB(即同时包含A和B)的事务的百分率.,9,度量有趣的关联规则,可信度cD中同时包含A和B的事务数与只包含A的事务数的比值,规则AB在数据集D中的可信度为c,其中c表示D中包含A的事务中也包含B的百分率.即可用条件概率P(B|A)表示.confidence(AB)=P(B|A)条件概率P(B|A)表示A发生的条件下B也发生的概率.,10,度量有趣的关联规则,关联规则根据以下两个标准(包含或排除):最小支持度表示规则中的所有项在事务中出现的频度最小可信度-表示规则中左边的项(集)的出现暗示着右边的项(集)出现的频度,11,市场购物篮分析,I是什么?事务IDB的T是什么?s(Chips=Salsa)是什么?c(Chips=Salsa)是什么?,12,Stepone:频繁项集,项集任意项的集合k-项集包含k个项的项集频繁(或大)项集满足最小支持度的项集若I包含m个项,那么可以产生多少个项集?,13,Steptwo:强关联规则,给定一个项集,容易生成关联规则.项集:Chips,Salsa,BeerBeer,Chips=SalsaBeer,Salsa=ChipsChips,Salsa=Beer强规则是有趣的强规则通常定义为那些满足最小支持度和最小可信度的规则.,14,关联规则挖掘,两个基本步骤Stepone:找出所有的频繁项集满足最小支持度Steptwo:找出所有的强关联规则由频繁项集生成关联规则保留满足最小可信度的规则,15,内容提要,引言Apriori算法Frequent-patterntree和FP-growth算法多维关联规则挖掘相关规则基于约束的关联规则挖掘总结,16,生成频繁项集,Navealgorithmn-|D|foreachsubsetsofIdol-0foreachtransactionTinDdoifsisasubsetofTthenl-l+1ifminimumsupport(class,1)supt=20%(rulesupCount/|D|)*100%confd=66.7%(rulesupCount/condsupCount)*100%,86,RG:算法,1F1=large1-ruleitems;2CAR1=genRules(F1);3prCAR1=pruneRules(CAR1);/counttheitemandclassoccurrencestodeterminethefrequent1-ruleitemsandpruneit4for(k=2;Fk-1;k+)doCk=candidateGen(Fk-1);/generatethecandidateruleitemsCkusingthefrequentruleitemsFk-16foreachdatacasedDdo/scanthedatabaseCd=ruleSubset(Ck,d);/findalltheruleitemsinCkwhosecondsetsaresupportedbyd8foreachcandidatecCddo9c.condsupCount+;10ifd.class=c.classthenc.rulesupCount+;/updatevarioussupportcountsofthecandidatesinCk11end12end,87,RG:算法(cont.),13Fk=cCk|c.rulesupCountminsup;/selectthosenewfrequentruleitemstoformFk14CARk=genRules(Fk);/selecttheruleitemsbothaccurateandfrequent15prCARk=pruneRules(CARk);16end17CARs=kCARk;18prCARs=kprCARk;,88,分类构建器M1:基本概念,给定两个规则riandrj,定义:rirj当ri的可信度大于rj的,或者它们的可信度相同,但ri的支持度大于rj的,或者它们的可信度与支持度都相同,但ri比rj生成的早.我们的分类器如下面格式所示:,whereriR,rarbifba,89,M1:三个步骤,基本思想是选择R中一个优先规则(highprecedence)的集合来覆盖D.对生成的规则集R排序.根据已排好的序列从R中选择为分类所用的规则并放入C每个选择的规则必须正确分类至少一个增加的事例(case).并选择默认的属性和计算误差.抛弃C中的不能改进分类器的准备度的规则.保留那些最小误差的规则的位置,抛弃序列中的其余规则.,90,M1:Algorithm,1R=sort(R);/Step1:sortRaccordingtotherelation“”2foreachrulerRinsequencedo3temp=;4foreachcasedDdo/gothroughDtofindthosecasescoveredbyeachruler5ifdsatisfiestheconditionsofrthen6stored.idintempandmarkrifitcorrectlyclassifiesd;7ifrismarkedthen8insertrattheendofC;/rwillbeapotentialrulebecauseitcancorrectlyclassifyatleastonecased9deleteallthecaseswiththeidsintempfromD;10selectingadefaultclassforthecurrentC;/themajorityclassintheremainingdata11computethetotalnumberoferrorsofC;12end13end/Step214FindthefirstrulepinCwiththelowesttotalnumberoferrorsanddropalltherulesafterpinC;15AddthedefaultclassassociatedwithptoendofC,andreturnC(ourclassifier)./Step3,91,M1:满足的两个条件,Eachtrainingcaseiscoveredbytherulewiththehighestprecedenceamongtherulesthatcancoverthecase.EveryruleinCcorrectlyclassifiesatleastoneremainingtrainingcasewhenitischosen.,92,MOUCLAS,Assumption:MOUCLASalgorithmassumesthattheinitialassociationrulescanbeagglomeratedintoclusteringregionsTheimplicationoftheformoftheMouclasPattern(socalledMP):Cluster(D)tywhereCluster(D)tisaclusterofD,t=1tom,andyisaclasslabel.,93,MOUCLAS,Definitions:FrequencyofMouclasPatternsAccuracyofMouclasPatternsReliabilityofMouclasPatternsTaskofMouclas:TodiscoverMPsthathavesupportandconfidencegreaterthantheuser-specifiedminimumsupportthreshold(calledminsup),andminimumconfidencethreshold(calledminconf)andminimumreliabilitythreshold(calledminR)respectively,andtoconstructaclassifierbaseduponMPs.,94,TheMOUCLASalgorithm,Twosteps:Step1.DiscoveryofthefrequentandaccurateandreliableMPs.Step2.Constructionofaclassifier,calledDe-MP,basedonMPs.,95,TheMOUCLASalgorithm,ThecoreofthefirststepintheMouclasalgorithmistofindallcluster_rulesthatsatisfyminsupandminconfandminR.LetCdenotethedatasetDafterdimensionalityreductionprocessing.Acluster_rulerepresentsaMP,namelyarule:clusety,whereclusetisasetofitemsetsfromaclusterCluster(C)t,yisaclasslabel,yY.,96,TheMOUCLASalgorithm,Algorithmofthefirststep:MouclasMiningfrequentandaccurateandreliableMouclaspatterns(MPs)Input:Atrainingtransactiondatabase,D;minimumsupportthreshold(minsup);minimumconfidencethreshold(minconf);minimumreliabilitythreshold(minR)Output:AsetoffrequentandaccurateandreliableMouclaspatterns(MPs),97,TheMOUCLASalgorithm,Methods:(1)Reducethedimensionalityoftransactionsd,whichefficientlyreducesthedatasizebyremovingirrelevantorredundantattributes(ordimensions)fromthetrainingdata,and(2)IdentifytheclustersofdatabaseCforalltransactionsdafterdimensionalityreductiononattributesAjindatabaseC,basedontheMountainfunction,whichisafuzzysetmembershipfunction,andspeciallycapableoftransformingquantitativevaluesofattributesintransactionsintolinguisticterms,and(3)GenerateasetofMPsthatarebothfrequentandaccurate,namely,whichsatisfytheuser-specifiedminimumsupport(calledminsup)andminimumconfidence(calledminconf)andminimumreliability(calledminR)constraints.,98,TheMOUCLASalgorithm,1X=reduceDim(I);/reducethedimensionalityonthesetofallitemsIofinD2Cluster(C)t=genCluster(C);/identifythecompleteclustersofC3foreachCluster(C)tdoE=genClusterrules(cluset,class);/generateasetofcandidatecluster_rules4foreachtransactiondCdo5Ed=genSubClusterrules(E,d);/findallthecluster_rulesinEwhoseclusetaresupportedbyd6foreacheEddo7e.clusupCount+;/accumulatetheclusupCountoftheclusetofcluster_rulee8ifd.class=e.classthene.cisupCount+/accumulatethecisupCountofcluster_ruleesupportedbyd9end10end11F=eEe.cisupCountminsupi;/constructthesetoffrequentcluster_rules12MP=genRules(F);/generateMPusingthegenRulesfunctionbyminconfandminR13end14MPs=MP;/discoverthefinalsetofMPs,99,TheMOUCLASalgorithm,ThetaskofthesecondstepinMouclasalgorithm:Usingaheuristicmethodtogenerateaclassifier,namedDe-MP,wherethediscoveredMPscancoverDandareorganizedaccordingtoadecreasingprecedencebasedontheirconfidenceandsupport.SupposeRbethesetoffrequentandaccurateandreliableMPswhicharegeneratedinthepaststep,andMPdefault_classdenotesthedefaultclass,whichhasthelowestprecedence.WecanthenpresenttheDe-MPclassifierintheformof,whereMPiR,i=1ton,MPaMPbifnba1anda,bi,CclusetofMPi,100,TheMOUCLASalgorithm,Algorithm:MouclasconstructingDe-MPClassifierInput:Atrainingdatabaseafterdimensionalityreduction,C;ThesetoffrequentandaccurateandreliableMouclaspatterns(MPs)Output:De-MPClassifierMethods:(1)IdentifytheorderofalldiscoveredMPsbasedonthedefinitionofprecedenceandsequencethemaccordingtodecreasingprecedenceorder.(2)DeterminepossibleMPsforDe-MPclassifierfromRfollowingthedescendingsequenceofMPs.(3)DiscardtheMPswhichcannotcontributetotheimprovementoftheaccuracyoftheDe-MPclassifierandkeepthefinalsetofMPstoconstructtheDe-MPclassifier.,101,TheMOUCLASalgorithm,1R=sort(R);/sortMPsbasedontheirprecedence2foreachMPRinsequencedo3temp=;4foreachtransactiondCdo5ifdsatisfiestheclusetofMPthen6stored.IDintemp;7ifMPcorrectlyclassifiesdthen8insertMPattheendofL;9deletethetransactionwhohasIDintempfromC;10selectingadefaultclassforthecurrentL;/determinethedefaultclassbasedonmajorityclassofremainingtransactionsinC11end12computethetotalnumberoferrorsofL;/computethetotalnumberoferrorsthataremadebythecurrentLandthedefaultclass13end14FindthefirstMPinLwiththelowesttotalnumberoferrorsanddiscardalltheMPsaftertheMPinL;15AddthedefaultclassassociatedwiththeabovementionedfirstMPtoendofL;16De-MPclassifier=L,102,ExampleofMOUCLASapplication,Thewellloggingdatasetsincludeattributes(wellloggingcurves)ofGR(gammaray),RDEV(deepresistivity),RMEV(shallowresistivity),RXO(flushedzoneresistivity),RHOB(bulkdensity),NPHI(neutronporosity),PEF(photoelectricfactor)andDT(sonictraveltime).AhypotheticallyusefulMPmaysuggestarelationbetweenwellloggingdataandtheclasslabelofoil/gasformationsince.,103,MiningDistance-basedAssociationRules,BinningmethodsdonotcapturethesemanticsofintervaldataDistance-basedpartitioning,moremeaningfuldiscretizationconsidering:density/numberofpointsinaninterval“closeness”ofpointsinaninterval,104,InterestingnessMeasure:Correlations(Lift),playbasketballeatcereal40%,66.7%ismisleadingTheoverallpercentageofstudentseatingcerealis75%whichishigherthan66.7%.playbasketballnoteatcereal20%,33.3%ismoreaccurate,althoughwithlowersupportandconfidenceMeasureofdependent/correlatedevents:lift,105,内容提要,引言Apriori算法Frequent-patterntree和FP-growth算法多维关联规则挖掘相关规则基于约束的关联规则挖掘总结,106,相关规则(CorrelationRules),“BeyondMarketBaskets,”Brinetal.假设执行关联规则挖掘,tea=coffee20%support80%confidence,but90%ofthepeoplebuycoffeeanyway!,107,相关规则,一种度量是计算相关性若两个随机变量A和B是统计独立的对tea和coffee:,108,相关规则,利用2统计检验来测试独立性设n为购物篮的总数设k为考虑的项的总数设r为一个包含项(ij,ij)的规则设O(r)表示包含规则r的购物篮的数量(即频率)对单个项ij,设Eij=O(ij)(反过来即为n-Eij)Er=n*Er1/n*Erk/n,109,相关规则,2统计量定义为LookupforsignificancevalueinastatisticaltextbookTherearek-1degreesoffreedomIftestfailscannotrejectindependence,otherwisecontigencytablerepresentsdependence.,110,示例,BacktoteaandcoffeeEt=25,Et=75,Ec=90,Ec=10Etc=100*25/100*90/100=22.5O(tc)=20Contrib.to2=(20-22.5)2/22.5=0.278Calculatefortheresttoget2=2.204Notsignificantat95%level(3.84fork=2)Cannotrejectindependenceassumption,111,兴趣度(Interest),If2testshowssignificance,thenwanttofindmostinterestingcell(s)intableI(r)=O(r)/ErLookforvaluesfarawayfrom1I(tc)=20/22.5=0.89I(tc)=5/2.5=2I(tc)=70/67.5=1.04I(tc)=5/7.5=0.66,112,2统计量的性质,上封闭性(Upwardclosed)若一个k-项集是相关的,则其所有的超集也是相关的.寻找最小的相关的项集没有子集是相关的能否将a-prioriand2统计量有效地结合Nogenerateandpruneasinsupport-confidence,113,其它度量(Measures),作用度(Lift)度量项之间的相关性,但没有检验,114,其它度量(Measures),可信度(Conviction)度量一个规则的强度,115,内容提要,引言Apriori算法Frequent-patterntree和FP-growth算法多维关联规则挖掘相关规则基于约束的关联规则挖掘总结,116,基于约束(Constraints)的数据挖掘,Findingallthepatternsinadatabaseautonomously?unrealistic!Thepatternscouldbetoomanybutnotfocused!DataminingshouldbeaninteractiveprocessUserdirectswhattobeminedusingadataminingquerylanguage(oragraphicaluserinterface)Constraint-basedminingUserflexibility:providesconstraintsonwhattobeminedSystemoptimization:exploressuchconstraintsforefficientminingconstraint-basedmining,117,数据挖掘中的约束(Constraints),Knowledgetypeconstraint:classification,association,etc.DataconstraintusingSQL-likequeriesfindproductpairssoldtogetherinstoresinVancouverinDec.00Dimension/levelconstraintinrelevancetoregion,price,brand,customercategoryRule(orpattern)constraintsmallsales(price$200)Interestingnessconstraintstrongrules:min_support3%,min_confidence60%,118,约束(Constrained)挖掘vs.基于约束的搜索,Constrainedminingvs.constraint-basedsearch/reasoningBothareaimedatreducingsearchspaceFindingallpatternssatisfyingconstraintsvs.findingsome(orone)answerinconstraint-basedsearchinAIConstraint-pushingvs.heuristicsearchItisaninterestingresearchproblemonhowtointegratethemConstrainedminingvs.queryprocessinginDBMSDatabasequeryprocessingrequirestofindallConstrainedpatternminingsharesasimilarphilosophyaspushingselectionsdeeplyinqueryprocessing,119,约束频繁项集挖掘:一种挖掘查询优化问题,GivenafrequentpatternminingquerywithasetofconstraintsC,thealgorithmshouldbesound:itonlyfindsfrequentsetsthatsatisfythegivenconstraintsCcomplete:allfrequentsetssatisfyingthegivenconstraintsCarefoundAnavesolutionFirstfindallfrequentsets,andthentestthemforconstraintsatisfactionMoreefficientapproaches:AnalyzethepropertiesofconstraintscomprehensivelyPushthemasdeeplyaspossibleinsidethefrequentpatterncomputation.,120,基于约束挖掘的非单调性,Anti-monotonicityWhenanitemsetSviolatestheconstraint,sodoesanyofitssupersetsum(S.Price)visanti-monotonesum(S.Price)visnotanti-monotoneExample.C:range(S.profit)15isanti-monotoneItemsetabviolatesCSodoeseverysupersetofab,TDB(min_sup=2),121,什么约束是非单调的(Anti-Monotone)?,122,什么约束是单调的(Monotone)?,123,简洁(Succinctness),Succinctness:GivenA1,thesetofitemssatisfyingasuccinctnessconstraintC,thenanysetSsatisfyingCisbasedonA1,i.e.,ScontainsasubsetbelongingtoA1Idea:Withoutlookingatthetransactiondatabase,whetheranitemsetSsatisfiesconstraintCcanbedeterminedbasedontheselectionofitemsmin(S.Price)vissuccinctsum(S.Price)visnotsuccinctOptimization:IfCissuccinct,Cispre-countingpushable,124,什么约束是简洁的(Succinct?),125,Apriori算法示例,DatabaseD,ScanD,C1,L1,L2,C2,C2,ScanD,C3,L3,ScanD,126,NaveAlgorithm:Apriori+Constraint,DatabaseD,ScanD,C1,L1,L2,C2,C2,ScanD,C3,L3,ScanD,Constraint:SumS.price5,127,TheConstrainedAprioriAlgorithm:PushanAnti-monotoneConstraintDeep,DatabaseD,ScanD,C1,L1,L2,C2,C2,ScanD,C3,L3,ScanD,Constraint:SumS.price5,128,约束的Apriori算法:PushaSuccinctConstraintDeep,DatabaseD,ScanD,C1,L1,L2,C2,C2,ScanD,C3,L3,ScanD,Constraint:minS.price=1,129,转换“强”约束(ToughConstraints),Converttoughconstraintsintoanti-monotoneormonotonebyproperlyorderingitemsExamineC:avg(S.profit)25Orderitemsinvalue-descendingorderIfanitemsetafbviolatesCSodoesafbh,afb*Itbecomesanti-monotone!,TDB(min_sup=2),130,可变约束(ConvertibleConstraints),LetRbeanorderofitemsConvertibleanti-monotoneIfanitemsetSviolatesaconstraintC,sodoeseveryitemsethavingSasaprefixw.r.t.REx.avg(S)vw.r.t.itemvaluedescendingorderConvertiblemonotoneIfanitemsetSsatisfiesconstraintC,sodoeseveryitemsethavingSasaprefixw.r.t.REx.avg(S)vw.r.t.itemvaluedescendingorder,131,可变约束的挖掘,C:avg(S.profit)25ListofitemsineverytransactioninvaluedescendingorderR:Cisconvertibleanti-monotonew.r.t.RScantransactionDBonceremoveinfrequentitemsItemhintransaction40isdroppedItemsetsaandfaregood,TDB(min_sup=2),132,强可变约束(ConvertibleConstraints),avg(X)25isconvertibleanti-monotonew.r.t.itemvaluedescendingorderR:IfanitemsetafviolatesaconstraintC,sodoeseveryitemsetwithafasprefix,suchasafdavg(X)25isconvertiblemonotonew.r.t.itemvalueascendingorderR-1:IfanitemsetdsatisfiesaconstraintC,sodoesitemsetsdfanddfa,whichhavingdasaprefixThus,avg(X)25isstron
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025年学历类自考专业(小学教育)比较教育-中小学教育管理参考题库含答案解析(5套)
- 2020建筑方案设计答案(3篇)
- 2025年垃圾填埋气发电技术创新与环保产业融合发展报告
- 2025年学历类自考专业(小学教育)小学班主任-小学班主任参考题库含答案解析(5套)
- 医院电子病历系统2025年升级方案:优化与临床决策支持报告
- 月嫂培训知识宝典班课件
- 2025年学历类自考专业(学前教育)学前教育政策与法规-幼儿园课程参考题库含答案解析(5套)
- 2025年学历类自考专业(学前教育)学前教育学-幼儿园教育基础参考题库含答案解析(5套)
- 2025年学历类自考专业(学前教育)学前教育史-幼儿园教育活动设计与组织参考题库含答案解析(5套)
- 智能化康复医疗器械2025年市场需求与产品创新解决方案研究报告
- 2025年企业劳动者雇佣合同样本
- 安徽省高一英语必修一单词表
- 企业级实验设备的投资回报分析方法
- DB37T 5133-2019 预制双面叠合混凝土剪力墙结构技术规程
- 老年上消化道出血急诊诊疗专家共识(2024版)解读
- 顺产产后护理查房
- 《糖尿病饮食教育》课件
- 承包村里集体建设用地协议范文
- 胸腰椎骨折的康复治疗
- 第五讲铸牢中华民族共同体意识-2024年形势与政策
- 软件系统技术报告模板
评论
0/150
提交评论