




已阅读5页,还剩116页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
29.05.2020,DataMining:PrinciplesandAlgorithms,1,DataMining:ConceptsandTechniquesChapter99.3.MultirelationalDataMining,JiaweiHanandMichelineKamberDepartmentofComputerScienceUniversityofIllinoisatUrbana-C/hanj2006JiaweiHanandMichelineKamber.Allrightsreserved.Acknowledgements:XiaoxinYin,29.05.2020,DataMining:PrinciplesandAlgorithms,2,Multi-RelationalandMulti-DBMining,Classificationovermultiple-relationsindatabasesClusteringovermulti-relationsbyUser-GuidanceMiningacrossmulti-relationaldatabasesMiningacrossmultipleheterogeneousdataandinformationrepositoriesSummary,29.05.2020,DataMining:PrinciplesandAlgorithms,3,MultirelationalDataMining,Classificationovermultiple-relationsindatabasesClusteringovermulti-relationsbyuser-guidanceLinkClus:EfficientclusteringbyexploringthepowerlawdistributionDistinct:DistinguishingobjectswithidenticalnamesbylinkanalysisMiningacrossmultipleheterogeneousdataandinformationrepositoriesSummary,29.05.2020,DataMining:PrinciplesandAlgorithms,4,Outline,Theme:“Knowledgeispower,butknowledgeishiddeninmassivelinks”,StartingwithPageRankandHITSCrossMine:Classificationofmulti-relationsbylinkanalysisCrossClus:Clusteringovermulti-relationsbyuser-guidanceMorerecentworkandconclusions,29.05.2020,DataMining:PrinciplesandAlgorithms,5,TraditionalDataMining,Workonsingle“flat”relations,flatten,Contact,LoseinformationoflinkagesandrelationshipsCannotutilizeinformationofdatabasestructuresorschemas,29.05.2020,DataMining:PrinciplesandAlgorithms,6,Multi-RelationalDataMining(MRDM),MotivationMoststructureddataarestoredinrelationaldatabasesMRDMcanutilizelinkageandstructuralinformationKnowledgediscoveryinmulti-relationalenvironmentsMulti-relationalrulesMulti-relationalclusteringMulti-relationalclassificationMulti-relationallinkageanalysis,29.05.2020,DataMining:PrinciplesandAlgorithms,7,ApplicationsofMRDM,e-Commerce:discoveringpatternsinvolvingcustomers,products,manufacturers,Bioinformatics/Medicaldatabases:discoveringpatternsinvolvinggenes,patients,diseases,Networkingsecurity:discoveringpatternsinvolvinghosts,connections,services,ManyotherrelationaldatasourcesExample:EvidenceExtractionandLinkDiscovery(EELD):ADARPA-fundingprojectthatemphasizesmulti-relationalandmulti-databaselinkageanalysis,29.05.2020,DataMining:PrinciplesandAlgorithms,8,ImportanceofMulti-relationalClassification(fromEELDProgramDescription),TheobjectiveoftheEELDProgramistoresearch,develop,demonstrate,andtransitioncriticaltechnologythatwillenablesignificantimprovementinourabilitytodetectasymmetricthreats,e.g.,alooselyorganizedterroristgroup.Patternsofactivitythat,inisolation,areoflimitedsignificancebut,whencombined,areindicativeofpotentialthreats,willneedtobelearned.Addressingthesethreatscanonlybeaccomplishedbydevelopinganewlevelofautonomicinformationsurveillanceandanalysistoextract,discover,andlinktogethersparseevidencefromvastamountsofdatasources,indifferentformatsandwithdifferingtypesanddegreesofstructure,torepresentandevaluatethesignificanceoftherelatedevidence,andtolearnpatternstoguidetheextraction,discovery,linkageandevaluationprocesses.,29.05.2020,DataMining:PrinciplesandAlgorithms,9,MRDMApproaches,InductiveLogicProgramming(ILP)FindmodelsthatarecoherentwithbackgroundknowledgeMulti-relationalClusteringAnalysisClusteringobjectswithmulti-relationalinformationProbabilisticRelationalModelsModelcross-relationalprobabilisticdistributionsEfficientMulti-RelationalClassificationTheCrossMineApproachYinetal,2004,29.05.2020,DataMining:PrinciplesandAlgorithms,10,InductiveLogicProgramming(ILP),Findahypothesisthatisconsistentwithbackgroundknowledge(trainingdata)FOIL,Golem,Progol,TILDE,BackgroundknowledgeRelations(predicates),Tuples(groundfacts),Trainingexamples,Backgroundknowledge,29.05.2020,DataMining:PrinciplesandAlgorithms,11,InductiveLogicProgramming(ILP),HypothesisThehypothesisisusuallyasetofrules,whichcanpredictcertainattributesincertainrelationsDaughter(X,Y)female(X),parent(Y,X),29.05.2020,DataMining:PrinciplesandAlgorithms,12,FOIL:First-OrderInductiveLearner,FindasetofrulesconsistentwithtrainingdataE.g.female(X),parent(Y,X)daughter(X,Y)Atop-down,sequentialcoveringlearnerBuildeachrulebyheuristicsFoilgainaspecialtypeofinformationgain,ExamplescoveredbyRule1,Allexamples,ExamplescoveredbyRule2,ExamplescoveredbyRule3,29.05.2020,DataMining:PrinciplesandAlgorithms,13,ILPApproaches,Top-downApproaches(e.g.FOIL)while(enoughexamplesleft)generatearuleremoveexamplessatisfyingthisruleBottom-upApproaches(e.g.Golem)UseeachexampleasaruleGeneralizerulesbymergingrulesDecisionTreeApproaches(e.g.TILDE),29.05.2020,DataMining:PrinciplesandAlgorithms,14,ILPProsandCons,AdvantagesExpressiveandpowerfulRulesareunderstandableDisadvantagesInefficientfordatabaseswithcomplexschemasNotappropriateforcontinuousattributes,29.05.2020,DataMining:PrinciplesandAlgorithms,15,AutomaticallyClassifyingObjectsUsingMultipleRelations,Whynotconvertmultiplerelationaldataintoasingletablebyjoins?Relationaldatabasesaredesignedbydomainexpertsviasemanticmodeling(e.g.,E-Rmodeling)IndiscriminativejoinsmayloosesomeessentialinformationOneuniversalrelationmaynotbeappealingtoefficiency,scalabilityandsemanticspreservationOurapproachtomulti-relationalclassification:Automaticallyclassifyingobjectsusingmultiplerelations,29.05.2020,DataMining:PrinciplesandAlgorithms,16,AnExample:LoanApplications,Applyforloan,Approveornot?,Askthebackenddatabase,29.05.2020,DataMining:PrinciplesandAlgorithms,17,TheBackendDatabase,Howtomakedecisionstoloanapplications?,29.05.2020,DataMining:PrinciplesandAlgorithms,18,Roadmap,MotivationRule-basedClassificationTupleIDPropagationRuleGenerationNegativeTupleSamplingPerformanceStudy,29.05.2020,DataMining:PrinciplesandAlgorithms,19,Rule-basedClassification,Everboughtahouse,LiveinChicago,Approve!,Justapplyforacreditcard,Reject,Applicant,Applicant,29.05.2020,DataMining:PrinciplesandAlgorithms,20,RuleGeneration,Applicant#1,Applicant#2,Applicant#3,Applicant#4,LoanApplications,Accounts,Orders,Districts,Otherrelations,Searchforgoodpredicatesacrossmultiplerelations,29.05.2020,DataMining:PrinciplesandAlgorithms,21,PreviousApproaches,InductiveLogicProgramming(ILP)TobuildaruleRepeatedlyfindthebestpredicateToevaluateapredicateonrelationR,firstjointargetrelationwithRNotscalablebecauseHugesearchspace(numerouscandidatepredicates)NotefficienttoevaluateeachpredicateToevaluateapredicateLoan(L,+):-Loan(L,A,?,?,?,?),Account(A,?,monthly,?)firstjoinloanrelationwithaccountrelationCrossMineismorescalableandmorethanonehundredtimesfasterondatasetswithreasonablesizes,29.05.2020,DataMining:PrinciplesandAlgorithms,22,CrossMine:AnEfficientandAccurateMulti-relationalClassifier,Tuple-IDpropagation:anefficientandflexiblemethodforvirtuallyjoiningrelationsConfinetherulesearchprocessinpromisingdirectionsLook-one-ahead:amorepowerfulsearchstrategyNegativetuplesampling:improveefficiencywhilemaintainingaccuracy,29.05.2020,DataMining:PrinciplesandAlgorithms,23,Roadmap,MotivationRule-basedClassificationTupleIDPropagationRuleGenerationNegativeTupleSamplingPerformanceStudy,29.05.2020,DataMining:PrinciplesandAlgorithms,24,TupleIDPropagation,0+,0,0+,1,0+,1,2+,0,Labels,1,2,02/27/93,monthly,124,Null,01/01/97,weekly,67,4,12/09/96,monthly,45,3,09/23/97,weekly,108,PropagatedID,Opendate,Frequency,AccountID,Applicant#1,Applicant#2,Applicant#3,Applicant#4,PropagatetupleIDsoftargetrelationtonon-targetrelationsVirtuallyjoinrelationstoavoidthehighcostofphysicaljoins,Possiblepredicates:Frequency=monthly:2+,1Opendatethresholdthenaddptocurrentruleelsebreak,Positiveexamples,Negativeexamples,A3=1,A3=1&A1=2,A3=1&A1=2&A8=5,29.05.2020,DataMining:PrinciplesandAlgorithms,29,EvaluatingPredicates,AllpredicatesinarelationcanbeevaluatedbasedonpropagatedIDsUsefoil-gaintoevaluatepredicatesSupposecurrentruleisr.Forapredicatep,foil-gain(p)=CategoricalAttributesComputefoil-gaindirectlyNumericalAttributesDiscretizewitheverypossiblevalue,29.05.2020,DataMining:PrinciplesandAlgorithms,30,RuleGeneration,StartfromthetargetrelationOnlythetargetrelationisactiveRepeatSearchinallactiverelationsSearchinallrelationsjoinabletoactiverelationsAddthebestpredicatetothecurrentruleSettheinvolvedrelationtoactiveUntilThebestpredicatedoesnothaveenoughgainCurrentruleistoolong,29.05.2020,DataMining:PrinciplesandAlgorithms,31,RuleGeneration:Example,Targetrelation,Firstpredicate,Secondpredicate,RangeofSearch,Addbestpredicatetorule,29.05.2020,DataMining:PrinciplesandAlgorithms,32,Look-one-aheadinRuleGeneration,Twotypesofrelations:EntityandRelationshipOftencannotfindusefulpredicatesonrelationsofrelationship,TargetRelation,SolutionofCrossMine:WhenpropagatingIDstoarelationofrelationship,propagateonemoresteptonextrelationofentity.,Nogoodpredicate,29.05.2020,DataMining:PrinciplesandAlgorithms,33,Roadmap,MotivationRule-basedClassificationTupleIDPropagationRuleGenerationNegativeTupleSamplingPerformanceStudy,29.05.2020,DataMining:PrinciplesandAlgorithms,34,NegativeTupleSampling,ArulecoverssomepositiveexamplesPositiveexamplesareremovedaftercoveredAftergeneratingmanyrules,therearemuchlesspositiveexamplesthannegativeones,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,+,29.05.2020,DataMining:PrinciplesandAlgorithms,35,NegativeTupleSampling(cont.),WhentherearemuchmorenegativeexamplesthanpositiveonesCannotbuildgoodrules(lowsupport)Stilltimeconsuming(largenumberofnegativeexamples)MakesamplingonnegativeexamplesImproveefficiencywithoutaffectingrulequality,+,+,+,+,+,29.05.2020,DataMining:PrinciplesandAlgorithms,36,Roadmap,MotivationRule-basedClassificationTupleIDPropagationRuleGenerationNegativeTupleSamplingPerformanceStudy,29.05.2020,DataMining:PrinciplesandAlgorithms,37,Syntheticdatasets:,Scalabilityw.r.t.numberofrelations,Scalabilityw.r.t.numberoftuples,29.05.2020,DataMining:PrinciplesandAlgorithms,38,RealDataset,PKDDCup99datasetLoanApplication,Mutagenesisdataset(4relations),29.05.2020,DataMining:PrinciplesandAlgorithms,39,Multi-RelationalClassification:Summary,ClassificationacrossmultiplerelationsInterestingpiecesofinformationoftenlieacrossmultiplerelationsItisdesirabletomineacrossmultipleinterconnectedrelationsNewmethodologyinCrossMine(forclassificationmodelbuilding)ID(andclasslabel)propagationleadstoefficiencyandeffectiveness(bypreservingsemantics)inCrossMineRulegenerationandnegativetuplesamplingleadtofurtherimprovedperformanceOurperformancestudyshowsordersofmagnitudefasterandhighaccuracycomparingwiththeRelationalMiningapproachFuturework:classificationinheterogeneousrelationaldatabases,29.05.2020,DataMining:PrinciplesandAlgorithms,40,MultirelationalDataMining,Classificationovermultiple-relationsindatabasesClusteringovermulti-relationsbyuser-guidanceLinkClus:EfficientclusteringbyexploringthepowerlawdistributionDistinct:DistinguishingobjectswithidenticalnamesbylinkanalysisMiningacrossmultipleheterogeneousdataandinformationrepositoriesSummary,29.05.2020,DataMining:PrinciplesandAlgorithms,41,Multi-RelationalandMulti-DBMining,Classificationovermultiple-relationsindatabasesClusteringovermulti-relationsbyUser-GuidanceMiningacrossmulti-relationaldatabasesMiningacrossmultipleheterogeneousdataandinformationrepositoriesSummary,29.05.2020,DataMining:PrinciplesandAlgorithms,42,Motivation1:Multi-RelationalClustering,office,position,Student,name,Targetofclustering,TraditionalclusteringworksonasingletableMostdataissemanticallylinkedwithmultiplerelationsThusweneedinformationinmultiplerelations,29.05.2020,DataMining:PrinciplesandAlgorithms,43,Motivation2:User-GuidedClustering,Course,Userusuallyhasagoalofclustering,e.g.,clusteringstudentsbyresearchareaUserspecifieshisclusteringgoaltoCrossClus,29.05.2020,DataMining:PrinciplesandAlgorithms,44,ComparingwithClassification,User-specifiedfeature(intheformofattribute)isusedasahint,notclasslabelsTheattributemaycontaintoomanyortoofewdistinctvaluesE.g.,ausermaywanttoclusterstudentsinto20clustersinsteadof3Additionalfeaturesneedtobeincludedinclusteranalysis,Alltuplesforclustering,Userhint,29.05.2020,DataMining:PrinciplesandAlgorithms,45,ComparingwithSemi-supervisedClustering,Semi-supervisedclusteringWagstaff,etal01,Xing,etal.02Userprovidesatrainingsetconsistingof“similar”and“dissimilar”pairsofobjectsUser-guidedclusteringUserspecifiesanattributeasahint,andmorerelevantfeaturesarefoundforclustering,Alltuplesforclustering,Semi-supervisedclustering,Alltuplesforclustering,User-guidedclustering,x,29.05.2020,DataMining:PrinciplesandAlgorithms,46,Semi-supervisedClustering,Muchinformation(inmultiplerelations)isneededtojudgewhethertwotuplesaresimilarAusermaynotbeabletoprovideagoodtrainingsetItismucheasierforausertospecifyanattributeasahint,suchasastudentsresearcharea,Tuplestobecompared,29.05.2020,DataMining:PrinciplesandAlgorithms,47,CrossClus:AnOverview,Useanewtypeofmulti-relationalfeaturesofclusteringMeasuresimilaritybetweenfeaturesbyhowtheyclusterobjectsintogroupsUseaheuristicmethodtosearchforpertinentfeaturesUseak-medoids-basedalgorithmforclustering,29.05.2020,DataMining:PrinciplesandAlgorithms,48,Roadmap,OverviewFeaturePertinenceSearchingforFeaturesClusteringExperimentalResults,29.05.2020,DataMining:PrinciplesandAlgorithms,49,Multi-relationalFeatures,Amulti-relationalfeatureisdefinedby:Ajoinpath.E.g.,StudentRegisterOpenCourseCourseAnattribute.E.g.,Course.area(Fornumericalfeature)anaggregationoperator.E.g.,sumoraverageCategoricalFeaturef=StudentRegisterOpenCourseCourse,Course.area,null,areasofcoursesofeachstudent,Valuesoffeaturef,NumericalFeature,e.g.,averagegradesofstudentsh=StudentRegister,Register.grade,averageE.g.h(t1)=3.5,29.05.2020,DataMining:PrinciplesandAlgorithms,50,RepresentingFeatures,Mostimportantinformationofafeaturefishowfclustersobjectsintogroupsfisrepresentedbysimilaritiesbetweeneverypairofobjectsindicatedbyf,Similaritybetweeneachpairoftuplesindicatedbyf.Thehorizontalaxesarethetupleindices,andtheverticalaxisisthesimilarity.ThiscanbeconsideredasavectorofNxNdimensions.,SimilarityvectorVf,29.05.2020,DataMining:PrinciplesandAlgorithms,51,SimilaritybetweenTuples,CategoricalfeaturefDefinedastheprobabilityt1andt2havingthesamevaluee.g.wheneachofthemselectsanothercourse,whatistheprobabilitytheyselectcoursesofthesameareaNumericalfeatureh,simf(t1,t2)=0.5*0.3+0.5*0.3=0.3,29.05.2020,DataMining:PrinciplesandAlgorithms,52,SimilarityBetweenFeatures,ValuesofFeaturefandg,Similaritybetweentwofeaturescosinesimilarityoftwovectors,Vf,Vg,29.05.2020,DataMining:PrinciplesandAlgorithms,53,ComputingFeatureSimilarity,Objects,Featuref,Featureg,DB,AI,TH,Infosys,Cogsci,Theory,Similaritybetweenfeaturevaluesw.r.t.theobjects,sim(fk,gq)=i=1toNf(ti).pkg(ti).pq,Objectsimilarities,hardtocompute,Featurevaluesimilarities,easytocompute,Computesimilaritybetweeneachpairoffeaturevaluesbyonescanondata,29.05.2020,DataMining:PrinciplesandAlgorithms,54,SimilaritybetweenCategoricalandNumericalFeatures,Partsdependingonti,Partsdependingonalltiwithji,29.05.2020,DataMining:PrinciplesandAlgorithms,55,SimilaritybetweenNumericalFeatures,SimilaritybetweennumericalfeatureshandgSupposeobjectsareorderedaccordingtohComputingVhVgScanobjectsinorderofhWhenscanningeachobjectt*,maintainthesetofobjectstwith0h(t*)-h(t)h,inabinarytreesortedwithgUpdateVhVgbyalltwith0h(t*)-h(t*)h,|g(t)-g(t*)|g,Objectstsatisfying0h(t*)-h(t)h,|g(t)-g(t*)|g,t*,Featureh,Objects,t*,29.05.2020,DataMining:PrinciplesandAlgorithms,56,Roadmap,OverviewFeaturePertinenceSearchingforFeaturesClusteringExperimentalResults,29.05.2020,DataMining:PrinciplesandAlgorithms,57,SearchingforPertinentFeatures,DifferentfeaturesconveydifferentaspectsofinformationFeaturesconveyingsameaspectofinformationusuallyclusterobjectsinmoresimilarwaysresearchgroupareasvs.conferencesofpublicationsGivenuserspecifiedfeatureFindpertinentfeaturesbycomputingfeaturesimilarity,29.05.2020,DataMining:PrinciplesandAlgorithms,58,HeuristicSearchforPertinentFeatures,Overallprocedure1.Startfromtheuser-specifiedfeature2.Searchinneighborhoodofexistingpertinentfeatures3.Expandsearchrangegradually,TupleIDpropagationYin,etal.04isusedtocreatemulti-relationalfeaturesIDsoftargettuplescanbepropagatedalonganyjoinpath,fromwhichwecanfindtuplesjoinablewitheachtargettuple,29.05.2020,DataMining:PrinciplesandAlgorithms,59,Roadmap,OverviewFeaturePertinenceSearchingforFeaturesClusteringExperimentalResults,29.05.2020,DataMining:PrinciplesandAlgorithms,60,ClusteringwithMulti-RelationalFeature,GivenasetofLpertinentfeaturesf1,fL,similaritybetweentwoobjectsWeightofafeatureisdeterminedinfeaturesearchbyitssimilaritywithotherpertinentfeaturesForclustering,weuseCLARANS,ascalablek-medoidsNg&Han94algorithm,29.05.2020,DataMining:PrinciplesandAlgorithms,61,Roadmap,OverviewFeaturePertinenceSearchingforFeaturesClusteringExperimentalResults,29.05.2020,DataMining:PrinciplesandAlgorithms,62,Experiments:CompareCrossCluswith,Baseline:OnlyusetheuserspecifiedfeaturePROCLUSAggarwal,etal.99:astate-of-the-artsubspaceclusteringalgorithmUseasubsetoffeaturesforeachclusterWeconvertrelationaldatabasetoatablebypropositionalizationUser-specifiedfeatureisforcedtobeusedineveryclusterRDBCKirstenandWrobel00ArepresentativeILPclusteringalgorithmUseneighborinformationofobjectsforclusteringUser-specifiedfeatureisforcedtobeused,29.05.2020,DataMining:PrinciplesandAlgorithms,63,ClusteringAccuracy,ToverifythatCrossCluscapturesusersclusteringgoal,wedefine“accuracy”ofclusteringGivenaclusteringtaskManuallyfindallfeaturesthatcontaininformationdirectlyrelatedtotheclusteringtaskstandardfeaturesetE.g.,ClusteringstudentsbyresearchareasStandardfeatureset:researchgroup,groupareas,advisors,conferencesofpublications,courseareasAccuracyofclusteringresult:howsimilaritistotheclusteringgeneratedbystandardfeatureset,29.05.2020,DataMining:PrinciplesandAlgorithms,64,MeasureofClusteringAccuracy,AccuracyMeasuredbymanuallylabeleddataWemanuallyassigntuplesintoclustersaccordingtotheirproperties(e.g.,professorsindifferentresearchareas)Accuracyofclustering:Percentageofpairsoftuplesinthesameclu
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 突发公共卫生事件护理
- 智能制造印刷技术指南
- 化学工业废物处理规定
- 物业管理遗失物品处理方案
- 冒险卡通动漫报告
- 2025新疆兵团粮安储备粮管理有限责任公司招聘19人笔试含答案
- 2025西安光环电子科技有限公司招聘(3-5人)笔试含答案
- 2025年铁岭银行见习生招聘50人笔试含答案
- 企业规章制度的协同与协作
- 2025年事业单位工勤技能-福建-福建计算机信息处理员三级高级历年参考题库含答案解析
- 中成药合理使用培训课件
- 贷款熔断管理办法
- 2025年公安部交管局三力测试题库及答案
- 设备设施运行台账教学幻灯片
- 封路店铺经营补偿方案
- 职业病危害事故救援应急预案
- 2025深入贯彻中央八项规定精神学习教育测试题和答案
- 先天性甲状腺功能减退症诊治指南解读课件
- 学校保安法律知识培训
- 医生进基层活动方案
- 2025-2030年中国蔬果保鲜剂行业市场深度调研及发展趋势与投资价值评估研究报告
评论
0/150
提交评论