版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Chapter19ClusteringAnalysis
Chapter19ClusteringAnalysis1ContentSimilaritycoefficientHierarchicalclusteringanalysis
Dynamicclusteringanalysis
OrderedsampleclusteringanalysisContentSimilaritycoefficient2DiscriminantAnalysis:havingknownwithcertaintytocomefromtwoormorepopulations,it’samethodtoacquirethediscriminatemodelthatwillallocatefurtherindividualstothecorrectpopulation.
ClusteringAnalysis:astatisticmethodforgroupingobjectsofrandomkindintorespectivecategories.It’susedwhenthere’snopriorihypotheses,buttryingtofindthemostappropriatesortingmethodresortingtomathematicalstatisticsandsomecollectedinformation.Ithasbecomethefirstselectedmeanstouncovergreatcapacityofgeneticmessages.
Botharemethodsofmultivariatestatisticstostudyclassification.
DiscriminantAnalysis:h3Clusteringanalysisisamethodofexploringstatisticalanalysis.Itcanbeclassifiedintotwomajorspeciesaccordingtoitsaims.Forexample,mreferstothenumberofvariables(i.e.indexes)whilenreferstothatofcases(i.e.samples),youcandoasfollows:
(1)R-typeclustering:alsocalledindexclustering.Themethodtosortthemkindsofindexes,aimingatloweringthedimensionofindexesandchoosingtypicalones.
(2)Q-typeclustering:alsocalledsampleclustering.Themethodtosortthenkindsofsamplestofindthecommonnessamongthem.Clusteringanalysisisa4ThemostimportantthingforbothR-typeclusteringandQ-typeclusteringisthedefinitionofsimilarity,thatishowtoquantifysimilarity.Thefirststepofclusteringistodefinethemetricsimilaritybetweentwoindexesortwosamples-similaritycoefficientThemostimportantthingfo5§1similaritycoefficient
1similaritycoefficientofR-typeclusteringSupposetherearemkindsofvariables:X1,X2,…,Xm.R-typeclusteringusuallyusetheabsolutevalueofsimplecorrelationcoefficienttodefinethesimilaritycoefficientamongvariables:Thetwovariablestendtobemoresimilarwhentheabsolutevalueincreases.Similarly,Spearmanrankcorrelationcoefficientcanbeusedtodefinethesimilaritycoefficientofnon-normalvariables.Butwhenthevariablesareallqualitativevariables,it’sbesttousecontingencycoefficient.
§1similaritycoefficient162.SimilaritycoefficientcommonlyusedinQ-typeclustering:Supposetherearencasesregardasnspotsinamdimensionsspace,distancebetweentwospotscanbeusedtodefinesimilaritycoefficient,thetwosamplestendtobemoresimilarwhenthedistancedeclines.(1)Euclideandistance
(2)Manhattandistance
(3)Minkowskidistance:
AbsolutedistancereferstoMinkowskidistancewhenq=1;Euclideandistanceisdirect-viewingandsimpletocompute,buthavingnotregardedthecorrelatedrelationsamongvariables.That’swhyManhattandistancewasintroduced.(19-5)2.Similaritycoefficientcomm7(4)Mahalanobisdistance:it’susedtoexpressthesamplecovariancematrixamongmkindsofvariables.Itcanbeworkedoutasfollows:
Whenit’saunitmatrix,MahalanobisdistanceequalstothesquareofEuclideandistance.
Allofthefourdistancesrefertoquantitativevariables,forthequalitativevariablesandordinalvariables,quantizationisneededbeforeusing.(4)Mahalanobisdistance:it’s8§2HierarchicalClusteringAnalysisHierarchicalclusteringanalysisisamostcommonlyusedmethodtosortoutsimilarsamplesorvariables.Theprocessisasfollows:
1)Atthebeginning,samples(orvariables)areregardedrespectivelyasonesinglecluster,thatis,eachclustercontainsonlyonesample(orvariable).Thenworkoutsimilaritycoefficientmatrixamongclusters.Thematrixismadeupofsimilaritycoefficientsbetweensamples(orvariables).Similaritycoefficientmatrixisasymmetricalmatrix.
2)Thetwoclusterswiththemaximumsimilaritycoefficient(minimumdistanceormaximumcorrelationcoefficient)aremergedintoanewcluster.Computethesimilaritycoefficientbetweenthenewclusterwithotherclusters.Repeatsteptwountilallofthesamples(orvariables)aremergedintoonecluster.§2HierarchicalClustering9Thecalculationofsimilaritycoefficientbetweenclusters
Eachstepofhierarchicalclusteringhastocalculatethesimilaritycoefficientamongclusters.Whenthereisonlyonesampleorvariableineachofthetwoclusters,thesimilaritycoefficientbetweenthemequalstothatofthetwosamplesorthetwovariables,orcomputeaccordingtosectionone.
Whentherearemorethanonesampleorvariableineachcluster,manykindsofmethodscanbeusedtocomputesimilaritycoefficient.Justlist5kindsofmethodsasfollows.andrefertothetwoclusters,whichrespectivelyhasorkindsofsamplesorvariables.
Thecalculationofsimilarity101.ThemaximumsimilaritycoefficientmethodIfthere’rerespectively,samples(orvariables)inclusterand,here’realtogetherandsimilaritycoefficientsbetweenthetwoclusters,butonlythemaximumisconsideredasthesimilaritycoefficientofthetwoclusters.
Attention:theminimumdistancealsomeansthemaximumsimilaritycoefficient.
2.TheMinimumsimilaritycoefficientmethodsimilaritycoefficientbetweenclusterscanbe
calculatedasfollows:
1.Themaximumsimilaritycoeff113.Thecenterofgravitymethod(onlyusedinsampleclustering)Theweightsaretheindexmeansamongclusters.Itcanbecomputedasfollows:
4.Clusterequilibrationmethod(onlyusedin
sample
clustering)workouttheaveragesquaredistancebetweentwosamplesofeachcluster.
Clusterequilibrationisoneofthegoodmethodsinthehierarchicalclustering,becauseitcanfullyreflecttheindividualinformationwithinacluster.
3.Thecenterofgravitymeth125.sumofsquaresofdeviations
methodalsocalledWardmethod,onlyforsampleclustering.Itimitatesthebasicthoughtsofvarianceanalysis,thatis,arationalclassificationcanmakethesumofsquaresofdeviationwithinaclustersmaller,whilethatamongclusterslarger.Supposethatsampleshavebeenclassifiedintogclusters,includingand.Thesumofsquaresofdeviationsofclusterfromsamplesis:(isthemeanof).Themergedsumofsquaresofdeviationsofallthegclustersis.Ifandaremerged,therewillbeg-1clusters.
Theincrementofmergedsumofsquaresofdeviationsis,whichisdefinedasthesquaredistancebetweenthetwoclusters.Obviously,whennsamplesrespectivelyformsasinglecluster,themergedsumofsquaresofdeviationis0.5.sumofsquaresofdeviations13Sample19-1There’refourvariablessurveyingfrom3454femaleadults:height(X1)、lengthoflegs(X2)、waistline(X3)andchestcircumference(X4).Thecorrelationmatrixhasbeenworkedoutasfollows:
Trytousehierarchicalclusteringtoclusterthe4indexes.
ThisisacaseofR-type(index)clustering.Wechoosesimplesimilaritycoefficientasthesimilaritycoefficient,andusemaximumsimilaritycoefficientmethodtocalculatethesimilaritycoefficientamongclusters.Sample19-1There’refou14
Theclusteringprocedureislistedasfollows:(1)eachindexisregardedasasingleclusterG1={X1},G2={X2},G3={X3},G4={X4}.There’realtogether4clusters.
(2)Mergethetwoclusterswithmaximumsimilaritycoefficientintoanewcluster.Inthiscase,wemergeG1andG2(similaritycoefficientis0.852)asG5={X1,X2}.CalculatethesimilaritycoefficientamongG5、G3andG4.
ThesimilarmatrixamongG3,G4andG5:Theclusteringprocedure15
(3)MergeG3andG4asG6={G3,G4},forthistimethesimilaritycoefficientbetweenG3andG4ranksthelargest(0.732).ComputethesimilaritycoefficientbetweenG6andG5.
(4)LastlyG5andG6aremergedintooneclusterG7={G5,G6},whichinfactincludesalltheprimitiveindexes.(3)MergeG3andG4asG6={16Drawthehierarchicaldendrogram(picture19-1)accordingtotheprocessofclustering.Asthepictureindicates,it’sbettertobeclassifiedintotwoclusters:{X1,X2},{X3,X4}.Thatis,lengthindexasoneclusterwhilecircumferenceastheotherone.
height
lengthwaistlinechestoflegscircumference
Picture19-1hierarchicaldendrogramwith4indexesDrawthehierarchicalden17Sample19-2Table19-1liststhemeansofenergyexpenditureandsugarexpenditureoffourathleticitemsfromsixathletes.Inordertoprovidecorrespondentdietarystandardtoimproveperformancerecord,pleaseclustertheathleticitemsusinghierarchicalclustering.
Table19-1measurevaluesof4athleticitemsAthleticitemsEnergyexpenditureX1(joule/minute、m2)SugarexpenditureX2(%)WeightloadingcrouchingG127.89261.421.3150.688Pull-upG223.47556.830.1740.088Push-upsG318.92445.13-1.001-1.441Sit-upG420.91361.25-0.4880.665Sample19-2Table19-118
WechooseMinkowskidistanceinthissample,anduseminimumsimilaritycoefficientmethodtocalculatedistancesamongclusters.Toreducetheeffectofvariabledimensions,thevariablesshouldbestandardizedbeforeanalysis.respectivelyreferstothesamplemeanandstandarddeviationofXi.Thedataaftertransformationarelistedintable19-1.WechooseMinkowskidistanc19Theclusteringprocess:
(1)computethesimilaritycoefficientmatrix(i.e.distancematrix)ofthe4samples.Thedistanceofweightloadingcrouchingandpull-upscanbeworkoutusingformula(19-3).
Likewise,thedistancebetweenweightloadingcrouchingandpush-upscanbecomputedasfollows:Lastly,workoutthedistancematrix:
Theclusteringprocess:
(20(2)ThedistancebetweenG2andG4istheminimum,soG2andG4shouldbeemergedintoanewclusterG5={G2,G4}.ComputethedistancebetweenG5andotherclustersusingminimumsimilaritycoefficientmethodaccordingtoformula(19-8).
ThedistancematrixofG1,G3andG5:
(3)MergeG1andG5intoanewclusterG6={G1,G5}.ComputethedistancebetweenG6andG3:(4)lastlymergeG1andG6intoG7={G1,G6}.Alltheindexeshaveallbeenmergedintoalargecluster.(2)ThedistancebetweenG221
Accordingtotheprocessofclustering,drawoutthethehierarchydendrogram(chart19-2).Asthehierarchydendrogramshowsandexpertisewehavelearned,theindexesshouldbesortedintotwoclusters:{G1,G2,G4}and{G3}.Physicalenergyexpenditureinweightloadingcrouching、pull-upsandsit-upswouldbemuchhigher,dietarystandardimprovementmightberequiredinthoseitemsduringtraining.Accordingtotheprocess22
Analysisofclusteringexamples
Differentdefinitionofsimilaritycoefficientandthatamongclusterswillcausedifferentclusteringresults.Expertiseaswellasclusteringmethodisimportanttotheexplanationofclusteringanalysis.Analysisofclusteringexam23
Sample19-3twenty-sevenpetroleumpitchworkersandpyro-furnacemanaresurveyedabouttheirages,lengthofserviceandsmokinginformation.Inaddition,detectionsofsero-P21,sero-P53,peripheralbloodlymphocyteSCE,thenumberofchromosomalaberrationandthenumberofcellsthathadhappenedchromosomalaberrationwerecarriedoutamongtheseworkers(table19-3).(P21mutiple=P21detectionvalue/themeanofcontrolgroupP21)Pleasesortthe27workersusinghierarchicalclusteringserviceablymethod.
Sample19-3twenty-seven24Table19-3resultofbio-markerdetectionandclusteringanalysisofpetroleumpitchworkersandpyro-furnacemanSampleNumberageLengthofservicesmokeRamus/dSero-P21P21MultipleP53SCENumberofchromosomeaberrationNumberofcellsofChromosomeaberrationresultofculsterin680.358.1144235122035102.761.436.84331352252027842.190.544.1133143272024511.930.4711.4596153822032472.560.8011.68551651313037102.920.3711.6022174091031942.510.4011.40551834172046583.670.4611.3533195029050193.950.4713.4510811042202074825.890.1213.110021157301538002.990.1910.762211236152024781.950.2510.00001133712038273.010.8210.50441145232029842.350.1611.153311552321037492.950.7211.45111011642273049413.890.7313.807611744272039483.110.3313.6516141184021533602.640.3711.40001193821529362.310.6911.401112044272068515.390.9912.28762214327039263.090.4711.95001222610343813.450.5211.807512337182071425.620.8511.81552242892026122.060.3711.65111252593026382.080.7812.251112634142043223.400.4115.005512750322028622.250.698.80221Table19-3resultofbio-marke25ThisexampleapplyminimumsimilaritycoefficientmethodoriginatingfromEuclideandistance,clusterequilibrationmethodandsumofsquaresofdeviationsmethodtoclusterthedata.Theresultsarelistedinchart19-3,chart19-4andchart19-5.Allthevariableshavebeenstandardizedbeforeanalysis.Thisexampleapplyminimum26
chart19-3thehierarchydendrogramof27petroleumpitchworkersandpyro-furnacemenusingminimumsimilaritycoefficientmethodchart19-3thehierarchyden27Chart19-4thehierarchydendrogramof27petroleumpitchworkersandpyro-furnacemenusingclusterequilibrationmethodChart19-4thehierarchydend28Chart19-5thehierarchydendrogramof27petroleumpitchworkersandpyro-furnacemenusingsumofsquaresofdeviationsmethod
Chart19-5thehierarchydendr29Theoutcomesofthethreekindsofclusteringarenotthesame,fromwhichwecanseedifferentwayshavedifferentefficiency.Thedifferencesaremoredistinctincaseofmorevariables.Soyou’dbetterselectefficientvariablesbeforeclusteringanalysis.Suchasthep21andp53inthisexample.Youcangetmoreinformationbyreadingtheclusteringchart.Theoutcomesofthethreek30Accordingtoexpertise,wecanseetheoutcomeofequilibrationclusteringismorereasonable.Theclassifyingresultisfilledinthelastcolumn.Workersnumbered{10,20,23}areclassifiedasoneclass;othersareanother.researchersfindthatworkersnumbered{10,20,23}areinhighriskofcancer.Number{10,20,23,8,16,26}areclusteredtogetheraccordingtothechartofsumofsquaresofdeviations,remindingthatworkersof8,16,26maybeinhighrisktoo.Accordingtoexpertise,we31DynamicclusteringIftherearetoomanysamplesunderclassified,hierarchyclusteringanalysisdemandsmorespacetostoresimilaritycoefficientmatrix.andisquiteinefficient.What’smore,samplescan’tbechangedoncetheyareclassified.Becauseoftheseshortcomings,statistsputforwarddynamicclusteringwhichcanovercometheinefficiencyandadjusttheclassifyingalongwiththeprocessofclustering.DynamicclusteringIfther32Theprincipleofdynamicclusteringanalysisis:firstly,selectseveralrepresentativesamples,calledcohesionpoint,asthecoreofeachclass;secondly,classifyothers.adjustthecoreofeachclassuntilclassifyingisreasonable.Themostcommonwayofdynamicclusteringanalysisisk-means,whichisquit
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 福州墨尔本理工职业学院《文化传播学》2025-2026学年期末试卷
- 安徽汽车职业技术学院《临床基础检验学技术》2025-2026学年期末试卷
- 集美工业职业学院《纺织工程》2025-2026学年期末试卷
- 阳泉师范高等专科学校《马克思主义市场经济学》2025-2026学年期末试卷
- 徐州医科大学《天然药物学》2025-2026学年期末试卷
- 人工智能与离散数学
- 民航售票员面试讲稿模板
- 重冶固体原料输送工安全行为评优考核试卷含答案
- 商店商品库存管理制度
- 手工皂制皂师保密意识评优考核试卷含答案
- 腹壁脓肿的护理查房
- (2023版)小学道德与法治一年级上册电子课本
- 多维度空间课件
- 景观生态学课件
- 奋战五十天扶摇九万里-高考50天冲刺主题班会 高考倒计时主题班会课件
- GB/T 13927-2022工业阀门压力试验
- 水下作业工程监理实施细则(工程通用版范本)
- GB/T 4393-2008呆扳手、梅花扳手、两用扳手技术规范
- GB/T 34825-2017航天项目工作说明编写要求
- GB/T 14353.10-2010铜矿石、铅矿石和锌矿石化学分析方法第10部分:钨量测定
- 小学语文人教二年级下册 有魔力的拟声词
评论
0/150
提交评论