1. Microarray Data Analysis: class discovery and class prediction (clustering and discrimination).

2. Gene expression profiles
Many genes show definite changes of expression between conditions; these patterns are called gene profiles.

3. Motivation (1): the problem of finding patterns
It is common to have hybridizations where conditions reflect temporal or spatial aspects: yeast cycle data, tumor data evolution after chemotherapy, CNS data in different parts of the brain. Interesting genes may be those showing patterns associated with changes. Our problem is to distinguish interesting or real patterns from meaningless variation, at the level of the gene.

4. Finding patterns: two approaches
If patterns already exist: profile comparison (distance analysis) — find the genes whose expression fits specific, predefined patterns, or find the genes whose expression follows the pattern of a predefined gene or set of genes.
If we wish to discover new patterns: cluster analysis (class discovery) — carry out some kind of exploratory analysis to see what expression patterns emerge.

5. Motivation (2): tumor classification
A reliable and precise classification of tumours is essential for successful diagnosis and treatment of cancer. Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables. In spite of recent progress, there are still uncertainties in diagnosis; also, it is likely that the existing classes are heterogeneous. DNA microarrays may be used to characterize the molecular variations among tumours by monitoring gene expression on a genomic scale. This may lead to a more reliable classification of tumours.

6. Tumor classification (cont.)
There are three main types of statistical problems associated with tumor classification: the identification of new, unknown tumor classes using gene expression profiles (cluster analysis); the classification of malignancies into known classes (discriminant analysis); and the identification of "marker" genes that characterize the different tumor classes (variable selection).

7. Cluster and discriminant analysis
These techniques group, or equivalently classify, observational units on the basis of measurements. They differ according to their aims, which in turn depend on the availability of a pre-existing basis for the grouping. In cluster analysis (unsupervised learning, class discovery) there are no predefined groups or labels for the observations. Discriminant analysis (supervised learning, class prediction) is based on the existence of groups (labels).

8. Clustering microarray data
Clustering can be applied to genes (rows), mRNA samples (columns), or both at once. Cluster samples to identify new cell or tumour subtypes; cluster rows (genes) to identify groups of co-regulated genes. We can also cluster genes to reduce redundancy, e.g. for variable selection in predictive models.

9. Advantages of clustering
Clustering leads to readily interpretable figures. Clustering strengthens the signal when averages are taken within clusters of genes (Eisen). Clustering can be helpful for identifying patterns in time or space. Clustering is useful, perhaps essential, when seeking new subclasses of cell samples, tumors, etc.

10. Applications of clustering (1)
Alizadeh et al. (2000), "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling". Three subtypes of lymphoma (FL, CLL and DLBCL) have different genetic signatures (81 cases total). The DLBCL group can be partitioned into two subgroups with significantly different survival (39 DLBCL cases).

11. Clusters on both genes and arrays
Taken from the Nature (February 2000) paper by Alizadeh, A. et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling".

12. Discovering tumor subclasses
DLBCL is clinically heterogeneous. Specimens were clustered based on their expression profiles of GC B-cell associated genes. Two subgroups were discovered: GC B-like DLBCL and activated B-like DLBCL.
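The DLBCL example above clusters tumour samples by their expression profiles to discover subclasses. The sketch below is a minimal illustration of that idea, not taken from the slides: it hierarchically clusters the samples of a simulated expression matrix using one-minus-correlation as the dissimilarity and average (mean) linkage; the matrix, its dimensions and the two-cluster cut are assumptions for illustration.

```python
# Minimal sketch: hierarchical clustering of samples (class discovery).
# The expression matrix is simulated; in practice rows would be genes and
# columns would be mRNA samples from a real microarray experiment.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_genes, n_samples = 200, 20
expr = rng.normal(size=(n_genes, n_samples))
expr[:50, 10:] += 2.0          # pretend half the samples form a distinct subclass

# Dissimilarity between samples: 1 - Pearson correlation of their profiles.
corr = np.corrcoef(expr.T)     # samples x samples correlation matrix
dissim = 1.0 - corr

# Condensed distance vector (upper triangle) as expected by scipy's linkage().
iu = np.triu_indices(n_samples, k=1)
condensed = dissim[iu]

tree = linkage(condensed, method="average")         # agglomerative, mean (average) link
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the dendrogram into 2 groups
print(labels)
```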
13. Applications of clustering (2)
A naïve but nevertheless important application is assessment of experimental design. If one has an experiment with different experimental conditions, and in each of them there are biological and technical replicates, we would expect the more homogeneous groups to cluster together: technical replicates, then biological replicates, then the different groups. Failure to cluster this way suggests bias due to experimental conditions more than to existing differences.

14. Basic principles of clustering
Aim: to group observations that are "similar" based on predefined criteria. Issues: which genes (arrays) to use; which similarity or dissimilarity measure; which clustering algorithm. It is advisable to reduce the number of genes from the full set to some more manageable number before clustering; the basis for this reduction is usually quite context-specific (see later example).

15. Two main classes of measures of dissimilarity
Correlation; distance (Manhattan, Euclidean, Mahalanobis distance, and many more).

16. Two basic types of methods
Partitioning and hierarchical.

17. Partitioning methods
Partition the data into a pre-specified number k of mutually exclusive and exhaustive groups; iteratively reallocate the observations to clusters until some criterion is met (e.g. minimize within-cluster sums of squares). Examples: k-means, self-organizing maps (SOM), PAM, etc.; fuzzy versions need a stochastic model, e.g. Gaussian mixtures.

18. Hierarchical methods
Hierarchical clustering methods produce a tree or dendrogram. They avoid specifying how many clusters are appropriate by providing a partition for each k, obtained by cutting the tree at some level. The tree can be built in two distinct ways: bottom-up (agglomerative clustering) and top-down (divisive clustering).

19. Agglomerative methods
Start with n clusters; at each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters. Between-cluster dissimilarity measures: mean link (average of pairwise dissimilarities), single link (minimum of pairwise dissimilarities), complete link (maximum of pairwise dissimilarities), and distance between centroids.

20. Distance between centroids
[Figure: illustration of the distance between centroids and of single-link, complete-link and mean-link dissimilarities.]

21. Divisive methods
Start with only one cluster; at each step, split a cluster into two parts, splitting to give the greatest distance between the two new clusters. Advantages: obtains the main structure of the data, i.e. focuses on the upper levels of the dendrogram. Disadvantages: computational difficulties when considering all possible divisions into two groups.

22. Agglomerative clustering: illustration
[Figure: five points (1–5) in two-dimensional space and the dendrogram built as the closest clusters are successively merged.]

23. Agglomerative clustering: tree re-ordering
[Figure: the same five-point example, showing how the leaves of the dendrogram can be re-ordered.]

24. Partitioning or hierarchical?
Partitioning — advantages: optimal for certain criteria; genes are automatically assigned to clusters. Disadvantages: needs an initial k; often requires long computation times; all genes are forced into a cluster.
Hierarchical — advantages: faster computation; visual. Disadvantages: unrelated genes are eventually joined; rigid, cannot correct later for erroneous decisions made earlier; hard to define clusters.

25. Hybrid methods
Mix elements of partitioning and hierarchical methods: bagging (Dudoit & Fridlyand, 2002), HOPACH (van der Laan & Pollard, 2001).

26. Three generic clustering problems
Three important tasks, which are generic, are: (1) estimating the number of clusters; (2) assigning each observation to a cluster; (3) assessing the strength/confidence of the cluster assignments for individual observations. They are not equally important in every problem.

27. Estimating the number of clusters using the silhouette
Define the silhouette width of an observation as S = (b − a) / max(a, b), where a is the average dissimilarity to all the points in its own cluster and b is the minimum distance to the objects in the other clusters. Intuitively, objects with large S are well clustered, while those with small S tend to lie between clusters. How many clusters? Perform clustering for a sequence of values of the number of clusters k and choose the number of clusters corresponding to the largest average silhouette.
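The silhouette rule on slide 27 can be sketched in a few lines: cluster for a range of candidate k (here with k-means as the partitioning method), compute the average silhouette width for each, and keep the k with the largest value. The simulated data, the candidate range of k, the Euclidean dissimilarity and the use of scikit-learn are assumptions for illustration, not part of the slides.

```python
# Minimal sketch: choose the number of clusters k by the largest average
# silhouette width, using k-means as the partitioning method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Simulated samples in "gene space": three groups of 15 samples, 50 genes each.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(15, 50)) for c in (-3, 0, 3)])

scores = {}
for k in range(2, 7):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean S over all observations

best_k = max(scores, key=scores.get)
print(scores, "-> chosen k =", best_k)
```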
The issue of the number of clusters in the data is most relevant for novel class discovery, i.e. for clustering samples.

28. Estimating the number of clusters using the bootstrap
There are other resampling-based (e.g. Dudoit and Fridlyand, 2002) and non-resampling-based rules for estimating the number of clusters (for a review see Milligan and Cooper, 1978, and Dudoit and Fridlyand, 2002). The bottom line is that none work very well in complicated situations and, to a large extent, clustering lies outside the usual statistical framework. It is always reassuring when you are able to characterize a newly discovered cluster using information that was not used for the clustering.

29. Limitations
Cluster analyses are usually outside the normal framework of statistical inference; they are less appropriate when only a few genes are likely to change, and they need lots of experiments. It is always possible to cluster even if there is nothing going on. Clustering is useful for learning about the data, but it does not provide biological truth.

30. Discrimination (or class prediction, or supervised learning).

31. Motivation
A study of gene expression on breast tumours (NHGRI, J. Trent). How similar are the gene expression profiles of BRCA1, BRCA2 and sporadic breast cancer patient biopsies? Can we identify a set of genes that distinguish the different tumor types? Tumors studied: 7 BRCA1, 8 BRCA2, 7 sporadic.

32. Discrimination
A predictor or classifier for K tumor classes partitions the space X of gene expression profiles into K disjoint subsets A1, …, AK, such that for a sample with expression profile x = (x1, …, xp) ∈ Ak the predicted class is k. Predictors are built from past experience, i.e. from observations which are known to belong to certain classes. Such observations comprise the learning set L = {(x1, y1), …, (xn, yn)}. A classifier built from a learning set L is denoted by C(·, L): X → {1, 2, …, K}, with the predicted class for observation x being C(x, L).

33. Discrimination and allocation
[Diagram: a learning set (data with known classes) is fed to a classification technique, producing a classification rule; the rule is applied to data with unknown classes to give a class assignment (discrimination / prediction).]

34. Bad prognosis: recurrence within 5 years; good prognosis: no recurrence within 5 years. Reference: L. van 't Veer et al. (2002), "Gene expression profiling predicts clinical outcome of breast cancer", Nature, January. Objects: arrays; feature vectors: gene expression; predefined classes: clinical outcome. A new array is run through the classification rule learned from the learning set and is assigned a class, e.g. "good prognosis" (metastasis-free beyond 5 years).

35. B-ALL, T-ALL, AML. Reference: Golub et al. (1999), "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring", Science 286(5439): 531–537. Objects: arrays; feature vectors: gene expression; predefined classes: tumor type. A new array is run through the classification rule learned from the learning set and is assigned a class, e.g. T-ALL.

36. Components of class prediction
Choose a method of class prediction (LDA, KNN, CART, …). Select the genes on which the prediction will be based (feature selection: which genes will be included in the model?). Validate the model: use data that have not been used to fit the predictor.

37. Prediction methods.

38. Choose a prediction model
Prediction methods: Fisher linear discriminant analysis (FLDA) and its variants (DLDA, Golub's gene voting, compound covariate predictor); nearest neighbor; classification trees; support vector machines (SVMs); neural networks; and many more.

39. Fisher linear discriminant analysis
First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936). Fisher linear discriminant analysis (FLDA) consists of (i) finding linear combinations xa of the gene expression profiles x = (x1, …, xp) with large ratios of between-group to within-group sums of squares (the discriminant variables), and (ii) predicting the class of an observation x by the class whose mean vector is closest to x in terms of the discriminant variables.

40. FLDA
[Figure: illustration of FLDA.]
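A minimal sketch of FLDA as a class predictor follows, using scikit-learn's LinearDiscriminantAnalysis on a small simulated learning set; the data, the train/test split and the library choice are assumptions rather than part of the slides. With far more genes than samples the within-group covariance estimate is singular, so in practice FLDA would be applied after reducing the gene set (feature selection is discussed later in the deck).

```python
# Minimal sketch: Fisher linear discriminant analysis (FLDA) as a class predictor.
# Data are simulated; rows of X are samples (arrays), columns are (pre-selected) genes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_per_class, n_genes = 20, 10
X = np.vstack([rng.normal(loc=mu, size=(n_per_class, n_genes)) for mu in (0.0, 1.5)])
y = np.repeat([0, 1], n_per_class)           # two known tumor classes

X_learn, X_test, y_learn, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

flda = LinearDiscriminantAnalysis()          # linear combinations maximizing
flda.fit(X_learn, y_learn)                   # between/within group separation
print("test accuracy:", flda.score(X_test, y_test))
```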
41. Classification rule: maximum likelihood discriminant rule
A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest. For known class-conditional densities p_k(X), the maximum likelihood (ML) discriminant rule predicts the class of an observation X by C(X) = argmax_k p_k(X).

42. Gaussian ML discriminant rules
For multivariate Gaussian (normal) class densities X | Y = k ~ N(μ_k, Σ_k), the ML classifier is C(X) = argmin_k { (X − μ_k)′ Σ_k⁻¹ (X − μ_k) + log |Σ_k| }. In general this is a quadratic rule (quadratic discriminant analysis, or QDA). In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated by the corresponding sample quantities.

43. ML discriminant rules: special cases
DLDA (diagonal linear discriminant analysis): the class densities have the same diagonal covariance matrix Δ = diag(σ1², …, σp²). DQDA (diagonal quadratic discriminant analysis): the class densities have different diagonal covariance matrices Δ_k = diag(σ1k², …, σpk²). Note: the weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (with a different variance calculation).

44. Classification with SVMs
A generalization of the idea of separating hyperplanes in the original space: linear boundaries between classes in a higher-dimensional space lead to non-linear boundaries in the original space. (Adapted from the internet.)

45. Nearest neighbor classification
Based on a measure of distance between observations, e.g. Euclidean distance or one minus correlation. The k-nearest-neighbor rule (Fix and Hodges, 1951) classifies an observation x as follows: find the k observations in the learning set closest to x, then predict the class of x by majority vote, i.e. choose the class that is most common among those k observations. The number of neighbors k can be chosen by cross-validation (more on this later).

46. Nearest neighbor rule
[Figure: illustration of the nearest neighbor rule.]

47. Classification trees
Binary tree-structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself. Each terminal subset is assigned a class label, and the resulting partition of X corresponds to the classifier.

48. Classification trees
[Figure: an example tree. The root node (10 cases of class 1, 10 of class 2) is split on Gene 1; the descendant nodes are split further on Gene 2 and Gene 3, and each terminal node is assigned the class that is in the majority among its cases.]

49. Three aspects of tree construction
Split selection rule (example: at each node, choose the split maximizing the decrease in impurity, e.g. Gini index, entropy, misclassification error). Split-stopping: the decision to declare a node terminal or to continue splitting (example: grow a large tree, prune it to obtain a sequence of subtrees, then use cross-validation to identify the subtree with the lowest misclassification rate). The assignment of each terminal node to a class (example: for each terminal node, choose the class minimizing the resubstitution estimate of the misclassification probability, given that a case falls into this node). [Supplementary slide]

50. Other classifiers include
Support vector machines, neural networks, Bayesian regression methods, projection pursuit.

51. Aggregating predictors
Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set. In classification, the multiple versions of the predictor are aggregated by voting.

52. Aggregating predictors
(1) Bagging: bootstrap samples of the same size as the original learning set — the non-parametric bootstrap (Breiman, 1996) or convex pseudo-data (Breiman, 1998). (2) Boosting (Freund and Schapire, 1997; Breiman, 1998): the data are resampled adaptively so that the weights in the resampling are increased for those cases most often misclassified; the aggregation of predictors is done by weighted voting.

53. Prediction votes
For aggregated classifiers, prediction votes assessing the strength of a prediction may be defined for each observation. The prediction vote (PV) for an observation x is defined to be PV(x) = max_k Σ_b w_b I(C(x, L_b) = k) / Σ_b w_b. When the perturbed learning sets are given equal weights, i.e. w_b = 1, the prediction vote is simply the proportion of votes for the "winning" class, regardless of whether it is correct or not. Prediction votes belong to [0, 1].
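To make slides 51–53 concrete, here is a minimal sketch of bagging classification trees and of the prediction vote PV(x) with equal weights w_b = 1, i.e. the proportion of trees voting for the winning class. The simulated data, the number of bootstrap resamples B and the use of scikit-learn decision trees are assumptions for illustration.

```python
# Minimal sketch: bagging trees and prediction votes (equal weights w_b = 1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n_per_class, n_genes, B = 25, 20, 100
X = np.vstack([rng.normal(loc=mu, size=(n_per_class, n_genes)) for mu in (0.0, 1.0)])
y = np.repeat([0, 1], n_per_class)
x_new = rng.normal(loc=0.8, size=(1, n_genes))   # a new, unlabeled sample

votes = np.zeros(2)
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))   # non-parametric bootstrap resample L_b
    tree = DecisionTreeClassifier(random_state=b).fit(X[idx], y[idx])
    votes[tree.predict(x_new)[0]] += 1           # each tree casts one vote

predicted_class = int(np.argmax(votes))
prediction_vote = votes.max() / votes.sum()      # PV(x): proportion for winning class
print(predicted_class, prediction_vote)
```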
54. Another component in classification rules: aggregating classifiers
[Diagram: the training set (X1, X2, …, X100) is resampled many times (resample 1, …, resample 500); a classifier is built on each resample and the classifiers are combined into an aggregate classifier.] Examples: bagging, boosting, random forest.

55. Aggregating classifiers: bagging
[Diagram: the training set of arrays (X1, X2, …, X100) is bootstrap-resampled 500 times (resample b: X*1, X*2, …, X*100); a tree is grown on each resample and each tree votes on the test sample. For example, 90% of the trees vote class 1 and 10% vote class 2, so the aggregate prediction is class 1.]

56. Feature selection.

57. Feature selection
A classification rule must be based on a set of variables which contribute useful information for distinguishing the classes. This set will usually be small, because most variables are likely to be uninformative. Some classifiers (like CART) perform automatic feature selection, whereas others (like LDA or KNN) do not.

58. Approaches to feature selection
Filter methods perform explicit feature selection prior to building the classifier: one gene at a time, select features based on the value of a univariate test; the number of genes or the test p-value are the parameters of the feature-selection method. Wrapper methods perform feature selection implicitly, as a part of building the classifier: in classification trees, features are selected at each step based on the reduction in impurity, and the number of features is determined by pruning the tree using cross-validation.

59. Why select features?
It leads to better classification performance by removing variables that are noise with respect to the outcome; it may provide useful insights into the etiology of a disease; and it can eventually lead to diagnostic tests (e.g. a breast cancer chip).

60. Why select features? Correlation plot
[Figure: correlation plots for the leukemia data (3 classes), with no feature selection versus the top 100 features selected based on variance.]

61. Performance assessment.

62. Performance assessment
Before using a classifier for prediction or prognosis, one needs a measure of its accuracy. The accuracy of a predictor is usually measured by the misclassification rate: the percentage of individuals belonging to a class which are erroneously assigned to another class by the predictor. An important problem arises here: we are not interested in the ability of the predictor to classify the current samples; one needs to estimate future performance based on what is available.

63. Estimating the error rate
Using the same data set on which we have built the predictor to estimate the misclassification rate may lead to erroneously low values due to overfitting; this is known as the resubstitution estimator. We should use a completely independent data set to evaluate the classifier, but it is rarely available. We use alternative approaches such as the test-set estimator and cross-validation.

64. Performance assessment (I)
Resubstitution estimation: compute the error rate on the learning set; problem: downward bias. Test-set estimation proceeds in two steps: divide the learning set into two subsets, L and T; build the classifier on L and compute the error rate on T. This approach is not free from problems: L and T must be independent and identically distributed; problem: reduced effective sample size.

65. Diagram of performance assessment (I)
[Diagram: resubstitution estimation (the training set is used both to build the classifier and to assess it) versus test-set estimation (the classifier is built on the training set and assessed on an independent test set).]

66. Performance assessment (II)
V-fold cross-validation (CV) estimation: the cases in the learning set are randomly divided into V subsets of (nearly) equal size; classifiers are built leaving one set out, and the test-set error rates computed on the left-out sets are averaged. Bias–variance trade-off: smaller V can give larger bias but smaller variance; computationally intensive. Leave-one-out cross-validation (LOOCV) is the special case V = n; it works well for stable classifiers (k-NN, LDA, SVM).

67. Diagram of performance assessment (II)
[Diagram: resubstitution estimation and test-set estimation, compared with cross-validation, in which the learning set is repeatedly split into a CV learning set and a CV test set and a classifier is built on each split.]
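A minimal sketch of the error-rate estimators just described: the resubstitution estimate (error on the data used to fit the classifier) versus a V-fold cross-validation estimate (V = 5), for a k-nearest-neighbor classifier on simulated data. The data, V, k and the scikit-learn utilities are assumptions for illustration.

```python
# Minimal sketch: resubstitution vs. V-fold cross-validation error for k-NN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(4)
n_per_class, n_genes = 30, 15
X = np.vstack([rng.normal(loc=mu, size=(n_per_class, n_genes)) for mu in (0.0, 0.8)])
y = np.repeat([0, 1], n_per_class)

knn = KNeighborsClassifier(n_neighbors=3)

# Resubstitution estimate: error on the same data used to fit (downward biased).
resub_error = 1.0 - knn.fit(X, y).score(X, y)

# V-fold CV estimate: average error over the left-out folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_error = 1.0 - cross_val_score(knn, X, y, cv=cv).mean()

print("resubstitution error:", resub_error, " 5-fold CV error:", cv_error)
```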
68. Performance assessment (III)
Common practice is to do feature selection using the whole learning set and to do CV only for model building and classification. However, usually the features are unknown and the intended inference includes feature selection; CV estimates obtained as above then tend to be downward biased. Features (variables) should be selected only from the learning set used to build the model, and not from the entire set (see the sketch at the end of this section).

69. Examples

70. Reference 1, a retrospective study: L. van 't Veer et al., "Gene expression profiling predicts clinical outcome of breast cancer", Nature (2002).
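Returning to the point of slide 68: when genes are filtered on the full data set and CV is applied only afterwards, the CV error estimate is downward biased, so the selection step must be repeated inside each CV fold. The sketch below illustrates one way to do this, wrapping a univariate filter and a k-NN classifier in a scikit-learn Pipeline so that gene selection is refit on each CV learning set; the simulated data, the number of selected genes and the choice of classifier are assumptions, not part of the slides.

```python
# Minimal sketch: feature (gene) selection nested inside cross-validation.
# Wrapping the filter in a Pipeline makes SelectKBest refit on each CV learning
# set, so the left-out fold never influences which genes are chosen.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(5)
n_per_class, n_genes = 25, 500                  # many genes, few informative
X = rng.normal(size=(2 * n_per_class, n_genes))
y = np.repeat([0, 1], n_per_class)
X[y == 1, :10] += 1.0                           # only the first 10 genes differ

pipe = Pipeline([
    ("filter", SelectKBest(score_func=f_classif, k=10)),  # one-gene-at-a-time F test
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_error = 1.0 - cross_val_score(pipe, X, y, cv=cv).mean()
print("CV error with selection inside CV:", cv_error)
```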
