Academician Hao Bailin's Bioinformatics Lecture

Finding Genes in the Rice Genome
Hao Bailin
T-Life Research Center, Fudan University
Beijing Genomics Institute, Academia Sinica
Institute of Theoretical Physics, Academia Sinica

Two Cultivars of Rice
  • Oryza sativa ssp. indica (籼稻)
  • Oryza sativa ssp. japonica (粳稻)
  • The difference was described in Xu Shen's (许慎) Shuowen Jiezi (说文解字), a Chinese dictionary of the Eastern Han Dynasty (2nd century AD).
  • J. H. Zhang et al., Rice cultivation of Jianhu Remains in Henan Province, Science J. (科学杂志), 53(4), 2002, 3 (in Chinese).

[Slide: several kilobases of raw, unannotated genomic DNA sequence in lowercase a, c, g, t, beginning cccaatatcttgcttcagcaagatattgggtatttc...]

Gene-Finding by Computer
Starting from the early 1980s:
  • "Ab initio" or "de novo" algorithms: GeneMark, GenScan, FgeneSH, Genie, based on gene-structure models and training data. (Our ongoing project: BGF, the BGI Gene Finder.)
  • Homology methods based on sequence alignment with known genes in databases.
  • Mixed approaches using both strategies: TwinScan.

Different Stages of Gene-Finding
  • Use all possible existing programs and services on the web with a public-domain or home-made genome viewer.
  • Write your own gene-finder, trained for the specific organism.
  • A dream for the time being: design a self-training and self-developing program "for any species" which would improve itself iteratively, starting from a few available reads, cDNAs, and ESTs.

Performance of Gene-Finders in Eukaryote Genomes
  • M. Q. Zhang, Nature Reviews Genetics, 3 (2002) 698-710 (mostly for the human genome):
      Nucleotide level: 80%
      Exon level: 45%
      Whole gene structure: 20%
  • FgeneSH and BGF for rice (our tests on 128 cDNA-confirmed single-gene genomic sequences):
      Nucleotide level: 90%
      Exon level: 60%
      Whole gene structure: 40%

[Diagram: the two DNA strands, one drawn 5' to 3' and the other 3' to 5']
  • Each strand carries the same amount of information, but different sets of genes.
  • The two strands are equivalent in information content.
  • The two strands are not equivalent in gene content.
  • Biological processing (duplication, transcription) goes from 5' to 3'.
  • Genes can be found on one strand at a time or on both strands at the same time: two-pass or one-pass programs.

[Diagram: genomic DNA (5'-UTR, start, stop, 3'-UTR) → transcribe (RNA Pol II + ...) → pre-mRNA → splice (spliceosome: U1, U2, U4, U5, U6 RNPs) → mRNA → translate (ribosome; initiation, elongation, and termination factors) → amino-acid sequence (protein primary sequence) → fold (chaperonins) → protein fold]

Three Scales of Search
  • Local: signals with a minimal signature (start, stop, splicing); movable signals (caps, promoters, poly-As, branching points, some very weak). Methods: clustering, discriminant analysis, various statistical models.
  • Intermediate: exons, introns, intergenic regions. Methods: Markov, semi-Markov, and hidden Markov models; intron length distribution.
  • Global: optimal combination of the above. Method: dynamic programming.

Gene structure written as a string of parentheses:
    ( ) 【 ( . ) ( . ) ( . ) 】 ( )
[Labels along the string: transcription start, translation start, translation end, transcription end]

Signals:
  • transcription start (downstream of promoters)
  • transcription end (upstream of poly-A)
  • 【 translation start (atg, 1/64 in a random sequence)
  • 】 translation end (tag, tga, taa; 3/64)
  • ( splicing donor site (minimal signal = gt, 1/16)
  • ) splicing acceptor site (ag, 1/16)
  • branching point (very weak; a)

Regions between pairs of signals:
  • 【 ( : first exon
  • ) ( : internal exon
  • ) 】 : last exon
  • ) 【 : non-coding 5' exon
  • ( . ) : intron
  • 】 ( : non-coding 3' exon (rare)
  • also: the non-coding 5' exon at the transcription start, the non-coding 3' exon (rare) at the transcription end, and the intergenic region

Signal and Sequence Models
  • eiid: equal-probability, independently and identically distributed
  • niid: non-equal-probability, independently and identically distributed
  • WWM: windowed weight matrix, etc.
  • MMn: Markov chain model of order n; homogeneous and period-3 MM5 are used in many gene-finders
  • Consensus sequence

Consensus Sequences
(numbers give the percent occurrence of the consensus base at each position)
  • TATAAT (Pribnow or -10 box): T80 A95 T45 A60 A50 T96
  • TTGACA (-35 box): T82 T84 G78 A65 C54 A45
  • CAAT (CAAT or -75 box): GGYCAATCT
  • TATA (TATA or Goldberg-Hogness box): TATAWAW
  • ATG (translation start point). However, in Aful: ATG 76%, GTG 22%, TTG 2%

GT-AG Rule for Intron
    5' splicing donor site:    exon ... A64 G73 | G100 T100 A62 A68 G84 T63 ...
    3' splicing acceptor site: ... (Py)12 N C65 A100 G100 | N ... exon

[Figure: exon and intron size distribution]

Algorithms
  • Sequence models and scores for signals
  • Dynamic programming: optimal parse
  • Hidden Markov model: geometric distribution of intron lengths
  • Semi-hidden Markov model: needs sequence-generating models and a length probability for each node
  • Language theory approach

Flow Chart of GenScan
  • Chris Burge (1996): a 27-state semi-HMM
  • A simpler model: 19-state
  • A model taking UTR introns into account: 35-state
[Figure legend: N, intergenic region; P, promoter; F, 5' UTR; E_sngl, single-exon gene; E_init, initial exon; E_k, phase k internal exon; E_term, terminal exon; T, 3' UTR; A, polyadenylation signal; and I_k, phase k intron; drawn for the (+) and (-) strands.]

Problems: Minor and Major
  • Ambiguity symbols (N, W, S, R, ...)
  • (1-p) at flanking D-type nodes
  • Indels and frame-shifts
  • Gradient effects in gene structure
  • Introns in 5'-UTRs and 3'-UTRs: leading to 35-state Markov models
  • Alternative splicing and sub-optimal paths
  • Limit of probabilistic models
  • Deterministic approaches

Dyck language: a language of nested parentheses
  • Many types of parentheses
  • Finite depth of nesting
  • Context-free language
  • Our case: only 3 types of parentheses, shallow nesting
  • Conjecture: may be a regular language

Two Test Datasets for Rice Gene-Finders
  • The 28469 japonica full-length cDNAs (Kikuchi et al., Science 301 (18 July 2003))
  • Select a high-quality subset without overlaps with publicly available cDNAs
  • A single-gene set: 500 sequences with one gene in each
  • A multi-gene set: 46
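
The chance figures quoted for the local signals (1/64 for atg, 3/64 for the three stop codons, 1/16 for gt or ag) follow from the eiid model, in which each base has probability 1/4. The sketch below is only an illustration of that arithmetic; the function names `eiid_probability` and `signal_positions` are made up for this example and do not come from the lecture or from any gene-finder.

```python
from fractions import Fraction

STOP_CODONS = ("tag", "tga", "taa")


def eiid_probability(motifs):
    """Chance that a fixed position starts one of the equal-length motifs
    when every base is drawn independently with probability 1/4 (eiid)."""
    lengths = {len(m) for m in motifs}
    if len(lengths) != 1:
        raise ValueError("motifs must all have the same length")
    k = lengths.pop()
    return Fraction(len(set(motifs)), 4 ** k)


def signal_positions(seq, motifs):
    """Return the 0-based positions in seq where any motif starts.
    Overlapping hits are reported; only one strand is scanned."""
    seq = seq.lower()
    motifs = set(motifs)
    k = len(next(iter(motifs)))
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in motifs]


if __name__ == "__main__":
    print(eiid_probability(["atg"]))       # 1/64
    print(eiid_probability(STOP_CODONS))   # 3/64
    print(eiid_probability(["gt"]))        # 1/16
    print(signal_positions("ccatggtaagtag", ["gt"]))
```

Because a minimal signal such as gt turns up by chance roughly once every 16 positions, position scanning alone produces mostly false candidates, which is why the slides combine signals with sequence models and a global parse.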
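
The Algorithms slide notes that an ordinary hidden Markov model gives a geometric distribution of intron lengths. A minimal sketch of why, assuming a single intron state that returns to itself with probability `p_stay` at each base (a hypothetical parameter name, not taken from the lecture): the state lasts exactly d bases with probability p_stay^(d-1) * (1 - p_stay).

```python
def geometric_length_pmf(p_stay, length):
    """Probability that a state with self-transition probability p_stay
    is occupied for exactly `length` consecutive bases."""
    if not 0 <= p_stay < 1:
        raise ValueError("p_stay must be in [0, 1)")
    if length < 1:
        return 0.0
    return p_stay ** (length - 1) * (1 - p_stay)


def mean_length(p_stay):
    """Mean of the geometric length distribution."""
    return 1.0 / (1.0 - p_stay)


def p_stay_for_mean(mean):
    """Self-transition probability that yields the requested mean length."""
    if mean < 1:
        raise ValueError("mean length must be at least 1")
    return 1.0 - 1.0 / mean


if __name__ == "__main__":
    p = p_stay_for_mean(4)
    print(p, mean_length(p))
    print([geometric_length_pmf(p, d) for d in range(1, 6)])
```

The distribution always peaks at length 1 and decays monotonically, whatever the mean. A semi-hidden Markov model avoids this constraint by attaching an explicit length probability to each node, as the same slide states.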
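
The Dyck-language slide conjectures that, with only three types of parentheses and shallow nesting, the gene-structure language may be regular. One supporting observation is that a Dyck language with a fixed bound on nesting depth needs only finitely many stack configurations, so a finite automaton can recognize it. The sketch below checks membership in such a bounded-depth language; the ASCII pairs `()`, `[]`, `{}` and the name `in_bounded_dyck` are assumptions made for illustration and are not the lecture's own symbols.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}  # three hypothetical bracket types


def in_bounded_dyck(text, max_depth, pairs=PAIRS):
    """Return True if the brackets in text are properly nested and never
    nest deeper than max_depth. Non-bracket characters are ignored."""
    closers = set(pairs.values())
    stack = []  # closers still expected, innermost last
    for ch in text:
        if ch in pairs:
            stack.append(pairs[ch])
            if len(stack) > max_depth:
                return False  # nesting deeper than the allowed bound
        elif ch in closers:
            if not stack or stack.pop() != ch:
                return False  # unmatched or wrongly typed closer
    return not stack  # every opener must have been closed


if __name__ == "__main__":
    print(in_bounded_dyck("()[(.)(.)(.)]()", max_depth=2))
    print(in_bounded_dyck("([)]", max_depth=2))
```

With `max_depth` fixed, the stack can hold only a bounded number of symbols from a three-letter alphabet, so the set of reachable stacks is finite and each one can serve as an automaton state.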
