Bioinformatics course slides (original English version)

Transcript analysis and reconstruction
Brazil 2001

Genes
  • Why are there only a few tens of thousands of genes in the human genome?
  • How do genes express themselves to manufacture the proteome?
  • How can available sequence information be processed in order to deliver understanding of gene expression?

Genomic expression
Within eukaryotes, genes have shared basic characteristics. They have single or multiple exons and introns distributed along the gene in coding and non-coding regions, with:
  • 5' flanking region with transcription regulation signals
  • Transcription initiation start site (5')
  • Initiation codon for protein coding sequence
  • Exon-intron boundaries with splice site signals at the boundaries
  • Termination codon for protein coding sequence
  • 3' signals for regulation and polyadenylation
[Figure: gene structure, pre-mRNA and mature mRNA. Labels: 5' flanking region, GC box, CAAT, TATA, transcription initiation site, initiation codon, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, GT/AG splice signals, stop codon, AATAAA, poly(A) addition site.]

Gene Expression
Transcription products can vary:
  • Transcription initiation at the start site (TSS)
  • Exon length
  • Exon presence/absence in the mature transcript
  • Alternate transcription termination and polyadenylation

Examples of alternative splicing
  • Alternative donor and acceptor splice sites
  • Alternative polyadenylation
  • Exon skipping
[Figure: exon skipping, pre-mRNA and mature mRNA. Labels: GC box, CAAT, TATA, transcription initiation, initiation codon, Exon 1, Intron 1, Exon 2 SKIP, Exon 3, GT/AG splice signals, stop codon, AATAAA, poly(A) addition site(s), 3' flanking region.]

Capturing expressed transcripts
Databases - Sequences
  • dbEST
  • Several collapsed datasets: TIGR-THC, Allgenes, UniGene, BodyMap, STACK
  • Several more specialised
  • Genome sequence as it appears

Expression Capture
  • Serial Analysis of Gene Expression (SAGE): DNA fragments that act as unique markers of gene transcripts. Assaying the number of each marker in a set of sequences yields a measure of gene expression.
  • Array: laydown of sequence clones to provide an organised series for hybridisation.

Resolution of Captured Expression
  • ESTs: low resolution, broad capture, provide the template for SAGE and arrays.
  • SAGE: medium resolution, needs a template, noise can be an issue, stoichiometry is revealed but standardisation is a problem.
  • Array: high resolution, needs a template, noise, stoichiometric resolution highest, standardisation is a problem.

What is an EST?
[Figure: partial cDNA transcripts (3', 5', AAAAA tail) in a clone/sequencing vector with CLONEID; forward and reverse sequencing primers give the 5' EST and 3' EST; 3' ends overlapping, 5' ends of staggered length due to polymerase processivity.]

What potential do ESTs hold?
  • Expression counts
  • Consensus sequences
  • Alternate expression-form characterisation
  • Identification of genes expressed in a pilot gene discovery project
  • Identification of genes specifically expressed in a chosen library or tissue

Use of Transcripts in Completed genomes
  • Identification of genes
  • Exon boundaries
  • Alternate transcripts
  • Genomic annotation
  • Expression sites of encoded genes
  • Comparative genomics

EST data quality
T27784 g609882 | T27784 CLONE_LIB: Human Endothelial cells. LEN: 337 b.p. FILE gbest3.seq 5-PRIME DEFN: EST16067 Homo sapiens cDNA 5' end
AAGACCCCCGTCTCTTTAAAAATATATATATTTTAAATATACTTAAATATATATTTCTAATATCTTTAAATATATATATATATTTNAAAGACCAATTTATGGGAGANTTGCACACAGATGTGAAATGAATGTAATCTAATAGANGCCTAATCAGCCCACCATGTTCTCCACTGAAAAATCCTCTTTCTTTGGGGTTTTTCTTTCTTTCTTTTTTGATTTTGCACTGGACGGTGACGTCAGCCATGTACAGGATCCACAGGGGTGGTGTCAAATGCTATTGAAATTNTGTTGAATTGTATACTTTTTCACTTTTTGATAATTAACCATGTAAAAAATG
EST is poor quality data with contaminants.
[Figure labels: Vector, Repeat, MASK.]
Individual items are prone to error, but an entire collection contains valuable genetic information.

Overview of clustering and consensus generation
[Figure: pipeline stages: Pre-processing, Initial clustering, Assembly, Alignment processing, Cluster joining, Output. Other labels: Repeats, Vector, Mask, Alignments, Consensi, Expressed forms.]

Transcript reconstruction

What is an EST cluster?

Loose and stringent clustering
  • Stringent - greater fidelity, lower coverage
      • One pass
      • Shorter consensi
      • Lower inclusion rate of expression-forms
  • Loose - lower fidelity, higher coverage
      • Multi-pass
      • Longer consensus sequences, but paralogs need attention
      • Comprehensive inclusion of expression-forms

Supervised clustering
Template for hybridisation is a transcript composite derived from:
  • A captured full length mRNA
  • A composite exon construct from a genomic sequence
  • An assembled EST cluster consensus
[Figure: spectrum from "Clean, short and tight" to "Long and loose". Labels: TIGR-THC, UniGene, STACK.]

Data apprehension and input format
  • Sources: in-house, public, proprietary
  • Accession / sequence-run ID
  • Location/orientation
  • Source clone
  • Source library and conditions

Pre-processing
  • Minimum informative length
  • Low complexity regions
  • Removal of common contaminants: vector, repeats, mitochondrial, xenocontaminants
  • XBLAST, RepeatMasker, VecBase and others
  • BLIND masking
  • Pre-clustering vs known transcripts (data reduction)

Initial clustering
  • Stepwise clustering
  • Multistate: sequence identity, annotation, verification

Assembly
  • Including chromatograms - SNPs and paralogs
  • PHRAP and CAP series
  • Multiple assemblies can fragment from one input cluster: fidelity, alt. forms, error

Alignment processing
  • Consensus generation
  • Alternate forms
  • Errors
  • Choosing the correct consensus

Cluster joining
  • Clone joining
  • Choosing to accept a clone annotation: 1 clone ID, 2 clone IDs
  • Available parents: mRNA (incomplete/alternate), composite (constructed from genomic)
  • Intronic sequence 2%

Output
  • Alignment: alternate expression-forms, polymorphisms, error assessment
  • Cluster: raw cluster membership, contextual links
  • Formats: FASTA, GenBank, EMBL

Alignment scoring methods:
  • Correct position of sequence elements against each other maximizes some score
  • BLAST and FASTA: heuristic, cutoff and identity, pairwise alignment, fast

EST clustering methods
  • EST sequence is littered with errors, stutters, in-dels and re-arrangements
  • The alignment approach is sensitive to these
  • 3' only comparison

Non-alignment based scoring methods: D2-cluster
  • No alignment, so a speedup
  • Sensitivity improved by multiplicity measure
  • Low weight to low complexity
  • Very error tolerant
  • Transitive closure
  • 96% ID over 100 or 150 bases
[Figure: word table (example words: acggtc, cggtca) and multiplicity comparison (counts 3, 3, 2; d^2 = 4).]

TIGR_ASSEMBLER
  • THC_BUILD: BLAST/FASTA identify all overlaps, which are stored.
  • TIGR Assembler then uses rapid oligonucleotide comparison and assembles non-repeat overlaps (95% ID over 40 bp).
  • Matching constraints on sequence ends
  • Minimum sequence ID within a sequence group - more fragmented as a result
  • Other TIGR approaches are similar

UniGene
UniGene approach
  • Originally 3' only + mRNA
  • Common words of length 13 separated by no more than 2 bases
  • ID
  • Annotation
  • Shared clone ID
  • GenBank, genomic and dbEST
  • DUST
  • 100 bp min
  • MEGABLAST
Wagner et al. CSH 1999

Fragmentation Comparison

Alignment Analysis
[Figure labels: three subassemblies; potential alternate expression form.]

Orthologs and Paralogs
  • Orthologs: genes that share the same ancestral gene and perform the same biological function in different species, but have diverged in sequence makeup due to selective evolution.
  • Paralogs: genes within the same genome that share an ancestral gene and perform diverse biological functions.

Needs
  • Functional assignments
  • Expression states of alternate forms and their sites of expression
  • Exon level resolution of expression
  • Representative forms for application to arrays
  • Physical gene locations
  • Relationship to disease

Exploration
  • Availability of genomic sequence and partial transcription products means characterisation of alternate transcription can begin in earnest.
  • Contribution to variation of expressed products and effects on biology are likely to be significant.

How to trap useful genome sequence to manufacture a genome virtually?
Gene level approach
  • Trap Expressed Sequence Tags
  • 1.8 M tags, 35-100K genes
  • Combine to form virtual genes
  • Annotate and analyse these genes
  • Correlate with phenotype(s) = disease
  • Understand the expression basis of disease

Reconstruction of transcripts
  • Derive understanding of expressed gene products
      • Use of expressed sequence data requires complex processing
      • Processed datasets are badly needed
  • Capture a first glimpse of a genome's activities
      • Genomic level sequence is the final state, but its products can provide powerful information very early.
  • Characterize underlying gene structure
      • Exon boundaries are difficult to define accurately and consistently
  • Assess effect of an intervention on gene expression products
      • A rough EST profile is a quick identifier of key expression products
  • Associate isoforms with expression states
      • Expression forms vary, how and when?
      • What does a full length cDNA really mean?

Why is transcript data a problem?

Transcript Data
  • Full length cDNA: GenBank has many entries that confuse full length with complete coding sequence
  • Partial cDNA: redundant partial cDNA sequences
  • Exon composite: all confirmed exons combined to form a complete transcript
  • Expressed Sequence Tag: single pass sequence
  • Genome Survey Sequence: single pass sequence. Small genomes contain more coding sequences in GSS than larger genomes.

Genome Sequence: Characterizing underlying gene structure
  • Fanfare fragment
  • First pass annotated
  • Exon boundaries
  • Predicted
  • Cross species conservation
  • Transcript confirmation
  • Composite exon transcript
  • How do you define a transcript?

STACKing approach
  • Distill quality from quantity
  • Accurate consensus sequence representation
  • Identify expression variation, both spatial and developmental
  • Facilitate better understanding of gene expression
  • Exon-level gene expression profile
  • Integration of expression with genome sequence
  • Confirm and discover expressed exons
  • Provide gene candidacy delivery
  • Integrate with phenotype

STACKPACK
  • C+, MySQL, HTML, Java

stackPACK Schema
ALL alternate expression forms are saved and accessible.

WebProbe - View by clonelink accession
Entering a project name and cluster accession number displays the clonelink Consensus View.
[Figure labels: Clonelink cluster ID, Cluster ID, Contig ID, Input EST accession numbers, Link to corresponding UniGene entry.]

Alignment and Analysis
  • PHRAP alignment
      • First alignment created
      • All ESTs in one alignment
  • Alignment analysis
      • CRAW used to look for subassemblies
      • Identifies potential alternate expression forms
  • CRAW alignment
      • Final alignment for each subassembly
  • Consensus analysis
      • Statistics used to select best consensus
      • Notes degree of matching between EST & consensus

The Value of Cluster Data
  • Microarray studies: clusters represent unique forms associated with a specific state
  • Gene discovery: unique transcripts revealed in association with expression libraries, especially in little studied organisms
  • Functional annotation: virtual genes can be searched against the database to provide functional annotation of the products of a genome
  • Expressed gene structure: exon boundaries are revealed by transcript confirmation

How to trap useful genome sequence to manufacture a genome virtually?
Gene level approach
  • Trap Expressed Sequence Tags
  • Combine to reconstruct virtual genes
  • Manufacture a substrate for microarray studies
  • Annotate and analyse these genes
  • Compare between species
  • Species-specific characteristics
  • Reveal genes under selection
[Figure labels: Raw ESTs, Clusters, Alignments, Consensi, Joined consensi, predict CDS, Protein fragments, NNNNN, Virtual protein.]

Sequence and transcript reconstruction

Detection of virulence genes in malarial pathogens
Rahlston Muller

Reconstruction of transcripts from gene expression projects in t
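
The word table and multiplicity comparison from the D2-cluster slide can be illustrated with a short sketch. This is not the d2_cluster implementation. The word length of 6 is taken from the example words on the slide (acggtc, cggtca), the function names are hypothetical, and the sliding-window step (scoring over 100 or 150 bases) is left out for brevity.

```python
from collections import Counter


def word_counts(seq, k=6):
    """Build the word table: count every overlapping word of length k."""
    seq = seq.lower()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_score(seq_a, seq_b, k=6):
    """Sum of squared differences in word multiplicity (0 = identical word content)."""
    counts_a = word_counts(seq_a, k)
    counts_b = word_counts(seq_b, k)
    # Counter returns 0 for a word that is absent from one of the sequences.
    words = set(counts_a) | set(counts_b)
    return sum((counts_a[w] - counts_b[w]) ** 2 for w in words)
```

No alignment is computed, so a single miscalled base only disturbs the few words that overlap it, which is one way to read the slide's "very error tolerant".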
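
Transitive closure, listed on the D2-cluster slide, means that if EST A matches B and B matches C, all three fall into one cluster even when A and C were never matched directly. Below is a minimal union-find sketch of that idea, assuming the pairwise matches have already been produced by some scoring step. The function name and the identifiers used in the usage note are hypothetical.

```python
def transitive_closure_clusters(ids, matches):
    """Group ids into clusters given pairwise matches, closing links transitively."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())
```

For example, `transitive_closure_clusters(["e1", "e2", "e3", "e4"], [("e1", "e2"), ("e2", "e3")])` groups e1, e2 and e3 together and leaves e4 as a singleton. Because one spurious match is enough to merge two clusters, the same mechanism also explains why loose clustering needs attention to paralogs.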
