




Outline for Today's Discussion
- basic facts about DNA, RNA, and proteins
- examples of computational problems
- examples of algorithmic techniques

Main Bio-molecules
- DNA: encodes genetic information.
- RNA: copies and transports such information to produce proteins.
- Protein: performs various biological functions.

Basic Facts about DNA
- 4 nitrogenous bases: Adenine, Cytosine, Guanine, Thymine
- nucleotide = base + phosphate + sugar
- strand = a polymer of nucleotides
- double strands: two complementary strands
- binding rule: A-T and C-G
- 3D structure: double helix
- chromosome = a complete DNA molecule in a cell
- genome = the whole set of chromosomes in a cell

DNA Structure

Genome at 4 Levels of Details

Example of Chromosome Sizes

Basic Facts about RNA
- 4 nitrogenous bases: Adenine, Cytosine, Guanine, Uracil
- nucleotide = base + phosphate + sugar
- strand = a polymer of nucleotides
- single strand
- binding rule: A-U, C-G, and others.
- 3D structure: much more complex than double helix

Basic Facts about Protein
- 20 amino acid residues
- strand = a polymer of amino acids
- single strand
- binding rule: complicated.
- 3D structure: much more complex than RNA's
- 3D structure → function of a protein

Importance of Protein Folding
- The 3D structure significantly determines the function.

Key Facts about Genes
- gene = segment of a chromosome
- gene → protein
- codon = 3 consecutive DNA bases
- codon → protein character
- gene = partitioned into introns and exons
- exon = codons
- intron = "meaningless" region
- Genes may be nested

DNA → RNA → Protein

Functional Assignment using Gene Ontology
- 13,601 genes, Drosophila

Gene Number in the Human Genome: The Gene Counting Problem
- The number will probably never be known exactly.
- Current estimates: 30,000-40,000
- Other estimates: 120,000
- Gene discovery: sequence analysis, motif recognition, matches to mRNA, computational predictions, mouse data matches, experimental validation

Some Main Areas of Bioinformatics
- A key goal of bioinformatics: to study biological systems based on global knowledge of genomes, transcriptomes, and proteomes.
- Genome: entire sets of materials in the chromosomes.
- Transcriptome: entire sets of gene transcripts.
- Proteome: entire sets of proteins.

Perspectives

Proteomics
- Proteome: all proteins encoded within a genome
- half a million distinct proteins (temporal, spatial, modifications)
- 30,000 human genes
- mRNA and protein expressions may not correlate
- Proteomics: study of protein expression by biological systems
- relative abundance and stability; post-translational modifications
- fluctuations as a response to environment and altered cellular needs
- correlations between protein expression and disease state
- protein-protein interactions, protein complexes
- Technologies: 2D gel electrophoresis, mass spectrometry, yeast two-hybrid system, protein chips

Protein Identification: HPLC-MS-MS
- Figure labels: Proteins; Peptides; One Peptide; B-ions / Y-ions; Tandem Mass Spectrum (axis: Mass/Charge)

Peptide Fragmentation and Ionization
- Figure labels: B-ion; Y-ion
- Complementary: Mass(B-ion) + Mass(Y-ion) = Mass(peptide) + 4H + O

Tandem Mass Spectrum
- Axes: Mass / Charge (ticks 50, 100, 200, 400); Abundance (100%)
- Peaks: 88.033, 175.113, 274.112, 361.121, 430.213, 448.225

Raw Tandem Mass Spectrum

Protein Database Search
- Find the peptide sequences in a protein database that optimally fit the spectrum.
- It does not work if the target peptide sequence is not in the database.
- It does not work if there is an unknown modification at some amino acid.
- It is very slow because it must search the entire database.
- E.g., SEQUEST, Yates, Univ. of Washington.

De Novo Peptide Sequencing Problem
- Input: (1) the mass W of an unknown target peptide, and (2) a set S of the masses of some or all b-ions and y-ions of the peptide.
- Output: a peptide P such that (1) mass(P) = W and (2) S is a subset of all the ion masses of P.
- Example spectrum: axes Mass / Charge (ticks 50, 100) and Abundance (100%); peaks 274.112, 361.121; peptide mass 429.212 Daltons
- P = SWR
- Mass(P) = 429.212
- Ions(P) = 88.033, 175.113, 274.112, 361.121, 430.213, 448.225

Amino Acid Mass Table

Two Complementary Problems for Protein Folding
- Protein Folding Prediction: given a protein sequence, determine the 3D folding of the sequence.
- Protein Sequence Design: given a 3D structure, determine the fittest protein sequence for the structure, i.e., one that has the smallest energy among all possible sequences when folded into the structure.

Complexity for Protein Folding Problems
- Protein Folding Prediction: NP-hard under various models.
- Protein Sequence Design: solvable in polynomial time under the Grand Canonical model.

Key Steps of Sequencing DNA
- Goal = determine the character sequence of a DNA molecule.
- Duplicate many copies.
- Cut the copies into fragments.
- Determine the character sequence of small fragments (by means of recursion or lab instruments).
- Compare, align, and order the fragments into the original sequence.

Polymerase Chain Reaction

Cutting DNAs

BLAST (Basic Local Alignment Search Tool)
- A suite of sequence comparison algorithms, optimized for speed, used to search sequence databases for optimal local alignments to a protein or nucleotide query

Comparison of Sequencing Strategies
- Mapping and Walking
- Mapping and Clone by Clone Shotgun
- Whole Genome Shotgun with Mate Pairs
- Scale: Lab-Intense (SLOW) to Compute-Intense (FAST)

Mate-Pair Shotgun DNA Sequencing
- DNA target sample
- SHEAR & SIZE: e.g., 10Kbp, 8% std. dev.
- CLONE & END SEQUENCE: End Reads / Mate Pairs
- Figure labels: 590bp; 10,000bp

Whole Genome Sequencing Approaches
- Mapping and Shotgun
- Clone by Clone Shotgun sequencing

Obstacles to Genome Sequencing
- Sequencing reactions produce short reads (550bp).
- Human genome: 3 billion bases; sequence read: 550 bases
- The human genome is repeat-rich.
- Many short reads look identical to each other.
- Figure labels: GCATTA.GACCGT; CGGATAGACATAAC, CGGATAGACATAAC, CGGATAGACATAAC; CAGCAGCAGCAGCA, CAGCAGCAGCAGCA, CAGCAGCAGCAGCA

Order & Orientation is Essential to Finding Genes
- Exon 1, Exon 2, Exon 3, Exon 4
- But if contigs are not correctly put together: 1, 4, 3 reversed, 2
- Exons are shuffled and unoriented, significantly impacting the ability of gene finding programs to make a correct prediction.
- Users consistently report finding genes that they can't find elsewhere.

Celera's Sequencing / SNP Discovery Center

Celera Supercomputing Facility
- Celera's system is one of the most powerful civilian super-computing facilities in the world
- Currently over 1.5 teraflops of computing power in a virtual compute farm of Compaq processors with 100 terabytes of storage
- Next phase: a 100 teraflop computer

Human Genome Sequence from 5 Humans (3 females, 2 males) completed
- Human sequencing started 9/8/99
- Over 39X coverage of the genome in paired plasmid reads
- First assembly announced June 26: 2.9 billion bp
- Published in Science, February 16, 2001

Evolutionary Trees
- definition: a tree with distinct labels at leaves
- leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc.
- Figure labels: ancestral species; present-day species: bird, plum, peach, rice, wheat (Just a joke!)

Evolutionary Trees
- leaf labels: DNA sequences
- Figure labels: bird, plum, peach, rice, wheat; AAGT, CCAG, CCAT, CGGG, CGGC (Just a joke!)

Problem Formulation
- Figure labels: bird, plum, peach, rice, wheat; AAGT, CCAG, CCAT, CGGG, CGGC (Just a joke!)
- Input: DNA sequences of present-day species
- Output: the true evolutionary tree
- Question: What is "true"? Need a model!

A Fundamental Problem of Biology
- Since the time of Charles Darwin (Charles Robert Darwin, 1809-1882; Origin of Species, 1859)
- Problem: reconstruct the evolutionary history of all known species.
- Importance: intellectually fascinating; practical benefits (medicine, food)

Protein Evolution
- Tree of Life & Evolution of Protein Families (Dayhoff, 1978)
- Orthologous Gene Family: Organismal and Sequence Trees Match Well

Evolutionary Tree
- Relationships Between IMP Dehydrogenase Families

Evolutionary Tree
- Relationships Between G-Protein-Coupled Receptor Families (One of the Largest and Most Diversified)

Main Difficulties
- Availability of data: there are hundreds of millions of species, unlikely to be all available any time soon or ever. But DNA sequences of more and more species are becoming available.
- Extracting information from data

Data Mining Flowchart
- Boxes: true tree (unknown); collect & process individual sequences; compare & align multiple sequences; tree reconstruction algorithms; tree verification (compare & refine); evolution models; generate sequences; further process
- Edge labels: parameters; distance or characters; trees; information; refine; infer; parameters

Other Applications of Evolutionary Trees
- A tree that conceptually models the evolutionary relationship of species or organisms
- Applications outside biology: linguistics (evolution of words), statistical classifications, tracking computer viruses

Primary Information Captured in Evolutionary Trees
- most recent common ancestor
- Figure labels: X1, X2, X3, X4, X5; AAGT, CCAG, CCAT, CGGG, CGGC; yes; no; omit to simplify computation/structure

Why Need to Compare Trees?
- For the same given set of species or organisms, different (1) data, (2) evolution models, (3) biological intuitions, or (4) tree construction algorithms may yield different trees.
- Tree comparison is a data mining tool for gaining information from multiple trees.

What Information to Gain from Comparison?
- dissimilarity measures: i.e., determine how different the given trees are.
- common structures in multiple trees: i.e., extract common evolutionary history from the trees.

How to Use Information Gained from Comparison?
- dissimilarity measures: reexamine (1) data, (2) evolution models, (3) biological intuitions, or (4) tree construction algorithms.
- common structures in multiple trees: common information is more reliable than non-common information.

Examples of Tree Comparisons
- Key points:
- There are lots of tree comparisons!
- How does one design or use tree comparisons?
- new tree comparison = new type of information
- data mining flow chart: a hunch for a certain kind of information; a math definition for tree comparison; algorithms; find new information

Examples #1 of Tree Comparisons
- Emphasis: dissimilarity measures
- 1. Good versus Bad Edges (Robinson-Foulds distance)
- 2. Subtree Transfer Distance
- Emphasis: common information
- 3. Maximum Common Refinement Subtree
- 4. Maximum Agreement Subtree (more technical details)

Good Edges
- Figure labels: Tree #1 with leaves X1, X2, X4, X3, X5; Tree #2 with leaves X3, X2, X4, X1, X5; one edge in each tree marked good
- Definition: good edge = same clustering

Bad Edges
- Figure labels: Tree #1 and Tree #2 as above; one edge in each tree marked bad
- Definition: bad edge = different clusterings

External Edges are Always Good Edges
- Figure labels: Tree #1 and Tree #2 as above; one edge in each tree marked good
- Definition: good edge = same clustering

Good versus Bad Edges
- Figure labels: Tree #1 and Tree #2 as above; edges marked good and bad in each tree
- Robinson-Foulds distance = (1) # of bad edges, (2) % of the internal edges being bad

Robinson-Foulds Distance
- Measure: Robinson-Foulds distance = (1) # of bad edges, (2) % of the internal edges being bad
- Intuitions: this measure counts how often two trees have different clusterings.
- Computational complexity: n = size of input trees
- Naive algorithm: O(n^2) time.
- Best algorithm: optimal O(n) time (Day, 1985).

Examples #4 of Tree Comparisons
- 4. Maximum Agreement Subtree (more technical details)

Basics of Maximum Agreement Subtrees
- Assumption: rooted binary evolutionary trees.
- Concepts:
- Information contents of a tree
- Evolutionary subtrees
- Agreement subtrees
- Maximum agreement subtrees

Information Content of Evolutionary Trees
- Number
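The Robinson-Foulds slides define a good edge as one that induces the same clustering in both trees and count the bad edges. A minimal sketch of that count follows, under stated assumptions: trees are rooted and written as nested Python tuples of leaf names, an internal edge is identified with the set of leaves below it, and the percentage is taken over the internal edges of both trees together. The function names `clusters` and `robinson_foulds` and the two example trees are hypothetical (the trees are not the ones drawn on the slides). This is a simple set-based version, not the linear-time algorithm of Day (1985).

```python
def clusters(tree):
    """Return the leaf sets below the internal edges of a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):              # a leaf is just its label
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)                       # one cluster per internal node
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)                  # the root has no edge above it
    return found


def robinson_foulds(tree_a, tree_b):
    """Return (# of bad edges, fraction of internal edges that are bad)."""
    ca, cb = clusters(tree_a), clusters(tree_b)
    bad = len(ca ^ cb)                         # clusterings found in one tree only
    total = len(ca) + len(cb)                  # internal edges in both trees
    return bad, (bad / total if total else 0.0)


# Hypothetical example trees over the leaves X1..X5.
tree_1 = ((("X1", "X2"), "X3"), ("X4", "X5"))
tree_2 = ((("X1", "X3"), "X2"), ("X4", "X5"))

bad_edges, bad_fraction = robinson_foulds(tree_1, tree_2)
print(bad_edges, bad_fraction)
```

Here the clusters {X1, X2, X3} and {X4, X5} appear in both trees, so their edges are good; {X1, X2} and {X1, X3} each appear in only one tree, so two edges are bad. External edges never enter the count because single-leaf clusters are identical in both trees, which matches the slide stating that external edges are always good.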
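Returning to the basic facts at the start of the deck, the DNA binding rule (A-T and C-G) and the definition of a codon as 3 consecutive bases can be illustrated with a short sketch. The function names `complementary_strand` and `codons` are hypothetical. Reading the complementary strand in the opposite direction reflects the antiparallel pairing of DNA strands, which the slides do not state explicitly. The input strings are sequences that appear on the slides, used here only as sample data.

```python
# Binding rule from the slides: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand):
    """Pair every base by the binding rule, reading the partner strand
    in the opposite direction (assumption: antiparallel strands)."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def codons(strand):
    """Split a strand into consecutive, non-overlapping triples of bases.
    Trailing bases that do not fill a whole codon are dropped."""
    usable = len(strand) - len(strand) % 3
    return [strand[i:i + 3] for i in range(0, usable, 3)]


print(complementary_strand("AAGT"))      # ACTT
print(codons("CGGATAGACATAAC"))          # ['CGG', 'ATA', 'GAC', 'ATA']
```

Applying `complementary_strand` twice returns the original strand, which is a quick sanity check on the pairing table.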