Applications and Main Algorithms of Bioinformatics

Definitions from the US National Institutes of Health (NIH)

Bioinformatics: research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational biology: the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

What is bioinformatics / computational biology?

Two aspects of bioinformatics
  • Theoretical studies: data analysis, algorithms
  • Application: tools, applications

A new way of doing research in the post-genome era
  • Discover the laws of biology
  • Decode the genetic code of living things
  • Understand the nature of life
  • Study the relationships among genomic data
  • Analyze existing genomic data
  • Use mathematical models and artificial intelligence techniques

I. Fundamentals of molecular biology

Classification of organisms
  • Eukaryotes, prokaryotes, archaea
  • Eukaryotic cells, prokaryotic cells

The chemical basis of life

Types of biological molecules (1): monosaccharides, disaccharides, oligosaccharides, polysaccharides
Types of biological molecules (2): lipids
Types of biological molecules (3): amino acids, peptide bond, φ (C-N) and ψ (C-C) angles
Types of biological molecules (4)
  • An α-helix
  • A two-stranded β-sheet
  • β-turns are four amino acids long and are stabilized by i to i+3 hydrogen bonds.

Protein domain
  • A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently of the rest of the protein chain.
  • Each domain forms a compact three-dimensional structure and often can be independently stable and folded. Many proteins consist of several structural domains.
  • One domain may appear in a variety of different proteins.
  • Molecular evolution uses domains as building blocks, and these may be recombined in different arrangements to create proteins with different functions.

Types of biological molecules (5)
Types of biological molecules (6): nucleotide, base, nucleic acid, DNA, RNA
Types of biological molecules (7): rRNA, tRNA, mRNA

Levels of organization of chromatin
  DNA, histone, nucleosome, nucleosome filament, chromosome

An overview of the flow of genetic information
  • Central dogma of genetics
  • Gene expression

Gene structure of prokaryotes
  [Figure: organization of a bacterial operon. Genome DNA carrying ORF1, ORF2 and ORF3 in one gene is transcribed, between the transcription initiation site and the transcription termination site, into a single transcript (mRNA); translation from the translation start sites yields the proteins.]
  • 1 mRNA, many proteins.
  • 1 transcript, many proteins.
  • Gene structure: the operon of prokaryotic genes

Gene structure of eukaryotes
  [Figure: organization of a eukaryotic gene. Genome DNA with exons 1 to 5, introns, 5'-UTR and 3'-UTR is transcribed into capped mRNAs with poly(A) tails (An) and translated into proteins. Alternative products, exons 1-2-3-4-5 versus 1-2-3-5, are shown for liver cells and neurons with the proportions 91% / 9% and 18% / 82%.]
  • 1 mRNA, 1 protein.
  • 1 gene, many proteins.
  • Gene structure: eukaryotic genes

II. Introduction to biological databases

  • Primary databases: the data come directly from raw experimental results and have only been through simple sorting, curation and annotation.
  • Secondary databases: the result of organizing and classifying raw biomolecular data; they are built for specific applications on the basis of primary databases, experimental data and theoretical analysis.

Bioinformatics databases and tools

  Object         Method                    Data                Database
  Chromosome     Genome mapping            Genome map          Genome databases
  Nucleic acid   Sequencing                DNA sequence        Nucleic acid sequence databases
  Protein        Sequencing                Protein sequence    Protein sequence databases
  Protein        Structure determination   Protein structure   Protein structure databases

  Secondary and composite databases are built on top of these.

Major international bioinformatics centers
  • NCBI: National Center for Biotechnology Information (US)
  • EBI: European Bioinformatics Institute (EU)
  • HGMP: Human Genome Mapping Project Resource Centre (UK)
  • ExPASy: Expert Protein Analysis System (Switzerland)
  • CMBI: Centre for Molecular and Biomolecular Informatics (the Netherlands)
  • ANGIS: Australian National Genomic Information Service (Australia)
  • NIG: National Institute of Genetics (Japan)

Generally accessible through the web
  • Google: http:/www.google./
  • BioHunt: /biohunt/
  • Amos' links: www.expasy.ch/alinks.html
  • Nucleic Acids Research: /nar/database/c/
  • http://bioinformatics.ca/links_directory/

New data

Sequence formats used by different databases
The first problem met when running sequence analysis software is how to use different sequence formats with different programs. All of these formats are standard ASCII text files, but they differ in the characters or words used to mark the various pieces of information or the sequence itself.
  1. GenBank DNA sequence format
  2. EMBL sequence format
  3. SwissProt sequence format
  4. FASTA sequence format
  5. NBRF sequence format
  6. IntelliGenetics sequence format
  7. GCG sequence format
  8. PIR/CODATA sequence format
  9. Plain/ASCII.Staden sequence format
  10. ASN.1 sequence format
  11. GDE format

The general-purpose sequence database format: FASTA format (or Pearson format)
  • The first line of a sequence file is free text that begins with a greater-than sign (>) and serves mainly to label the sequence.
  • The sequence itself starts on the second line, written in standard nucleotide symbols or one-letter amino acid codes. Nucleotide symbols may usually be upper or lower case, while amino acids are generally written in upper case.
  • No line in the file should exceed 80 characters (60 is typical).
  [Examples shown on the slides: a nucleic acid sequence and an amino acid sequence]

The most important nucleic acid sequence databases
  • EMBL, European Molecular Biology Laboratory: http://www.embl-heidelberg.de
  • GenBank, US National Center for Biotechnology Information: /web/genbank/index.html
  • DDBJ, National Institute of Genetics, Japan: http://www.ddbj.nig.ac.jp/
  [Figure: GenBank (NCBI, NIH; Entrez), EMBL (EBI; SRS) and DDBJ (NIG, CIB; getentry) each receive submissions and updates and exchange data with one another.]

Microbial genome data resources
  • Individual genomes: NCBI's Entrez Gene, EBI's Integr8, UCSC Genome Browser
  • Multigenome: CMR, ERGO, Microbes Online, PUMA2, IMG, and MBGD
  • Organism-specific annotation resources: PeerGAD, ASAP
  • Integrated annotation resources: SEED, PUMA2
  • Metagenome resources: GenBank, IMG/M, m

Protein sequence databases
  • SwissProt + TrEMBL
  • The Protein Information Resource (PIR)
  • UniProt

Protein domain databases
  • PROSITE: /prosite/
  • ProDom: http://prodom.prabi.fr/prodom/current/html/home.php
  • Pfam: http://pfam.sanger.ac.uk/
  • SMART: http://smart.embl-heidelberg.de/
  • InterPro: http://www.ebi.ac.uk/interpro/

Protein structure databases
  • PDB: /
  • PDBsum: http://www.ebi.ac.uk/pdbsum/

Structural classification of proteins
  • SCOP (Structural Classification of Proteins): http://scop.mrc-lmb.cam.ac.uk/scop/
  • CATH (Class, Architecture, Topology, Homology): /

Protein secondary structure databases
  • DSSP (Definition of Secondary Structure of Proteins), a database of conformational parameters of protein secondary structure: http://www.cmbi.kun.nl/gv/dssp/
  • FSSP (Families of Structurally Similar Proteins), a protein family database: http://www2.embl-ebi.ac.uk/dall/fssp/
  • HSSP (Homology-derived Secondary Structure of Proteins), a database of homologous proteins: http://www.cmbi.kun.nl/gv/hssp/

Metabolism-related databases
  • ENZYME: /enzyme/
  • BRENDA: /
  • KEGG: http://www.genome.jp/kegg/
  • BioCyc: /

Databases of functional genomics
  • ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
  • Gene Expression Atlas: http://www.ebi.ac.uk/gxa/
  • IntAct: http://www.ebi.ac.uk/intact/main.xhtml
  • SWISS-2DPAGE: /ch2d/

III. Main theoretical methods of bioinformatics
  • Methods based on data mining (knowledge discovery): extract the hidden patterns from huge quantities of experimental data, and form hypotheses as a result.
  • Methods based on simulation (simulation-based analysis): test hypotheses with in silico experiments, providing predictions to be tested by in vitro and in vivo studies.

Main algorithms in bioinformatics

IV. Bioinformatics methods and their applications
  • Data acquisition: similarity search (BLAST)
  • Data analysis: multiple sequence alignment, phylogenetic analysis, motif extraction, functional site prediction
  • Protein structure model prediction: homology modeling

Why do database search?
The sequence itself is not informative; it must be analyzed by comparative methods against existing databases to develop hypotheses concerning relatives and function.
  • Does the DNA sequence contain a gene?
  • Is the gene a member of a known gene family?
  • What is the encoded protein?
  • What is the function of the protein?
  • Do other organisms have the protein or the gene?

Sequence implications
  • The DNA sequence of a gene determines the amino acid sequence of a protein.
  • The primary sequence of a gene determines its structure and function.
  • Proteins with similar sequences and structures have similar functions.
  • Similar sequences should have long regions of identical or similar residues. Why?
  • Functional convergence without sequence or structure similarity is very rare.

Important terms for sequence similarity searching, with very different meanings
  • Similarity: the extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score.
  • Identity: the extent to which two (nucleotide or amino acid) sequences are invariant.
  • Homology: similarity attributed to descent from a common ancestor.
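Of the three terms above, identity is the one that reduces to a simple count over an existing pairwise alignment. The following is a minimal Python sketch, assuming the two sequences are already aligned and use "-" for gaps. The function name `percent_identity`, the toy sequences and the choice of denominator (columns without a gap) are illustrative assumptions; several conventions for the denominator are in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns whose residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Assumed convention: ignore every column where either sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy pairwise alignment (hypothetical example data).
identity = percent_identity("CKHVFCRVCI", "CKKCFC-KCV")
print(f"{identity:.1f}% identity over gap-free columns")
```

A percentage like this describes identity only. Whether the two sequences are homologous remains a separate yes/no inference about common ancestry.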
It is your responsibility as an informed bioinformatician to use these terms correctly: a sequence is either homologous or not. Don't use % with this term!

Some definitions
  • Scoring: quantifies the goodness of an alignment. An exact match has the highest score, a substitution a lower score, and insertions and gaps may have negative scores.
  • Substitution matrix: a symmetrical 20 × 20 matrix (20 amino acids to each side). Each element gives a score that indicates the likelihood that the two residue types would mutate to each other in evolutionary time.
  • Gap penalty: evolutionary events that make gap insertion necessary are relatively rare, so gaps have negative scores. Three types:
      1. A single gap-open penalty. This tends to stop gaps from occurring, but once they have been introduced they can grow unhindered.
      2. A gap penalty proportional to the gap length. This works against larger gaps.
      3. A gap penalty that combines a gap-open value with a gap-length value.

DNA similarity scoring: a common method (DNA defaults)
  • Terminal mismatches (0)
  • Match score (10)
  • Mismatch penalty (-9)
  • Gap penalty (50)
  • Gap extension penalty (3)

Commonly used substitution matrices
  • PAM (Dayhoff)
  • BLOSUM (Henikoff)

Four candidate alignments of the same two sequences:

  (1) CKHVFCRVCI      (2) CKHVFCRVCI      (3) C-KHVFCRVCI      (4) CKH-VFCRVCI
      CKKCFC-KCV          CKKCFCK-CV          CKKC-FC-CKV          CKKC-FC-KCV

However, if we used the more realistic PAM 250 substitution matrix then these alignments would have different scores (and the NWS algorithm would have picked the alignment with the highest one).

Score with PAM 250 and gap penalty -10:
  (1) 36 + 5 + 0 - 2 + 9 - 2 + 4 - 10 = 40
  (2) 36 + 5 + 0 - 2 + 9 + 5 + 4 - 10 = 47
  (3) 36 + 5 - 3 + 9 - 2 + 4 - 3 × 10 = 19
  (4) 36 + 5 + 0 + 9 - 2 + 4 - 3 × 10 = 22

The gap penalty is important; biology does not like gaps.

BLAST searches of biological databases
  • BLAST is a database search program based on sequence similarity, developed by the US National Center for Biotechnology Information (NCBI).
  • BLAST stands for Basic Local Alignment Search Tool.
  • BLAST is a package of sequence similarity search programs. It contains many independent programs, which are defined by the type of query and the type of database. For example, if the query is a nucleic acid sequence and the database is also a nucleic acid sequence database, the program to choose is blastn.

BLAST programs

Details of the BLAST algorithm
3 steps:
  1. Compile a list of high-scoring words
  2. Scan the database for "hits"
  3. Extend hits

BLAST word matching
  Break the query into words:
    MEAAVKEEISVEDEAVDKNI
    MEA EAA AAV AVK VKE KEE EEI EIS ISV ...
  Break the database sequences into words:
    Database sequence word lists: RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS LFC WDV AAV KVR PFR DEI ...
  Compare word lists:
    Query word list: MEA EAA AAV AVK VKL KEE EEI EIS ISV
    Compare word lists by hashing (allow near matches).
  Find locations of matching words in database sequences:
    ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT
    TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
    IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH
    Words: MEA EAA AAV AVK KLV KEE EEI EIS ISV
  Use two word matches as anchors to build an alignment between the query and a database sequence. Then score the alignment.
    Query:    HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCCWA
    Seq_XYZ:  QSVFDYIYYGCYCGWGLG_GK_PRDA
    E-value = 10^-13

HSPs are aligned regions
  • The results of the word matching and the attempts to extend the alignment are segments called HSPs (high-scoring segment pairs).
  • BLAST often produces several short HSPs rather than a single aligned region.

BLAST parameters
  • Nucleotide BLAST (blastn): word size 7, 11 or 15 (analogous to ktup)
  • Protein-protein BLAST (blastp): word size 2 or 3; matrix PAM30 or PAM70, BLOSUM80, BLOSUM62 or BLOSUM45; gap 12/1, 11/1, 10/1, 9/2, 8/2, 7/2

Two values BLAST uses to assess sequence similarity
  • Score: the matched segment is scored with a scoring matrix, and the score is the sum over all aligned pairs of amino acid residues (or bases). In general, the longer the matched segment and the higher the similarity, the larger the score.
  • E value: a measure of how likely it is that the score arises by chance, that is, from scoring sequences of the same length whose residues (or bases) are randomly ordered. Strictly, it is the expected number of chance hits scoring at least this high. The smaller the E value, the less likely the score is to occur at random.

Sensitivity vs. selectivity
  • Sensitivity: the ability to find as many sequences with some degree of similarity as possible.
  • Selectivity: the ability to find, as accurately as possible, the similar sequences that are useful for the research purpose.

Example multiple alignment of four sequences (two blocks; the column spacing and conservation marks of the original figure are not preserved):
  CHITE  -ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
  WHEAT  -DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
  TRYBR  KKDSNAPKRAMTSFMFFSSDFRS-KHSDLS-IVEMSKAAGAAWKELGP
  MOUSE  -KPKRPRSAYNIYVSESFQ-EAKDDS-AQGKLKLVNEAWKNLSP

  CHITE  AATAKQNYIRALQEYERNGG-
  WHEAT  ANKLKGEYNKAIAAYNKGESA
  TRYBR  AEKDKERYKREM-
  MOUSE  AKDDRIRYDNEMKSWEEQMAE
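The word-matching stage of BLAST described earlier (break the query and the database sequences into words, then compare the word lists by hashing) can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it finds exact word matches only, whereas BLAST also accepts near matches that score above a threshold. The function names and the two toy database sequences are hypothetical.

```python
from collections import defaultdict


def words(seq: str, w: int = 3) -> list[str]:
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def index_database(db: dict[str, str], w: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Hash table: word -> list of (sequence name, position) occurrences."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos, word in enumerate(words(seq, w)):
            index[word].append((name, pos))
    return index


def find_hits(query: str, index, w: int = 3) -> list[tuple[int, str, int]]:
    """Exact word hits as (query position, sequence name, database position)."""
    hits = []
    for qpos, word in enumerate(words(query, w)):
        for name, dpos in index.get(word, []):
            hits.append((qpos, name, dpos))
    return hits


query = "MEAAVKEEISVEDEAVDKNI"               # query sequence from the slides
toy_db = {"toy1": "GGMEAGG", "toy2": "AAVKK"}  # hypothetical database
hits = find_hits(query, index_database(toy_db))
for qpos, name, dpos in hits:
    # Hits with the same (qpos - dpos) lie on one diagonal and can anchor an extension.
    print(name, query[qpos:qpos + 3], "diagonal", qpos - dpos)
```

In this toy run the two hits in `toy2` fall on the same diagonal, which is the situation the slides describe as using two word matches as anchors for building an alignment.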
Potential uses of a multiple sequence alignment?
  • Extrapolation
  • Motifs/patterns
  • Phylogeny
  • Profiles
  • Structure prediction
Multiple alignments are central to most bioinformatics techniques.
Protein sequences that are 30% identical often have the same structure and function.

Profiles and motifs
  • A sequence motif is a locally conserved region of a sequence or a short sequence pattern shared by a set of sequences.
  • The term motif refers to any sequence pattern that is predictive of a molecule's function, a structural feature, or a family membership.
  • Motifs can be detected in protein, DNA and RNA sequences, but the term most commonly refers to protein motifs.
  • Motifs can be represented in different forms for different computational purposes:
      1. Flexible patterns, e.g. [KR]-R-P-C-x(11)-C-V-S (qualitative, unweighted; see the PROSITE database)
      2. Position-specific scoring matrices (PSSM, see next page)
      3. Profile hidden Markov models (HMM). These are a rigorous probabilistic formulation of a sequence profile. They contain the same probability information as PSSMs but can also account for gaps.

Profiles and motifs (continued)
  • Profile: the conserved regions of a multiple sequence alignment represented as a matrix. A profile can allow for gaps in the sequence alignments. The columns of the matrix contain substitution scores for the amino acids. The rows contain scores for matching columns of the profile to a test sequence.
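The flexible-pattern representation listed above maps naturally onto a regular expression. The sketch below, in Python, translates a small subset of PROSITE pattern syntax (single residues, [..] alternatives, {..} exclusions, the wildcard x, and repeat counts such as x(11) or x(2,4)) and applies it to a made-up test sequence. The function name `prosite_to_regex` and the test sequence are illustrative assumptions, and anchors and other PROSITE features are deliberately not handled.

```python
import re

# One pattern element: a core (residue, [..], {..} or x) plus an optional (n) or (n,m).
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a subset of PROSITE pattern syntax into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("[KR]-R-P-C-x(11)-C-V-S")
test_sequence = "AA" + "KRPC" + "G" * 11 + "CVS" + "AA"  # made-up sequence
hit = re.search(regex, test_sequence)
print(regex, hit.span() if hit else None)
```

A pattern of this kind is qualitative: a sequence either matches or it does not. A PSSM or profile HMM instead assigns a graded score to every position.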
  • Profile: a table which indicates the likelihood that any sequence symbol or gap would occur at any particular position in a "consensus" sequence generated from a set of aligned sequences.

Position-specific scoring matrix
This corresponds to the flexible pattern of the paired box: [KR]-R-P-C-x(11)-C-V-S

Pos   A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z   *   -
  1 -22 -22 -35 -26 -15 -37 -30  -9 -38  35 -36 -23 -16 -34  -5  53 -23 -24 -35 -40 -19 -31  -9   0   0
  2 -51 -52 -62 -57 -46 -64 -59 -33 -66 -16 -63 -49 -44 -64 -34  70 -51 -53 -63 -64 -46 -57 -40   0   0
  3 -42 -58 -59 -55 -53 -68 -59 -54 -63 -51 -65 -57 -62  73 -54 -56 -50 -53 -59 -72 -51 -69 -54   0   0
  4 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
  5 -21 -38 -19 -41 -30 -29 -43 -36   6  32 -16 -13 -35 -44 -25 -15 -34 -22  47 -41 -18 -36 -27   0   0
  6 -21   6  -8 -12 -27  -7 -25 -13  26 -22  23   8  30 -39 -21 -23 -20 -13  10 -30  -9 -19 -24   0   0
  7 -31 -40 -21 -43 -34 -23 -48 -36  50  33  -9  -8 -37 -47 -27 -17 -39 -28   5 -46 -20 -33 -30   0   0
  8 -27 -36 -24 -38 -30 -12 -40 -30  -3  31  39   3 -32 -42 -20 -11 -35 -28 -10 -37 -16 -29 -24   0   0
  9  -5  11  -7  -8 -18 -24 -15 -11   2 -17 -17 -13  35 -32 -17 -20  20  -2  23 -33  -7 -26 -18   0   0
 10  24 -20   0 -22 -19 -21 -12 -20   5 -19 -12  -9 -16 -24 -19 -22  21   0  24 -29  -7 -25 -19   0   0
 11  21  11  -3  -6 -16 -28 -10  -9 -19 -13 -25 -17  33 -26 -13 -16   2  28 -10 -35  -8 -25 -14   0   0
 12  -3 -17  -4 -21 -21 -11 -19 -18  -1 -20  19   2 -12 -29 -19 -21  20  27  -3 -30  -6 -21 -20   0   0
 13 -18  16 -17  33  -6 -20 -26  52   2 -21 -17 -13  -5 -35 -12 -19 -21 -16  20 -30 -10  -8 -10   0   0
 14 -26 -41 -12 -45 -40  10 -43 -10  30 -33   5  45 -37 -44 -27 -31 -34 -21   7  -4 -17  45 -33   0   0
 15 -27  12 -22  33 -13  -8 -28 -21 -10 -27  -5  42 -15 -40 -20 -28 -28 -24 -14  73 -14  -5 -17   0   0
 16 -42 -69  99 -75 -84 -49 -66 -72 -43 -76 -54 -53 -62 -79 -74 -75 -51 -48 -42 -65 -58 -59 -79   0   0
 17 -40 -73 -33 -75 -63 -45 -72 -68  -6 -66 -29 -28 -71 -71 -65 -67 -59 -40  64 -57 -45 -56 -64   0   0
 18 -25 -40 -35 -44 -45 -59 -39 -45 -60 -47 -63 -56 -36 -55 -47 -48  61 -24 -52 -66 -39 -57 -46   0   0

Automated multiple alignment methods
  • Multi-dimensional dynamic programming
  • Progressive alignment
  • Iterative alignment

Multi-dimensional dynamic programming (simultaneous multiple alignment)
Accurate, but the amount of computation is enormous:
  • 2 sequences of length n need n^2 comparisons.
  • The number of comparisons increases exponentially, i.e. n^N, where n is the length of the sequences and N is the number of sequences.
  • It quickly becomes impractical, even for a small number of short sequences.

Progressive alignment
  • Devised by Feng and Doolittle in 1987.
  • Essentially a heuristic method, and as such not guaranteed to find the optimal alignment.
  • Requires (N-1) + (N-2) + ... + (N-N+1) = N(N-1)/2 pairwise alignments as a starting point.
  • Can reduce complexity.
  • The most successful implementation is Clustal (Des Higgins).

Overview of the ClustalW procedure
  1. Quick pairwise alignment: calculate the distance matrix

       1  PEEKSAVTALWGKVN-VDEVGG
       2  GEEKAAVLALWDKVN-EEEVGG
       3  PADKTNVKAAWGKVGAHAGEYGA
       4  AADKTNVKAAWSKVGGHAGEYGA
       5  EHEWQLVLHVWAKVEADVAGHGQ

                       1     2     3     4     5
       Hbb_human  1    -
       Hbb_horse  2   .17    -
       Hba_human  3   .59   .60    -
       Hba_horse  4   .59   .59   .13    -
       Myg_whale  5   .77   .77   .75   .75    -

  2. Neighbor-joining tree (guide tree)
       [Figure: guide tree with leaves Hbb_human, Hbb_horse, Hba_horse, Hba_human, Myg_whale; alpha-helices marked]
  3. Progressive alignment following the guide tree (Clustal W)

Iterative methods
  • Repeat the alignment of subgroups of the sequences, then align the subgroups into a global alignment.
  • Iterative alignment: remove a sequence from the alignment and realign it; repeat the realignment until the alignment score converges.
  • MultAlin recalculates the pairwise scores during production of the progressive alignment, uses these scores to recalculate the dendrogram, and repeats this until the alignment score no longer increases.

3. Multiple alignment websites
  • ClustalW: http://www.ebi.ac.uk/clustalw/
  • MultAlin: http://www.toulouse.inra.fr/multalin.html
  • T-Coffee: /
  • MAFFT: http://myhits.isb-sib.ch/cgi-bin/mafft
  • MUSCLE: http://www.bioinformatics.nl/tools/muscle.html
  • ProbCons: /

Phylogenetics
  • Phylogenetics (phylogenetic systematics, cladistics) is a method of classifying organisms on the basis of their evolutionary history. It was developed by Willi Hennig, a German entomologist, in 1950.
  • Classical phylogenetics: relies mainly on physical or phenotypic characters, such as an organism's size, color, or number of antennae.
  • Modern phylogenetics: uses information extracted from genetic material as the characters of a species, specifically nucleic acid sequences or protein molecules.
  • Research on the origin of modern humans: mitochondrial DNA indicates that all modern humans are descendants of a single African woman.

Molecular evolution
The change of biological macromolecules over the course of evolution. It mainly covers the evolution of protein molecules, the evolution of nucleic acid molecules, and the evolution of the genetic code. It derives from the neutral theory.
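Distance-based methods, whether used for a phylogeny or for the ClustalW guide tree shown earlier, start from a matrix of pairwise distances. The Python sketch below uses the five-sequence distance matrix from the ClustalW overview to count the pairwise comparisons, N(N-1)/2, and to find the closest pair. It is not neighbor joining itself, which applies a corrected distance criterion; the variable and function names are illustrative assumptions.

```python
from itertools import combinations

names = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse", "Myg_whale"]

# Lower triangle of the distance matrix from the ClustalW overview.
lower = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
]


def distance(i: int, j: int) -> float:
    """Symmetric lookup into the lower-triangular matrix (i != j)."""
    if i < j:
        i, j = j, i
    return lower[i][j]


pairs = list(combinations(range(len(names)), 2))
n_pairwise = len(pairs)  # N(N-1)/2 pairwise alignments for N sequences

# Closest pair: a natural first cluster for a simple agglomerative method.
closest = min(pairs, key=lambda pair: distance(*pair))
closest_names = (names[closest[0]], names[closest[1]])

print(n_pairwise, closest_names, distance(*closest))
```

With these values the two alpha-globins are the closest pair and the two beta-globins the next closest, consistent with the grouping of the leaves in the guide tree.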
