Bioinformatics Courseware

Uploaded by: 扣*** (IP location: Ningxia). Upload date: 2021-03-03. Format: PPT. Pages: 98. Size: 2.24 MB.

Document summary

Slide 1: Chapter 2. Pairwise sequence alignment

Slide 2: Sequence alignment
Question: are two sequences related?
Compare the two sequences and see whether they are significantly similar.
Example: "pear" and "tear" are similar words with different meanings.

Slide 4: Biological sequences
Similar biological sequences tend to be related and provide information on:
- function
- structure
- evolution

Slide 5: Biological sequences
Common mistake: sequence similarity is not homology!
Homologous sequences are derived from a common ancestor.
Figure panels: (a) gene duplication and speciation events; (b) orthologs; (c) convergent genes; (d) horizontal gene transfer.

Slide 7: Relation of sequences
- Homologs (同源): similar sequences in two different organisms derived from a common ancestor sequence.
- Orthologs (直系同源): similar sequences in two different organisms that have arisen through a speciation event. Functionality is retained.

Slide 8: Relation of sequences
- Paralogs (旁系同源): similar sequences within a single organism that have arisen through a gene duplication event.
- Xenologs (异源同源): similar sequences that have arisen through horizontal transfer events (symbiosis, viruses, etc.).

Slide 9: Relation of sequences
Image source: /education/blastinfo/orthology.html

Slide 10: Edit distance
Sequence similarity is a function of the edit distance between two sequences.

    P E A R
      | | |
    T E A R

Slide 11: Hamming distance
The number of positions at which two words of equal length differ.
It is calculated by summing the number of mismatches.
The Hamming distance between "pear" and "tear" is 1.

Slide 12: Gapped alignments
Biological sequences:
- have different lengths
- contain regions of insertions and deletions
- require the notion of gaps (denoted by -)

    A L I G N M E N T
      | | |   | | | |
    - L I G A M E N T

Slide 13: Possible residue alignments
- Match
- Mismatch (substitution among species; polymorphism or mutation within species)
- Insertion/deletion (indels, gaps)

Slide 14: Alignments
Which alignment is best?

Alignment 1: ACGGACT against ATCGGATCT, with 5 matches, 1 mismatch and 4 indels.

Alignment 2: 7 matches, no mismatches and 2 indels.

    A T C G G A T C T
    |   | | | |   | |
    A - C G G A - C T

Slide 15: Alignment scoring scheme
Possible scoring scheme: match +2, mismatch -1, indel -2.
Alignment 1: 5 x 2 - 1 x 1 - 4 x 2 = 10 - 1 - 8 = 1
Alignment 2: 7 x 2 - 0 x 1 - 2 x 2 = 14 - 4 = 10

Slide 16: Alignment methods
- Visual (dot matrix analysis)
- Brute force
- Dynamic programming
- Word-based (k-tuples (元组), i.e. common patterns), such as FASTA and BLAST

Slide 17: Visual alignments (dot plots)
Matrix:
- rows: characters in one sequence
- columns: characters in the second sequence
Filling:
- loop through each row; if the characters in the row and the column match, fill in the cell
- continue until all cells have been examined

Slide 18: Example dot plot

Slide 19: Noise in dot plots
For nucleic acids (DNA, RNA), 1 out of 4 bases matches at random.
Dot plots can be filtered for stringency:
- a window size is chosen
- the percentage of bases matching in the window is set as the threshold

Slides 20-21: Reduction of dot plot noise

Slide 22: Human globin vs. human myoglobin

Slide 23: Information inside dot plots
- Regions of similarity: diagonals
- Insertions/deletions
One potential application is to determine the number of coding regions (exons, 外显子) contained within a processed mRNA.

Slide 24: Insertions/deletions

Slide 25: Information inside dot plots
- Repeats and inverted repeats
- Inverted repeats = reverse complement
- Used to determine the folding of RNA molecules

Slide 26: Repeats/inverted repeats

Slide 27: Available dot plot programs
Every lab that does sequence analysis should have at least one dot matrix program available, for example:
- GCG software package:
  compare: http://www.hku.hk/bruhk/gcgdoc/compare.html
  dotplot+: http://www.hku.hk/bruhk/gcgdoc/dotplot.html
- Dotter (http://www.cgr.ki.se/cgr/groups/sonnhammer/dotter.html)
- EMBOSS
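
The dot plot filling rule and the window/threshold filter from the slides above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any of the programs listed; the function names `dot_plot`, `render` and `count_dots`, and the parameters `window` and `min_matches`, are hypothetical. The two test sequences are the ones the deck uses in its dynamic programming example.

```python
def dot_plot(seq_rows, seq_cols, window=1, min_matches=1):
    """Return a list of rows of booleans (True means a dot is drawn).

    Cell [i][j] is True when the window starting at seq_rows[i] and
    seq_cols[j] contains at least min_matches identical positions.
    With window=1 and min_matches=1 this is the plain, unfiltered plot.
    """
    n_rows = len(seq_rows) - window + 1
    n_cols = len(seq_cols) - window + 1
    plot = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            matches = sum(
                seq_rows[i + k] == seq_cols[j + k] for k in range(window)
            )
            row.append(matches >= min_matches)
        plot.append(row)
    return plot


def render(plot):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in plot)


def count_dots(plot):
    """Count the filled cells, a crude measure of noise."""
    return sum(sum(row) for row in plot)


if __name__ == "__main__":
    seq_1, seq_2 = "GAATTCAGTTA", "GGATCGA"
    print(render(dot_plot(seq_1, seq_2)))  # unfiltered
    print()
    print(render(dot_plot(seq_1, seq_2, window=3, min_matches=2)))  # filtered
```

Raising `window` and `min_matches` removes isolated random matches and keeps only the runs that form diagonals, which is the stringency filtering described on slide 19.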

Slide 28: Shortcoming of visual methods
The shortcoming of visual methods is that they do not yield a direct measure of the similarity between two sequences.

Slide 29
To obtain a measure of sequence similarity, dynamic programming can be employed.

Slide 30: Dynamic programming
- Used in computer science
- Solves optimization problems by dividing the problem into smaller sub-problems and reusing their solutions

Slide 31: Dynamic programming
- Scoring scheme for matches, mismatches and gaps
- The highest set of scores defines the optimal alignment between the sequences
- Match score: for DNA, exact match; for amino acids, mutation probabilities

Slide 32: Dynamic programming
Guaranteed to provide the optimal alignment, given:
- two sequences
- a scoring scheme

Slide 33: Steps in dynamic programming
- Initialization
- Matrix fill (scoring)
- Trace back (alignment)

Slide 34: DP example
Sequence #1: GAATTCAGTTA; M = 11
Sequence #2: GGATCGA; N = 7
s(a_i, b_j) = +5 if a_i = b_j (match score)
s(a_i, b_j) = -3 if a_i != b_j (mismatch score)
w = -4 (gap penalty)

Slide 35: View of the DP matrix
M+1 rows, N+1 columns

Slide 36: Global alignment (Needleman-Wunsch)
Attempts to align all residues of the two sequences.
Initialization: the first row and first column are set to
S_{i,0} = w * i
S_{0,j} = w * j

Slide 37: Initialized matrix (Needleman-Wunsch)

Slide 38: Matrix fill (global alignment)
S_{i,j} = maximum of:
- S_{i-1,j-1} + s(a_i, b_j)   (match/mismatch in the diagonal)
- S_{i,j-1} + w               (gap in sequence #1)
- S_{i-1,j} + w               (gap in sequence #2)

Slides 39-40: Matrix fill (global alignment)
S_{1,1} = max{S_{0,0} + 5, S_{1,0} - 4, S_{0,1} - 4} = max{5, -8, -8} = 5

Slide 41: Matrix fill (global alignment)
S_{1,2} = max{S_{0,1} - 3, S_{1,1} - 4, S_{0,2} - 4} = max{-4 - 3, 5 - 4, -8 - 4} = max{-7, 1, -12} = 1

Slide 42: Matrix fill (global alignment)

Slide 43: Filled matrix (global alignment)

Slide 44: Trace back (global alignment)
The maximum global alignment score is 11 (the value in the lower right-hand cell).
Traceback begins in position S_{M,N}, i.e. the position where both sequences are globally aligned.
At each cell, we look to see where we move next according to the pointers.

Slide 45: Trace back (global alignment)

Slide 46: Global trace back

    G A A T T C A G T T A
    |   |   | |   |     |
    G G A - T C - G - - A

Slide 47: Checking alignment score

    G  A  A  T  T  C  A  G  T  T  A
    G  G  A  -  T  C  -  G  -  -  A
    +  -  +  -  +  +  -  +  -  -  +
    5  3  5  4  5  5  4  5  4  4  5

5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11
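
The three steps of the global alignment (initialization, matrix fill, trace back) can be sketched as follows. This is an illustrative Python version under the slide's scoring scheme (+5 match, -3 mismatch, -4 gap), not the code behind the slides; the names `needleman_wunsch` and `alignment_score` are hypothetical, and the matrix here is indexed with rows over the first argument. When several optimal paths exist, the traceback returns only one of them, so the gap placement may differ from the slide while the score stays the same.

```python
def needleman_wunsch(a, b, match=5, mismatch=-3, gap=-4):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]

    # Initialization: first column and first row hold cumulative gap penalties.
    for i in range(1, rows):
        S[i][0] = gap * i
    for j in range(1, cols):
        S[0][j] = gap * j

    # Matrix fill: best of diagonal (match/mismatch), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Trace back from the lower right-hand cell to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return S[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


def alignment_score(top, bottom, match=5, mismatch=-3, gap=-4):
    """Re-score an alignment column by column, as in the checking slide."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
    print(score)
    print(top)
    print(bottom)
```

Re-scoring the returned alignment with `alignment_score` mirrors the column-by-column check on slide 47.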

Slides 48-49: Local alignment
Smith-Waterman: obtain the highest-scoring local match between two sequences.
Requires 2 modifications:
- negative scores for mismatches
- when a value in the score matrix becomes negative, reset it to zero (beginning of a new alignment)

Slide 50: Local alignment initialization
Values in row 0 and column 0 are set to 0.

Slide 51: Matrix fill (local alignment)
S_{i,j} = maximum of:
- S_{i-1,j-1} + s(a_i, b_j)   (match/mismatch in the diagonal)
- S_{i,j-1} + w               (gap in sequence #1)
- S_{i-1,j} + w               (gap in sequence #2)
- 0

Slide 52: Matrix fill (local alignment)
S_{1,1} = max{S_{0,0} + 5, S_{1,0} - 4, S_{0,1} - 4, 0} = max{5, -4, -4, 0} = 5

Slide 53: Matrix fill (local alignment)
S_{1,2} = max{S_{0,1} - 3, S_{1,1} - 4, S_{0,2} - 4, 0} = max{0 - 3, 5 - 4, 0 - 4, 0} = max{-3, 1, -4, 0} = 1

Slide 54: Matrix fill (local alignment)
S_{1,3} = max{S_{0,2} - 3, S_{1,2} - 4, S_{0,3} - 4, 0} = max{0 - 3, 1 - 4, 0 - 4, 0} = max{-3, -3, -4, 0} = 0

Slide 55: Filled matrix (local alignment)

Slide 56: Trace back (local alignment)
The maximum local alignment score for the two sequences is 14.
It is found by locating the highest values in the score matrix.
14 is found in two separate cells, implying alternative alignments that produce the maximal alignment score.

Slide 57: Trace back (local alignment)
Traceback begins in the position with the highest value.
At each cell, we look to see where we move next according to the pointers.
When a cell is reached where there is no pointer to a previous cell, we have reached the beginning of the alignment.

Slides 58-60: Trace back (local alignment)

Slide 61: Maximum local alignment

    G  A  A  T  T  C  -  A
    |     |  |     |     |
    G  G  A  T  -  C  G  A
    +  -  +  +  -  +  -  +
    5  3  5  5  4  5  4  5

    G  A  A  T  T  C  -  A
    |     |     |  |     |
    G  G  A  -  T  C  G  A
    +  -  +  -  +  +  -  +
    5  3  5  4  5  5  4  5

Slide 62: Drawbacks to DP approaches
- Compute intensive
- Memory intensive
- O(n^2) space, between O(n^2) and O(n^3) time

Slide 63: Incorporation of scoring matrices

Slide 64
Certain amino acid substitutions commonly occur in related proteins from different species. Since the proteins in all of the species are functional, the substitutions maintain protein structure and function. Often the substitutions result in a chemically similar amino acid. Other substitutions are relatively rare.

Slide 65: Scoring matrices
A simple match/mismatch score:
- is not bad for similar sequences
- does not reveal distantly related sequences
Thus we need a likelihood matrix, which:
- scores residues according to how likely the substitution is to be found in nature
- is more applicable to amino acid sequences

Slide 66: Percent accepted mutation (PAM or Dayhoff) matrices
Changes observed in closely related proteins represent amino acid substitutions that do not significantly change the function of the protein. Hence they are called "accepted mutations", defined as amino acid changes "accepted" by natural selection.

Slide 67: Percent accepted mutation (PAM or Dayhoff) matrices
Studied by Margaret Dayhoff.

Slide 68: Percent accepted mutation (PAM or Dayhoff) matrices
Amino acid substitutions:
- alignment of common protein sequences
- 1572 amino acid substitutions
- 71 groups of proteins, 85% similar

Slide 69: Percent accepted mutation (PAM or Dayhoff) matrices
- Similar sequences were organized into phylogenetic trees.
- The number of changes of each amino acid into every other amino acid was counted.
- Relative mutabilities were obtained by counting the number of changes of each amino acid and dividing by a normalization factor.

Slide 70: Percent accepted mutation (PAM or Dayhoff) matrices
The factor, which measures the exposure to mutation of the amino acid, is defined as
f = f_i * n
where f_i is the frequency of occurrence of the amino acid in that group, and n is the total number of all amino acid changes that occurred in that group per 100 sites.

Slide 71: Percent accepted mutation (PAM or Dayhoff) matrices
This normalized the data for variations in amino acid composition, mutation rate and sequence length. The normalized frequencies were then averaged over the sequence groups.

Slide 72: Percent accepted mutation (PAM or Dayhoff) matrices
The amino acid exchange counts and mutability values were used to generate a 20 x 20 mutation probability matrix representing all possible amino acid changes. A detailed example of calculating the PAM matrix is located in Mount, p. 82 (Table 3.2).

Slide 73: PAM1 matrix (see pp. 80-81)
Normalized probabilities multiplied by 10000.

      Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
        A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
A    9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
R       1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
N       4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
D       6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
C       1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Q       3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
E      10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
G      21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
H       1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
I       2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
L       3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
K       2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
M       1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
F       1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
P      13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
S      28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5    2    2
T      22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0    2    9
W       0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0 9976    1    0
Y       1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2 9945    1
V      13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0    2 9901

Slide 74: Percent accepted mutation (PAM or Dayhoff) matrices
By these scores, Asn, Ser, Asp and Glu were the most mutable amino acids, and Cys and Trp were the least mutable.

Slide 75: PAM1 matrix
- This is a Markov chain matrix.
- The summation along each column = 1.0.
- An asymmetric matrix is obtained (why?).

Slide 76: Percent accepted mutation (PAM or Dayhoff) matrices
The PAM1 matrix can be multiplied by itself N times to give transition matrices for sequences that have undergone N mutations (why?) (Markov chain model).

Slide 77: Percent accepted mutation (PAM or Dayhoff) matrices
PAM1: 1 accepted mutation event per 100 amino acids; PAM250: 250 mutation events per 100 amino acids. Many of them are multiple substitutions.
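
The Markov chain extrapolation on slides 75-77 can be illustrated with a toy example. The sketch below uses a made-up 3-state matrix (`TOY_PAM1`, hypothetical values, not Dayhoff's data) that follows the slide's convention that each column sums to 1.0, and the helper names `mat_mul`, `mat_pow` and `column_sums` are assumptions for illustration only. Multiplying the one-step matrix by itself n times gives the n-step transition matrix, which is how a PAM250-style matrix is derived from PAM1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, n):
    """Multiply m by itself n times, starting from the identity matrix."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


def column_sums(m):
    """Sum each column; for a Markov matrix in this convention each is 1.0."""
    return [sum(row[j] for row in m) for j in range(len(m))]


# Hypothetical one-step matrix for three residue types.
# Entry [i][j] is the probability that residue j is replaced by residue i,
# so every column sums to 1.0 and the matrix need not be symmetric.
TOY_PAM1 = [
    [0.98, 0.01, 0.03],
    [0.01, 0.97, 0.02],
    [0.01, 0.02, 0.95],
]

if __name__ == "__main__":
    toy_250 = mat_pow(TOY_PAM1, 250)
    for row in toy_250:
        print(["%.3f" % value for value in row])
    print(column_sums(toy_250))
```

After many steps the diagonal entries drop well below their one-step values while each column still sums to 1, reflecting that most sites have changed and many have changed more than once.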

Slide 78: Percent accepted mutation (PAM or Dayhoff) matrices
PAM250: 20% similar; PAM120: 40%; PAM80: 50%; PAM60: 60%.
The PAM250 matrix provides a better-scoring alignment than lower-numbered PAM matrices for distantly related proteins of 14-27% similarity.

Slide 79: Log odds matrices
PAM matrices are converted to log-odds matrices:
- calculate the odds ratio for each substitution
- take the scores in the previous matrix (PAM250)
- divide by the frequency of the amino acid

Slide 80: Log odds matrices
- Take the average of the log odds ratios for converting A to B and converting B to A.
- Result: a symmetric matrix.
- Convert the ratio to log10 and multiply by 10.
Example: Mount, p. 83.

Slide 81: PAM250 log odds matrix

Slide 82: Blocks amino acid substitution matrices (BLOSUM)
- A larger set of sequences is considered.
- Sequences are organized into signature blocks.
- A consensus sequence is formed.
- 60% identical: BLOSUM60
- 80% identical: BLOSUM80

Slide 83
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix.

Slide 84: Equivalent PAM and BLOSUM matrices
The following matrices are roughly equivalent:
- PAM100 = BLOSUM90
- PAM120 = BLOSUM80
- PAM160 = BLOSUM60
- PAM200 = BLOSUM52
- PAM250 = BLOSUM45

Slide 85: Differences between PAM and BLOSUM
1. PAM matrices are based on an explicit evolutionary model (that is, replacements are counted on the branches of a phylogenetic tree), whereas the BLOSUM matrices are based on an implicit rather than explicit model of evolution.

Slide 86: Differences between PAM and BLOSUM
2. The method used to count the replacements is different. Unlike the PAM matrix, the BLOSUM procedure uses groups of sequences within which not all mutations are counted the same.

Slide 87: Differences between PAM and BLOSUM
3. The sequence variability in the alignments used to count replacements differs. The PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. The BLOSUM matrices are based only on highly conserved regions in series of a
