第三章序列比对_第1页
第三章序列比对_第2页
第三章序列比对_第3页
第三章序列比对_第4页
第三章序列比对_第5页
已阅读5页,还剩79页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、第三章序列比对Sequence AlignmentContentsl I. Introductionl II. Pair-wise alignment via dynamic programmingl III. Multiple alignmentl IV. Genome assembling3.1l 序列比较的根本任务是: 发现序列之间的相似性 辨别序列之间的差异l 目的:相似序列 相似的结构,相似的功能判别序列之间的同源性推列之间的进化关系序列相似性 同源(homology)- 具有共同的祖先 直向同源(Orthologous ) 共生同源(paralogous ) 相似(similari

2、ty)同源序列一般是相似的 相似序列不一定是同源的 进化趋同(同功能)lll水平转移进化趋同直向同源(a1 in species I, a1 in species II)共生同源(a1 and a2 in species I)3.1l AlignmentA mutual arrangement of two or more sequences, it exhibits where the two or more sequences are similar, and where they differ.L G P S S K Q T G K G S - S R I W D NGlobal| L|

3、N - I T K S A G K G A I M R L G D Al Alignment T G K G Local| A G K G two sequencesl Alignmentmultiple sequencesl Databases searchl Phylogenetic analysisAlignmentN-YLS NKYLS N-F-S N-FLSNYLSNKYLSNFS-LYNFLS+KFNYLSWhy do we care about sequence alignment?It can tell us something about the evolution of o

4、rganisms.We can see which regions of a gene (or its derived protein) are susceptible to mutation and which can have one residue replaced by another without changing function.Homologous genes (genes with share evolutionary origin) have similar sequences.Orthologs are genes that are evolutionarily rel

5、ated, have a similar function, but now appear in different species.Paralogs are evolutionarily related (share an origin) but no longer have the same function.You can uncover either orthologs or paralogs through sequence alignment.llllll双序列比对 Dot matrix plot3.2(Gibbs and McIntyre,1970)例一:s: SSENTIALS

6、OFSEQUENCEANALYSISt:SSENTIALSOFSEQUENCEANALYSIS例2:s: ESSENTIALSOFSEQUENCEANALYSISt:ESSENTIALANALYSIS例3:s: ESSENTIALSOFSEQUENCEANALYSISt:APRIMERONHOWTOANALYZESEQUENCESANALYSI SSEQUENCEAANALYZESSEQUENCESWindow filtering techniquel s:15l t:15J. Bacteriol., 2003, 185:1018-1026.3.3 Pair-wise alignment vi

7、a DP(1).(2).(3).(4).(5).Edit Operations Optimal AlignmentDynamic ProgrammingScoring Scheme Exact Algorithms Global: N-W AlgorithmLocal:S-W Algorithm(6).Heuristic Methods Fasta: FAST AlignmentBlast:Basic Local Alignment Search ToolPairwise AlignmentSequence s: CTTAACTSequencet: CGGATCATAn alignment o

8、f s and t:MismatchMatchTT TDeletion gapInsertion gapAC-TA CAC CG-GAT(1). Edit operationss:t:AGCACACAACACACTAone-character edit operations that turn s into t :(a,a) (a,-)(a,b)(-,b)denotes a match (no change from s to t), denotes deletion of character a (in s),denotes replacement of a (in s) by b (in

9、t), where ab,denotes insertion of character b (in s).Since the problem is symmetric in s and t, a deletion ins can be seen as an insertion in t, and vice versa.An alignment of two sequences s and t is an arrangement of s and t by position, where s and t can be padded with gap symbols toachieve the s

10、ame length:s: AGCACACAAGCACACAort:ACACACTAACACACTAIf we read the alignment column-wise, we have a protocol of edit operations that lead from s to t.S:T:AGCACACAA CACACTAAGCACACAACACACTALeft:Match(A,A)Right : Match(A,A)Delete (G,-)Replace(G,C)Match Match Match Match Match InsertMatch(C,C)(A,A)(C,C)(A

11、,A)(C,C) (-,T) (A,A)Insert Match MatchMatch(-,A) (C,C)(A,A)(C,C)Replace(A,T)DeleteMatch(C,-)(A,A)(2). Optimal alignment-Important notion for sequence analysisThe total cost of an alignment of two sequences s and t is the sum of the costs of all the edit operations that lead from s to t .l An optimal

12、 alignment of s and t is an alignment which has minimal total cost among all possible alignments.Using the unit cost min our previous example,we obtain the following cost:s: AGCACACAAG CACACA ACACACT Acost : 4t:A CACACTAcost : 2orHere it is easily seen that the left-hand assignmentis optimal under t

13、he unit cost m the edit distance., and henceAt this point, you should improve your understanding by doing some exercises.Similarity scorel Similarity scorematch: w(a,a)=8mismatch: w(a,b)= -5for abeach gap symbol: w(a,-)=w (-, b) = -3(3). Dynamic Programming(R. Bellman 1950)” Dynamic Programming” is

14、a very general optimal algorithm. It is applicable when a large search space can be structured into a succession of stages, such that:The initial stage contains trivial solutions to sub-problems;Each partial solution in a later stage can be calculated by recurring on only a fixed number of partial s

15、olutions in an earlier stage;The final stage contains the overall solution.llll(3). Dynamic Programming (the primary principle)l Best alignment at (i, j) is the best alignment previous to (i, j) plus aligning these two.l Repeat the process until reaching the two sequences end.(3). Dynamic Programmin

16、g (a gamepicking money)CGGATCATCodeAge 305oA:C:G:T:Adult Child3336C T C A A CTGrandpa60Teenager 1036336Money in each cell612max :36$m=max / (old: young) Money on line 5 $Queue s: CGGATCAT Queue: t: CTCAACTf(3). Dynamic Programming (pathways)CGGATCATou Rule: only forward not backward; along line or c

17、ross diagonallyC T C A A CTu Caution: localumsand global optimal;u Compute before you walk;u How many pathways (mn) :(m = 7)(n = 8)n+m 2iN = 65280i=nu How many calculations:CN = 2i i = 915968i=nn+mf363361236363361236181236CalculationLet si,j the most money one can get at position (i,j)The money at o

18、rigin is: s0,0=0 The money on top line and left edge is:s0,j=j*5 (j=1.n)si,0=i*5 (i=1.m)CGGATCATljo10$CT Clli15A A CTsi,jcan be computed as follows:ls+ $line (=5$)i-1, j= max si, j -1 + $line (=5$)si, js+ $ diagi-1, j -1The most money one can get until f and the corresponding path lead to it are wha

19、t we want.l?$ f3633636$1236$363361236181236Computing Si,jjj-1i-1iSm,n$dia g$lin eSi,j$lineInitializationsCGGATCAT0510152025303540C TC36336512361015AA2025CT3633630123618123635Computing S1,10 C 1G 2 G 3 A 4 T 5 C 6 A7 T80C0510152025303540363365361T2C123610315A204A255C 6 T 73633630123618123635Computing

20、 S1,2 , S2,1 , S2,2C 1G 2 G 3 A 4 T 5 C 6 A7 T80C5101520253035403633636411T123641462C 3 A4A5C363366T71236181236Computing S3,5=?0 C 1G 2 G 3 A 4 T 5 C 6 A7 T80C051015202530354036336536411T12361041462C 3 A4A1520255C36336306T7123618123635Computing S3,5=?0 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530354036

21、336536414651561T12361041465158872C 3 A4A1520255C36336306T7123618123635Answer S3,5=920 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530354036336536414651561T2C 3 A4A51236104146515887152025C 6 T 73633630123618123635All computed0 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530354036336536414651566166711T12361041465

22、1588792972C 3 A4A102155C366T71236181236667992125156174205161Select the dot withum Money0 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530354036336536414651566166711T123610414651588792972C 3 A4A102155C366T71236181236667992125156174205161Trace back for the best path0 C 1G 2 G 3 A 4 T5 C 6 A7 T80C051015202530

23、354036336536414651566166711T123610414651588792972C 3 A4A102155C366T71236181236667992125156174205161(3). Dynamic Programming (Characteristics)l A progress can be parted to stagesl Forward compute trace back for optimall Reduce computation by recursionNP O(2m+nT ) P O(m nT )ComplexitySpace: O(mn)Time:

24、 O(mn)l Filling the matrix O(mn)l Back trace O(m+n)(3). Dynamic Programming (a pathway an alignment)CGGATCATto3633C T C A A C TsC-TCAACT CGGATCA-Ts:t:36336612fAn optimal alignment of s and t- the pathway of most money!l Let s=a1a2am and t=b1b2bn .l Si,j: the score of an optimal alignment betweena1a2

25、ai and b1b2bjl Initializations:s0,0=0;si,0=i*gap (i=1.m) ; s0,j=j*gap (j=1.n)l Si,j can be computed as follows:si-1, j+ w(ai , -)($line)($line) ($diag)=+ w(-, b )smax si, j -1i, jjs+ w(a , b)i-1, j -1ijComputing Si,jjj-1i-1iSm,nw(ai,bj)w(ai,-)w(-,bj)TracebackStart from the lower right corner and tra

26、ce back to the upper left.Each arrow introduces one character at the end of each aligned sequence.A horizontal move puts a gap in the left sequence.A vertical move puts a gap in the top sequence.A diagonal move uses one character from each sequence.lllllA simple scoring schemel Match: +8(w(x, y) = 8

27、, if x = y)l Mismatch: -5 (w(x, y) = -5, if x y)l Each gap symbol: -3 (w(-,x)=w(x,-)=-3)C C+8- G-3- G-3- A-3T T+8T C-5A A+8A-3C-3T T+8 =Alignment score+12Now try this example after classSequence A: CAATTGA Sequence B: GAATCTGCTheir optimal alignment ?GAATCTGCC A A T T GA0-3-6-9-12-15-18-21-24-3-5-8-

28、11-14-4-7-10-13-6-830-3-6-9-12-15-9-11011852-1-4-12-14-38191613107-15-11-651614242118-18-7-921311213229-21-101-1108182927C A A T - T G A G A A T C T G C-5 +8 +8+8 -3+8 +8-5 = 27Global Alignment vs. Local Alignmentl global alignmentl local alignmentAn optimal local alignmentl Si,j: the score of an op

29、timal local alignment ending at ai and bjl With proper initializations, Si,j can be computed as follows.0s+ w(a,-)i -1,ji=+ w(-,bj )si ,jmaxsi ,j -1s+ w(a,b )i -1,jijlocal alignmentCGGATCATMatch: 8Mismatch: -5Gap symbol: -3C T T A A CT000000000085200852053008531302000852110000853?000local alignmentM

30、atch: 8CGGATCATMismatch: -5Gap symbol: -3C T T A A CTThe best score00000000008520085205300853130200085211000085311808525313107053021310818A A TC C- AT T8-3+8-3+8=18CGGATCATC T T A A CTThe best score00000000008520085205300853130200085211000085311808525313107053021310818local alignmentMatch: 8Mismatch

31、: -5Gap symbol: -3C T T A A CTCGGATCATThe best score00000000008520085205300853130200085211000085311808525313107053021310818Now try this example in classSequence A: CAATTGA Sequence B: GAATCTGCTheir optimal local alignment?Did you get it right?GAATCTGCC A A T T GA0000000000000085280088553050085132421

32、217516131513233432A AA AT T CT TG G8+8+8-3+8+8=G37AATCTGCC A A T T GA0000000000000085280088553050085132421217516131513233432(4). Score Schemel For DNA: match, mismatchl For proteins: different amino acid pairs receive different scoreSubstitution MatrixTwo ways: chemically, statisticallyPAM(Dayhoff)BLUSOM (Henikoff & Henikoff )l Gap penaltyConstant gap penalties Affine gap penaltiesNucleic acid PAM matricesl PAM = percent accepted mutationl 1 PAM = 1% probability of mutation at each sequence position.l A uniform PAM1 matrixAGTCA0.990.003330.003330.00333G0.003330.990.003330

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论