生物资讯相关演算法.ppt_第1页
生物资讯相关演算法.ppt_第2页
生物资讯相关演算法.ppt_第3页
生物资讯相关演算法.ppt_第4页
生物资讯相关演算法.ppt_第5页
已阅读5页,还剩49页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、生物資訊相關演算法Algorithms in Bioinformatics,呂學一 (中央研究院 資訊科學所) .tw/hil/,2002/10/30,Bioinformatics Algorithms,2,Today 如虎添翼,An fundamental query that significantly strengthens suffix tree Range Minima Query (RMQ) 前翼: RMQ for sequences. 後翼: RMQ for general sequences.,2002/10/30,Bioinfo

2、rmatics Algorithms,3,Hierarchy,+/-RMQ,LCA,RMQ,LCE,Document listing,Wildcard matching Fuzzy matching,2002/10/30,Bioinformatics Algorithms,4,RMQ: Range Minima Query,S: a sequence of numbers. 小(S, i, j) = k if i k j, and Sk = min(Si, Si+1, , Si).,123456789 S = 340141932 小(S, 2, 6) = 3 小(S, 4, 10) = 4 (

3、or 6).,2002/10/30,Bioinformatics Algorithms,5,The RMQ challenge,Input: a sequence S of numbers Output: a data structure D for S Time complexity Constant query time Each query 小(S, i, j) for S can be answered from D and S in O(1) time. Linear preprocessing time D can be computed in O(|S|) time.,2002/

4、10/30,Bioinformatics Algorithms,6,Nave approach,Storing the answer of 小(S, i, j) in a table for all index pairs i and j with 1 i j |S|. Query time = O(1). Preprocessing time = (|S|2).,2002/10/30,Bioinformatics Algorithms,7,Faster Preprocessing,Assumption (without loss of generality) |S| = 2k for som

5、e positive integer k. Idea: Precomputing the values of 小(S, i, j) only for those indices i and j with j i + 1 = 1, 2, 4, 8, , 2k = |S|. Preprocessing time O(|S| log |S|).,2002/10/30,Bioinformatics Algorithms,8,小(S, i, j) still in O(1) time,Let k be the (unique) integer that satisfies 2k j i + 1 2k+1

6、. Then, 小(S, i, j) is x = 小(S, i, i + 2k 1) or y = 小(S, j 2k + 1, j).,i,j,i + 2k 1,j 2k + 1,前翼,The RMQ Challenge for sequeneces,2002/10/30,Bioinformatics Algorithms,10,sequeneces,S is a sequence if Si Si 1 = 1 for each index i with 2 i |S|. For example, S = 5 6 5 4 3 2 3 2 3 4 5 6 5 6 7 + - - - - +

7、- + + + + - + + S = 3 4 3 2 1 0 -1 -2 -1 0 1 2 1 + - - - - - - + + + + -,2002/10/30,Bioinformatics Algorithms,11,前翼: The RMQ Challenge for sequeneces,Input: a sequence S of numbers Output: a data structure D for S Time complexity Constant query time Each query 小(S, i, j) for S can be answered from D

8、 and S in O(1) time. Linear preprocessing time D can be computed in O(|S|) time under the unit-cost RAM model.,2002/10/30,Bioinformatics Algorithms,12,Unit-Cost RAM model,Operations such as add, minus, comparison on consecutive O(log n) bits can be performed in O(1) time.,2002/10/30,Bioinformatics A

9、lgorithms,13,Idea: compression,Breaking S into blocks of length L = log |S|. There are B = 2|S|/log |S| blocks. Let 縮t be the minimum of the i-th block of S. 縮t = min Sj | j = (t 1) L j tL for t = 1, 2, , B. Computable in O(|S|) time. RMQ on 縮: 小(縮, x, y) O(1) query time. O(|S|) preprocessing time.

10、(Why?),Any constant c 1 is OK.,2002/10/30,Bioinformatics Algorithms,14,小(S, i, j) via 小(縮, s, t),Suppose Si is in the -th block of S. (1) Li L. Suppose Sj is in the -th block of S. (1) L j L. = 小(縮,+1,-1).,小(S, i, j) is one of 小(S, i, L) 小(S, (1)L +1, j) 小(S, (-1)L+1, L) Note that each of these thre

11、e is a query within a length-L block.,2002/10/30,Bioinformatics Algorithms,15,小(S, i, j) within a block,It remains to show how to answer 小(S, i, j) in O(1) time for any indices i and j such that (t1)L i j tL for some positive integer t with the help of some linear time preprocessing.,2002/10/30,Bioi

12、nformatics Algorithms,16,Difference sequence,The difference sequence 差 of S is defined as follows: 差i = Si+1 Si. Since S is a sequence, each 差i = 1. 小(S, i, j) can be determined from 差ij. The number of distinct patterns of a length-L difference sequence is exactly 2L = |S|.,2002/10/30,Bioinformatics

13、 Algorithms,17,Preprocessing all patterns,o(|S|) time. #row = |S| #col = log2 |S| Each entry is computable in O(log |S|) time. Answering each 小(S, i, j) takes O(1) time.,LCA: Lowest Common Ancestor,An application of RMQ for sequences,2002/10/30,Bioinformatics Algorithms,19,Lowest Common Ancestor,T i

14、s a rooted tree. 祖(x, y) is the lowest (i.e., deepest) node of T that is an ancestor of both node x and node y.,2002/10/30,Bioinformatics Algorithms,20,For example, ,1,2,3,5,6,4,7,祖(3,6),祖(5,7),2002/10/30,Bioinformatics Algorithms,21,The challenge for 祖(x, y),Input: an n-node rooted tree T. Output:

15、a data structure D for T. Requirement: D can be computed in O(n) time. Each query 祖(x, y) for T can be answered from D in O(1) time.,2002/10/30,Bioinformatics Algorithms,22,Idea: depth-first traversal,1234567890123 V=1232454642171 L=1232343432121 If Vi=x and Vj=y, then 祖(x, y)=V小(L, i, j),1,2,3,5,6,

16、4,7,2002/10/30,Bioinformatics Algorithms,23,Idea: depth-first traversal,1234567890123 V=1232454642171 L=1232343432121 1 2 3 4 5 6 7 I=1,2,3,5,6,8,12 祖(x, y)=V小(L, I(x), I(y).,O(n)-time Preprocessing Computing V and L Preprocessing L for queries 小(L, i, j). Precomputing an array I such that VIx = x f

17、or each node x.,2002/10/30,Bioinformatics Algorithms,24,Idea: depth-first traversal,1234567890123 V=1232454642171 L=1232343432121 1 2 3 4 5 6 7 I=1,2,3,5,6,8,12 祖(x, y)=V小(L, I(x), I(y).,Query time is clearly O(1).,1,2,3,5,6,4,7,LCE: Longest Common Extension,An application of LCA queries 祖(i, j).,20

18、02/10/30,Bioinformatics Algorithms,26,Longest Common Extension,Suppose A and B are two strings. Let 延(i, j) be the largest number d + 1 such that Aii+d = Bjj+d.,2002/10/30,Bioinformatics Algorithms,27,The challenge for 延(i, j),Input: two strings A and B. Objective: output a data structure D for A an

19、d B in O(|A|+|B|) time such that each query 延(i, j) can be answered from D in O(1) time.,2002/10/30,Bioinformatics Algorithms,28,Idea: Suffix Tree for A#B$,x is the i-th leaf y is the (j+|A|+1)-st leaf. The depth of 祖(x, y) is exactly 延(i, j).,x,y,祖(x, y),A-suffix,B-suffix,Wildcard Matching,An appli

20、cation of longest common extension 延(i, j),2002/10/30,Bioinformatics Algorithms,30,Wildcard Matching,Input: two strings P and S, where P has k wildcard characters ?, each could match any character of S. Output: all occurrences of P in S.,2002/10/30,Bioinformatics Algorithms,31,Nave algorithm,Suppose

21、 S has t distinct characters. Nave algorithm: Construct the suffix tree of S; For each of tk possibilities of P do Output the occurrences of P in S; Time complexity = (|S|+tk|P|).,2002/10/30,Bioinformatics Algorithms,32,Wildcard Matching via longest common extension,Suppose j1 j2 jk are the indices

22、such that Pj1 = Pj2 = = Pjk = ?. P matches Sii+|P|1 if and only if 延(i, 1) j1 1; 延(i+ j1, j1+1) j2 j1 1; 延(i+ j2, j2+1) j3 j2 1; 延(i+ jk-1, jk-1+1) jk jk-1 1; and 延(i+ jk, jk+1) |P| jk + 1.,1,j1,j2,jk,|P|,i,P,S,2002/10/30,Bioinformatics Algorithms,33,O(k|S|) time,O(|P|+|S|) = O(|S|) time: preprocess

23、ing for supporting each 延(i, j) query in O(1) time. O(|S|) iterations, each takes time O(k).,Fuzzy Matching,Another application of longest common extension 延(i, j).,2002/10/30,Bioinformatics Algorithms,35,Fuzzy Matching,Input: an integer k and two strings P and S Output: all “fussy occurrences” of P

24、 in S, where each “fussy occurrence” allows at most k mismatched characters.,2002/10/30,Bioinformatics Algorithms,36,Fuzzy occurrences,Whether P occurs in Sii+|P|-1 with k or fewer errors can be determined by j = 延(i, 1); error = 0; while (j k) then return “no”; j += 1 + 延(i + j + 1, j + 2); return

25、“yes”.,2002/10/30,Bioinformatics Algorithms,37,O(k|S|) time,O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1) time. O(|S|) iterations, each takes time O(k).,後翼: The RMQ (i.e., 小(S, i, j) challenge for general sequences,Another application of lowest common ancestor,200

26、2/10/30,Bioinformatics Algorithms,39,The RMQ challenge,Input: a sequence S of numbers Output: a data structure D for S Time complexity Constant query time Each query 小(S, i, j) for S can be answered from D and S in O(1) time. Linear preprocessing time D can be computed in O(|S|) time.,2002/10/30,Bio

27、informatics Algorithms,40,Idea: Minima Tree,123456789 S=432417363 小(S,i,j)=祖(i,j).,5,8,6,4,9,2,1,2002/10/30,Bioinformatics Algorithms,41,Homework 1 (Optional),Show how to construct a minima tree for any sequence S of numbers in O(|S|) time. Due in two weeks. Email your typesetted solutions to TAs, o

28、r hand in your hand-written solutions to me before the end of next lecture (in two weeks.),Listing source strings that contains a pattern string Muthukrishnan, SODA02,An application of RMQ for general sequences,2002/10/30,Bioinformatics Algorithms,43,The problem,Input: Strings S1, S2, , Sm, which ca

29、n be preprocessed in linear time. A string P. Output: The index j of each Sj that contains P.,2002/10/30,Bioinformatics Algorithms,44,Preliminary attempts,Obtaining the suffix tree for S1#S2#Sm$. Find all occurrences of P. I.e., exact string matching for S1#S2#Sm$ and P. Time = O(|P| + total number

30、of occurences of P). Obtaining the suffix tree for each Si. Determining whether P occurs in Si. I.e., substring problem for each pair Si and P. Time = O(|P|m).,2002/10/30,Bioinformatics Algorithms,45,The challenge,Input: Strings S1, S2, , Sm, which can be preprocessed in linear time. A string P. Out

31、put: The index j of each Sj that contains P. Objective O(|P| + 現(P) time, where 現(P) is the number of output indices.,2002/10/30,Bioinformatics Algorithms,46,The second attempt,Constructing the suffix tree for S1#S2#Sm$. Keeping the distinct descendant leaf colors for each internal node. Query time?

32、 Preprocessing time?,2002/10/30,Bioinformatics Algorithms,47,The second attempt,Each query takes O(|P|+現(P) time. (Why?) The preprocessing may need (m| S1#S2#Sm$|) time. (Why?) Q: Any suggestions for resolving this problem?,2002/10/30,Bioinformatics Algorithms,48,Keeping the list 彩 of leaf colors from left to right. Each internal keeps t

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论