




Practical Suffix Tree Construction

Sandeep Tata, Richard A. Hankins, Jignesh M. Patel
University of Michigan

Abstract

Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.

1 Introduction

Querying large string datasets is becoming increasingly important in a number of emerging text and life sciences applications.
Life science researchers are often interested in explorative querying of large biological sequence databases, such as genomes and large sets of protein sequences. Many of these biological datasets are growing at exponential rates; for example, the sizes of the sequence datasets in GenBank have been doubling every sixteen months [31]. Consequently, methods for efficiently querying large string datasets are critical to the success of these emerging database applications.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

Suffix trees are versatile data structures that can help execute such queries very efficiently. In fact, suffix trees are useful for solving a wide variety of string-based problems [17]. For instance, the exact substring matching problem can be solved in time proportional to the length of the query, once the suffix tree is built on the database string. Suffix trees can also be used to solve approximate string matching problems efficiently. Some bioinformatics applications such as MUMmer [10, 11, 22], REPuter [23], and OASIS [25] exploit suffix trees to efficiently evaluate queries on biological sequence datasets. However, suffix trees are not widely used because of their high cost of construction. As we show in this paper, building a suffix tree on moderately sized datasets, such as a single chromosome of the human genome, takes over 1.5 hours with the best known existing disk-based construction technique [18].
In contrast, the techniques that we develop in this paper reduce the construction time by a factor of 5 on inputs of the same size.

Even though suffix trees are currently not in widespread use, there is a rich history of algorithms for constructing suffix trees. A large focus of previous research has been on linear-time suffix tree construction algorithms [24, 32, 33]. These algorithms are well suited for small input strings where the tree can be constructed entirely in main memory. The growing size of input datasets, however, requires that we construct suffix trees efficiently on disk. The algorithms proposed in [24, 32, 33] cannot be used for disk-based construction as they have poor locality of reference. This poor locality causes a large amount of random disk I/O once the data structures no longer fit in main memory. If we naively use these main-memory algorithms for on-disk suffix tree construction, the process may take well over a day for a single human chromosome.

The large (and rapidly growing) size of many string datasets underscores the need for fast disk-based suffix tree construction algorithms. A few recent research efforts have also considered this problem [4, 18], though neither of these approaches scales well for large datasets (such as a large chromosome, or an entire eukaryotic genome).

In this paper, we present a new approach to efficiently construct suffix trees on disk. We use a philosophy similar to the one in [18]. We forgo the use of suffix links in return for a much better memory reference pattern, which translates to better scalability and performance for large trees.

The main contributions of this paper are as follows:

1. We introduce the "Top Down Disk-based" (TDD) approach to building suffix trees efficiently for a wide range of sizes and input types. This technique includes a suffix tree construction algorithm called PWOTD, and a sophisticated buffer management strategy.

2. We compare the performance of TDD with the popular Ukkonen's algorithm [32] for the in-memory case, where all the data structures needed for building the suffix trees are memory resident (i.e., the datasets are "small"). Interestingly, we show that even though Ukkonen has a better worst-case theoretical complexity, TDD outperforms Ukkonen on modern cached processors, since TDD incurs significantly fewer processor cache misses.

3. We systematically explore the space of data sizes and types, and highlight the advantages and disadvantages of TDD with respect to other construction algorithms.

4. We experimentally demonstrate that TDD scales gracefully with increasing input size. Using the TDD process, we are able to construct a suffix tree on the entire human genome in 30 hours (on a single processor machine)! To our knowledge, suffix tree construction on an input string of this size (approximately 3 billion symbols) has yet to be reported in the literature.

The remainder of this paper is organized as follows: Section 2 discusses related work. The TDD technique is described in Section 3, and we analyze the behavior of this algorithm in Section 4. Section 5 presents the experimental results, and Section 6 presents our conclusions.

2 Related Work

Linear-time algorithms for constructing suffix trees have been described by Weiner [33], McCreight [24], and Ukkonen [32]. Ukkonen's is a popular algorithm because it is easier to implement than the other algorithms. It is an O(n), in-memory construction algorithm based on the clever observation that constructing the suffix tree can be performed by iteratively expanding the leaves of a partially constructed suffix tree. Through the use of suffix links, which provide a mechanism for quickly traversing across sub-trees, the suffix tree can be expanded by simply adding the (i+1)th character to the leaves of the suffix tree built on the previous i characters.
The algorithm thus relies on suffix links to traverse through all of the sub-trees in the main tree, expanding the outer edges for each input character. However, traversals along suffix links have poor locality of reference, since they visit the suffix tree nodes in a random fashion. This leads to poor performance on cached architectures and when used to construct on-disk suffix trees.

Recently, Bedathur et al. developed a buffering strategy, called TOP-Q, which improves the performance of Ukkonen's algorithm (which uses suffix links) when constructing on-disk suffix trees [4]. A different approach was suggested by Hunt et al. [18], where the authors drop the use of suffix links and use an O(n^2) algorithm with a better locality of reference. In one pass over the string, they index all suffixes with the same prefix by inserting them into an on-disk subtree managed by PJama [3], a Java-based object store. Construction of each independent subtree requires a full pass over the string.

Several O(n^2) and O(n log n) algorithms for constructing suffix trees are described in [17]. A top-down approach has been suggested in [1, 14, 16]. In [15], the authors explore the benefits of using a lazy implementation of suffix trees. In this approach, the authors argue that one can avoid paying the full construction cost by constructing the subtree only when it is accessed for the first time. This approach is useful only when a small number of queries are posed against a string dataset. When executing a large number of queries, most of the tree must be materialized, and in this case, this approach will perform poorly.

Previous research has also produced theoretical results on understanding the average sizes of suffix trees [5, 30], and the theoretical complexity of using sorting to build suffix trees for different computational models such as RAM, PRAM, and various other external memory models [12]. Suffix arrays have also been used as an alternative to suffix trees for specific string matching tasks [8, 9, 26].
However, in general, suffix trees are more versatile data structures. The focus of this paper is only on suffix trees.

Our solution uses a simple partitioning strategy. However, a more sophisticated partitioning method has been proposed recently [6], which can complement our existing partitioning method.

3 The TDD Technique

Most suffix tree construction algorithms do not scale due to the prohibitive disk I/O requirements. The high per-character overhead quickly causes the data structures to outgrow main memory, and the poor locality of reference makes efficient buffer management difficult.

We now present a new disk-based construction technique called the "Top-Down Disk-based" technique, hereafter referred to simply as TDD. TDD scales much more gracefully than existing techniques by reducing the main-memory requirements through strategic buffering of the largest data structures. The TDD technique consists of a suffix tree construction algorithm, called PWOTD, and the related buffer management strategy described in the following sections.

3.1 PWOTD Algorithm

The first component of the TDD technique is our suffix tree construction algorithm, called PWOTD (Partition and Write Only Top Down). This algorithm is based on the wotdeager algorithm suggested by Kurtz [15]. We improve on this algorithm by using a partitioning phase which allows one to immediately build larger, independent sub-trees in memory. Before we explain the details of the algorithm, we briefly discuss the representation of the suffix tree.

The suffix tree is represented by a linear array, as in wotdeager. This is a compact representation using an average of 8.5 bytes per symbol indexed. Figure 1 illustrates a suffix tree on the string ATTAGTACA$ and the tree's corresponding array representation in memory. Shaded entries in the array represent leaf nodes, with all other entries representing non-leaf nodes. An R in the lower right-hand corner of an entry denotes a rightmost child. A branching node is represented by two integers.
The first is an index into the input string; the character at that index is the starting character of the incoming edge's label. The length of the label can be deduced by examining the children of the current node. The second entry points to the first child. Note that the leaf nodes do not have a second entry. The leaf node requires only the starting index of the label; the end of the label is the string's terminating character. See [15] for a more detailed explanation.

The PWOTD algorithm consists of two phases. In phase one, we partition the suffixes of the input string into |A|^prefixlen partitions, where |A| is the alphabet size of the string and prefixlen is the depth of the partitioning. The partitioning step is executed as follows. The input string is scanned from left to right. At each index position i, the prefixlen subsequent characters are used to determine one of the |A|^prefixlen partitions. This index i is then written to the calculated partition's buffer. At the end of the scan, each partition will contain the suffix pointers for suffixes that all have the same prefix of size prefixlen.

To further illustrate the partition step, consider the following example. Partitioning the string ATTAGTACA$ using a prefixlen of 1 would create four partitions of suffixes, one for each symbol in the alphabet. (We ignore the final partition consisting of just the string terminator symbol $.) The suffix partition for the character A would be 0, 3, 6, 8, representing the suffixes ATTAGTACA$, AGTACA$, ACA$, A$. The suffix partition for the character T would be 1, 2, 5, representing the suffixes TTAGTACA$, TAGTACA$, TACA$. In phase two, we use the wotdeager algorithm to build the suffix tree on each partition using a top-down construction.

    Algorithm PWOTD(string, prefixlen)
    Phase 1:
        Scan the string and partition suffixes based
        on the first prefixlen symbols of each suffix
    Phase 2: Do for each partition:
        1.  START BuildSuffixTree
        2.  Populate suffixes from current partition
        3.  Sort suffixes on first symbol using temp
        4.  Output branching and leaf nodes to the tree
        5.  Push the nodes pointing to an unevaluated range
            onto the stack
        While stack is not empty
            6.  Pop a node
            7.  Find the longest common prefix (LCP) of
                all the suffixes in this range by checking
                the string
            8.  Sort the range in suffixes on the first
                symbol using temp
            9.  Write out branching nodes or leaf nodes to tree
            10. Push the nodes pointing to an unevaluated range
                onto the stack
        11. END

    Figure 2: The TDD Algorithm

The pseudo-code for the PWOTD algorithm is shown in Figure 2. While the partitioning in phase one of PWOTD is simple enough, the algorithm for wotdeager in phase two warrants further discussion. We now illustrate the wotdeager algorithm using an example.

3.1.1 Example Illustrating the wotdeager Algorithm

The PWOTD algorithm requires four data structures for constructing suffix trees: an input string array, a suffix array, a temporary array, and the suffix tree. For the discussion that follows, we name each of these structures string, suffixes, temp, and tree, respectively.

The suffixes array is first populated with suffixes from a partition after discarding the first prefixlen characters. Using the same example string as before, ATTAGTACA$, consider the construction of the suffixes array for the T-partition. The suffixes in this partition are at positions 1, 2, and 5. Since all these suffixes share the same prefix, T, we add one to each offset to produce the new suffix array 2, 3, 6. The next step involves sorting this array of suffixes based on the first character. The first characters of each suffix are T, A, A. The sorting is done using an efficient algorithm called count-sort in linear time (for a constant alphabet size). In a single pass, for each character in the alphabet, we count the number of occurrences of that character as the first character of each suffix, and copy the suffix pointers into the temp array. We see that the count for A is 2 and the count for T is 1; the counts for G, C, and $ are 0.
We can use these counts to determine the character group boundaries: group A will start at position 0 with two entries, and group T will start at position 2 with one entry. We make a single pass through the temp array and produce the suffixes array sorted on the first character. The suffixes array is now 3, 6, 2. The A-group has two members and is therefore a branching node. These two suffixes completely determine the sub-tree below this node. Space is reserved in the tree to write this non-leaf node once it is expanded, then the node is pushed onto the stack. Since the T-group has only one member, it is a leaf node and will be immediately written to the tree. Since no other children need to be processed, no additional entries are added to the stack, and the A-group node will be popped off first.

Once the node is popped off the stack, we find the longest common prefix (LCP) of all the suffixes in the group 3, 6. Positions 3 and 6 both hold A, so we examine position 4 (G) and position 7 (C) to determine that the LCP is 1. Each suffix pointer is incremented by the LCP, and the result is processed as before. The computation proceeds until all nodes have been expanded and the stack is empty. Figure 1 shows the complete suffix tree and its array representation.

3.1.2 Discussion of the PWOTD Algorithm

Observe that phase 2 of PWOTD operates on subsets of the suffixes of the string. In wotdeager, for a string of n symbols, the size of the suffixes array and the temp array needed to be 4 × n bytes (assuming 4-byte integers are used as pointers). By partitioning in phase 1, the amount of memory needed by the suffix arrays in each run is just 4n / |A|^prefixlen. This is an important point: partitioning decreases the main-memory requirements for suffix tree construction, allowing independent sub-trees to be built entirely in main memory. Suppose we are partitioning a 100 million symbol string over an alphabet of size 4.
Using prefixlen = 2 will decrease the space requirement of the suffixes and temp arrays from 400 MB to 25 MB each, and the tree array from 1200 MB to 75 MB. Unfortunately, these savings are not entirely free. The cost to partition increases linearly with prefixlen. For small input strings where we have sufficient main memory for all the structures, we can skip the partitioning phase entirely. It is not necessary to continue partitioning once the suffixes and temp arrays fit into memory. For even very large datasets, such as the human genome, partitioning be
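
As an illustration of Section 3.1, the following Python sketch walks through phase one (partitioning), the count-sort pass, and the LCP step on the running example ATTAGTACA$, and reproduces the per-partition memory estimate. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the in-memory dictionaries and lists standing in for partition buffers, the handling of suffixes shorter than prefixlen, the alphabet ordering "$ACGT", and the even spread of suffixes across partitions are all choices made here for illustration.

```python
from collections import defaultdict


def partition_suffixes(string, prefixlen):
    """Phase one: bucket each suffix start index by its first prefixlen symbols.

    Assumption: suffixes shorter than prefixlen get their own (shorter) key.
    """
    partitions = defaultdict(list)
    for i in range(len(string)):
        partitions[string[i:i + prefixlen]].append(i)
    return dict(partitions)


def count_sort_on_first_symbol(string, suffixes, alphabet):
    """Sort suffix pointers by their first symbol in linear time."""
    # Count how often each symbol occurs as a first character,
    # copying the pointers into a temporary array along the way.
    counts = {c: 0 for c in alphabet}
    temp = []
    for s in suffixes:
        counts[string[s]] += 1
        temp.append(s)
    # Turn the counts into the start position of each character group.
    starts = {}
    position = 0
    for c in alphabet:
        starts[c] = position
        position += counts[c]
    # One pass over temp writes each pointer into its group.
    result = [None] * len(temp)
    for s in temp:
        c = string[s]
        result[starts[c]] = s
        starts[c] += 1
    return result, counts


def lcp_length(string, group):
    """Length of the longest common prefix of the suffixes in group."""
    k = 0
    while all(s + k < len(string) for s in group):
        if len({string[s + k] for s in group}) > 1:
            break
        k += 1
    return k


def per_partition_bytes(n, alphabet_size, prefixlen, bytes_per_pointer=4):
    """Estimate 4n / |A|^prefixlen, assuming evenly sized partitions."""
    return bytes_per_pointer * n // alphabet_size ** prefixlen


text = "ATTAGTACA$"
parts = partition_suffixes(text, 1)
# Discard the shared one-symbol prefix of the T-partition.
t_suffixes = [i + 1 for i in parts["T"]]
sorted_t, first_counts = count_sort_on_first_symbol(text, t_suffixes, "$ACGT")
a_group_lcp = lcp_length(text, sorted_t[:first_counts["A"]])
```

On the example string this gives the A-partition 0, 3, 6, 8 and the T-partition 1, 2, 5, sorts the shifted T-partition 2, 3, 6 into 3, 6, 2, and finds an LCP of 1 for the A-group, matching the walkthrough in Section 3.1.1. Calling per_partition_bytes(100_000_000, 4, 2) returns 25,000,000 bytes, the 25 MB figure quoted for the suffixes and temp arrays.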