Contents lists available at ScienceDirect

Methods

CRISPR-Cas bioinformatics

Omer S. Alkhnbashi (a), Tobias Meier (c), Alexander Mitrofanov (a), Rolf Backofen (a, b), Björn Voß (c)

(a) Chair of Bioinformatics, University of Freiburg, Freiburg, Germany
(b) Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Germany
(c) Computational Biology, Institute of Biochemical Engineering, University of Stuttgart, Stuttgart, Germany

ARTICLE INFO

Keywords: CRISPR-Cas; Bioinformatics; CRISPR array; Cas genes; CRISPR leader; crRNAs; PAM; Spacers; Direct repeats; Anti-CRISPR proteins

ABSTRACT

Clustered regularly interspaced short palindromic repeats (CRISPR) and their associated proteins (Cas) are essential genetic elements in many archaeal and bacterial genomes, playing a key role in a prokaryote adaptive immune system against invasive foreign elements. In recent years, the CRISPR-Cas system has also been engineered to facilitate target gene editing in eukaryotic genomes. Bioinformatics played an essential role in the detection and analysis of CRISPR systems, and here we review the bioinformatics-based efforts that pushed the field of CRISPR-Cas research further. We discuss the bioinformatics tools that have been published over the last few years and finally present the most popular tools for the design of CRISPR-Cas9 guides.

1. Introduction

Roughly 90% of archaeal and 48% of bacterial genomes harbor an adaptive defense system against invading genetic elements called CRISPR-Cas (Clustered Regularly Interspaced Short Palindromic Repeats/CRISPR-associated sequences) [1]. CRISPR-Cas systems consist of three parts: first, an array of repeats separated by unique sequences called spacers; second, a leader sequence upstream of the CRISPR array that contains the promoter and forms a single transcript together with the CRISPR array; and third, a set of associated cas genes [1] that encode the proteins required for subsequent processing of information contained within the array (Fig. 1). CRISPR-Cas systems carry out
three major functions: (1) adaptation, where a fragment of foreign genetic material, the so-called protospacer, is selected and incorporated into the CRISPR array as a new spacer (Fig. 1); (2) biogenesis of CRISPR RNAs (crRNAs), where the CRISPR array is transcribed and processed into mature crRNAs that consist of single spacers with flanking handles stemming from the repeat; and (3) interference, where invading genetic elements are degraded [2–4] by respective CRISPR-associated nucleases that are guided by the crRNA (Fig. 1). The crRNA and the Cas proteins form highly specific RNP complexes such as Cmr (Cas module RAMP) [3,4] or Cascade [5,6]. Both the adaptation and the interference process often require an additional motif in the vicinity of the protospacer, named Protospacer Adjacent Motif (PAM) [7] (Fig. 1).

CRISPR-Cas systems have been classified into two major classes, at least six types and several subtypes based on the effector module organization [8]. Class I CRISPR-Cas systems utilize multi-protein effector complexes, whereas Class II CRISPR-Cas systems utilize single-protein effectors. The majority of CRISPR-Cas systems are Class I systems, which are found in both bacteria and archaea. These systems consist of 6–10 Cas proteins acting together in large Cascade or Cmr complexes. In contrast, Class II systems are lightweight, with only 2–4 Cas proteins, and were thought to be exclusive to bacteria until a recent metagenomic study [9] discovered Cas9 proteins encoded in archaeal genomes. Due to their compact architecture, Class II CRISPR-Cas systems have been adapted for efficient genome editing in eukaryotes [10,11]. The most prominent system, CRISPR-Cas9, requires an additional RNA component, the trans-activating crRNA (tracrRNA) [12], which is partially complementary to the crRNAs and works as an adapter to recruit crRNAs to the RNP complex. Today, synthetic fusions of crRNAs and tracrRNA, so-called guide RNAs (gRNAs), are usually used in genome editing applications.

Bioinformatics played a key role in the discovery of
the CRISPR-Cas system and the identification of new functions. The signature repeat-spacer architecture of CRISPR arrays was first described by Yoshizumi Ishino in 1987 [13]. Later, aided by bioinformatics analyses, Francisco Mojica showed that CRISPR arrays were not only present in Escherichia coli but also in most archaeal and many bacterial genomes [14]. Subsequent bioinformatics analyses revealed matches of spacers to bacteriophages, which led to the correct hypothesis that CRISPR-Cas systems act as an acquired immune system [15–17]. Later, another bioinformatics study on spacer matches successfully predicted CRISPR-Cas systems to target primarily DNA rather than RNA [18]. Thus, bioinformatics played a crucial role in the discovery of CRISPR systems as well as in the generation of initial functional hypotheses.

https://doi.org/10.1016/j.ymeth.2019.07.013
Received 1 March 2019; Received in revised form 19 June 2019; Accepted 15 July 2019
Corresponding author. E-mail addresses: alkhanbo@informatik.uni-freiburg.de (O.S. Alkhnbashi), tobias.meier@ibvt.uni-stuttgart.de (T. Meier), mitrofan@informatik.uni-freiburg.de (A. Mitrofanov), backofen@informatik.uni-freiburg.de (R. Backofen), bjoern.voss@ibvt.uni-stuttgart.de (B. Voß).
1046-2023/© 2019 Elsevier Inc. All rights reserved.

Computational studies within the field have progressed along two tracks, with analyses of protein-coding genes on the one hand and prediction of nucleic acid interactions on the other. Since their initial discovery, much effort has gone into the classification of CRISPR-Cas systems, which also led to the discovery of novel cas genes, resulting in an iterative optimization of classification and annotation schemes [8,19–21]. Each iteration spawned a new generation of biochemical studies that focused on the freshly discovered proteins and their role in CRISPR-Cas immunity and led to the discovery of new types, e.g. the
recently described Type IV, V and VI systems [8,22]. Today, the classification of CRISPR systems is based on both the organization of cas genes and the architecture of the CRISPR array [8]. Furthermore, the number of known cas gene families increased from 4 to 27, accompanied by an ever-growing list of accessory genes. This fast growth indicates that we have so far only seen the tip of the iceberg. Metagenomic data sets are a potentially rich source of new cas gene families and CRISPR-Cas types.

There are many more interesting aspects of CRISPR-Cas systems that can be studied with the help of tailored bioinformatics tools. Among them are the identification of target sequences of the crRNAs, prediction of CRISPR leader sequences, characterization of PAM motifs and CRISPR-Cas9 guide design. In the following sections, we present an overview of tools for these and other purposes and discuss best practices for efficient CRISPR-Cas analyses. In Fig. 2, we provide an overview of the computational methods, the tasks they can be used for and possible analysis pathways. Detailed descriptions of the individual tools follow in the upcoming sections.

2. Prediction of CRISPR-Cas systems

In general, the prediction of CRISPR-Cas systems is based on the identification of cas genes and CRISPR arrays. While cas genes can be predicted by classical protein homology search tools (e.g. Pfam, HMMER3), CRISPR arrays are far less conserved, mainly as a result of spacer acquisition, and therefore require different strategies. The basic idea of all the methods presented below is to identify repetitive sequences that meet specific requirements of repeat length, spacing, similarity or number.

2.1. CRISPRFinder

[...]

(ii) leader sequences differ between CRISPR systems from different species but may have high similarity for systems with the same DR in one species. As a result of the first property, overlaps between reads that significantly exceed the length of a DR indicate a common origin. The second property implies that reads may form
large overlaps if sequenced from leader regions of different CRISPR systems in the same species. As long as leader sequences are not identical, most of these can be resolved by not allowing mismatches during overlap identification. Finally, reads from the same CRISPR locus might be distributed over different clusters. Here, paired-end read information is used to merge such clusters if sufficient read ends connect them. For each of the resulting clusters, reads are assembled into contigs, which are validated by a modified version of CRT.

MetaCRISPR is a command line tool that runs on UNIX systems such as Linux or MacOS and requires Bowtie [37], Genometools (http://genometools.org) and the Python package networkx (https://networkx.github.io). Read identification is the computationally most demanding step of the pipeline. On a soil data set with 271 M reads and a read length of 150 bp, metaCRISPR used 13.6 GB of RAM and ran for 108 min on four threads.

2.9. CRISPRleader

The leader sequence of a CRISPR array, together with cas1, cas2 and cas4, plays a critical role in adaptation of the CRISPR system. It carries the primary promoter for CRISPR transcription and marks the site where new spacers are primarily incorporated [38–43]. Therefore, detection and characterization of CRISPR leader sequences is an important step in studying the adaptation phase. CRISPR leaders are non-coding DNA sequences that do not show strong conservation which could be detected by sequence alignment. CRISPRleader [44] is an efficient approach to determine the boundaries of CRISPR leaders. It is based on the observation that leader sequences of related CRISPR arrays are similar. Related arrays can be identified by the similarity of their DRs, and clustering of CRISPR array upstream regions based on this feature constitutes the first step of CRISPRleader. In the second step, conservation of leader sequences is analyzed using a string kernel approach (see [45,46] for details) that can capture more information than traditional sequence
alignment and is especially capable of detecting a collection of local motifs. The application of CRISPRleader to all known CRISPR arrays showed that the length distribution of leaders differs between archaea and bacteria and that leaderless arrays have a lower number of spacers, presumably because they lost the capability to incorporate new spacers. Interestingly, they likely retained their interference function.

CRISPRleader is a stand-alone tool implemented in Python. It reports a full annotation of the CRISPR array in reader-friendly HTML format. The results also include the strand orientation and the conserved core leader boundaries in a standardized BED format, such that they can be uploaded to any genome browser.

2.10. CRISPRtionary

Spacer sequences are almost exclusively incorporated at the leader end of the CRISPR array. Therefore, the spacer arrangement reflects the infection history of the host and can serve as a genetic marker for molecular typing and evolutionary analyses of closely related strains. For example, the presence of identical spacers in the same CRISPR locus in different strains indicates shared ancestry. CRISPRtionary [47] is a web-based tool, available at https://crispr.i2bc.paris-saclay.fr/CRISPRcompar/Dict/Dict.php, that compares and annotates spacers from homologous CRISPR loci in distinct strains, thus supporting phylogenetic analyses. As input, the user has to provide a FASTA file containing sequences of the same CRISPR locus for each strain of interest. Such a collection can also be prepared with the accompanying tool CRISPRcompar (https://crispr.i2bc.paris-saclay.fr/CRISPRcompar/). CRISPRFinder is used to identify DRs and spacers, and the latter are labeled by strain and position in the CRISPR array. The positions reflect the order of acquisition, i.e. position 1 is the spacer furthest away
from the leader. Finally, CRISPRtionary generates (i) a spacer dictionary, which contains a catalog of the spacers and their labels, and (ii) a binary file, where rows represent strains and columns represent spacers, to visualize the spacer compositions of the strains.

3. Databases related to CRISPR-Cas

3.1. CRISPRdb

[...]

(ii) the genomic sequence (chromosome or plasmid); (iii) the CRISPR locus, which provides ID, start and end positions, creation date and modification date of the entry; (iv) the ID, length and consensus sequence of the direct repeats; (v) the IDs, lengths and sequences of the spacers. Users can create their own private databases of CRISPR arrays and then compare them to the public data by BLAST. In parallel to the extended CRISPR-Cas prediction tool CRISPRCasFinder (see above), a new database called CRISPRCasdb was made available. CRISPRCasdb combines information about CRISPR arrays and Cas annotations for 240 archaeal and 9242 bacterial genomes. CRISPRdb is available at http://crispr.i2bc.paris-saclay.fr/crispr/ and CRISPRCasdb at https://crisprcas.i2bc.paris-saclay.fr/MainDb/Index.

3.2. CRISPRone

CRISPRone [50] is a web resource for CRISPR-Cas systems identified in 11,102 complete and 21,186 draft genomes. Furthermore, CRISPRone has a special focus on false CRISPR arrays, so-called mock CRISPRs. Additionally, the website offers services for (i) the prediction of CRISPR arrays based on MetaCRT [51], (ii) identification of Cas proteins based on Hidden Markov Models and HMMER, and (iii) prediction of the anti-repeat, which is complementary to the consensus repeat and is part of the tracrRNA.

3.3. Anti-CRISPRdb

The Anti-CRISPRdb [52] is so far the only public database for anti-CRISPR proteins (Acr). It consists of a manual collection of published anti-CRISPR proteins and their homologs from public databases. It provides the user with the ability to browse, search, screen, download and share data on known and potential new anti-CRISPR proteins. This database contains 106 non-redundant protein sequences
grouped into six main families, namely AcrID, AcrIF, AcrIE, AcrIIA, AcrIIC and AcrVIB, and 23 sub-families, including AcrIF1–10, AcrIE1–4, AcrIIA1–5, AcrIIC1–3 and AcrID1 [53–56]. The database is accessible at http://cefg.uest...

4. Classification

Classification of CRISPR-Cas systems is essential to illustrate the origins and evolution of CRISPR loci in microbial genomes. Unfortunately, the majority of cas genes have evolved rapidly compared to other genes in archaeal and bacterial genomes, which impedes straightforward classification into distinct families [19,57]. On the other hand, CRISPR repeats have been directly classified into families capturing the full diversity of sequence and RNA secondary structure of CRISPR-Cas systems [33,58]. However, although CRISPR repeat classification determines the evolution of CRISPR RNA, there is no evidence for a one-to-one relationship between Cas proteins and the type of CRISPR repeat that is recognized by those proteins. Therefore, the classification of CRISPR-Cas systems based on Cas proteins is essential.

4.1. CRISPR-Cas classification

Owing to the very diverse nature of Cas protein sequences, cas gene composition, modularity and the interchangeability of Cas proteins [21,22,59], the classification of CRISPR-Cas systems based on Cas proteins is an important and challenging issue. In 2015, Makarova et al. proposed a very comprehensive CRISPR-Cas classification scheme, which groups archaeal and bacterial CRISPR-Cas systems into two major classes: Class I, which utilizes multi-protein effector complexes, and Class II, which uses single multidomain protein effectors [8]. These classes are subdivided into six types, which are further divided into nineteen subtypes. The classification is based on combining features of the architecture of the cas loci and the analysis of signature protein families. Briefly, the authors developed a library of 394 position-specific scoring matrices (PSSMs) from all protein families associated with CRISPR-Cas systems. Using a stringent similarity threshold, these
PSSMs were used to search for protein sequences from complete archaeal and bacterial genomes in the NCBI Reference Sequence Database. The candidate loci were extended by up to 5 genes in both directions and checked for completeness. Because cas genes involved in adaptation are shared between different subtypes, completeness refers to the effector genes only. The complete loci were scored with multiple PSSMs for the signature genes of the different types and subtypes. The final classification was based on a new similarity measure between two sets of interference proteins. It is defined as the average of the pairwise normalized similarity between all protein pairs across the two sets. Clustering based on this similarity accurately reproduced the pre-existing manual CRISPR classification. Additionally, a k-nearest neighbor classifier based on this similarity resulted in only 4 out of 1942 loci being classified incorrectly, yielding an accuracy of 0.998.

In 2017, Shmakov et al. [32] performed a comprehensive analysis of draft and complete archaeal and bacterial genomes along with metagenomic contigs and identified three new Class II CRISPR-Cas subtypes of type V, all containing a RuvC-like endonuclease domain, and three subtypes of type VI, each having two HEPN domains that were predicted to possess RNase activity. Recently, two studies [60,61] performed a systematic comparison of genes functionally linked to CRISPR-Cas systems, which led to the identification and classification of new CRISPR-associated gene families. Usually, their members are located in the vicinity of Type III CRISPR-Cas interference gene cassettes. The studies further showed that around half of the new accessory genes do not encode CRISPR-associated Rossmann fold (CARF) domains and that their function remained unknown. In addition, the authors observed that non-CARF accessory genes are more diverse than their CARF counterparts. Finally, due to the diversity of non-CARF genes, the authors hypothesize that
additional families remain to be discovered.
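
The set-based similarity described in Section 4.1 (the average of the pairwise normalized similarity over all protein pairs across two sets of interference proteins) and the k-nearest neighbor step built on it can be illustrated with a minimal Python sketch. Note the assumptions: the per-pair score used here, `kmer_similarity`, is a toy k-mer Jaccard index chosen only to keep the example self-contained; it is a stand-in for the normalized protein similarity of the original scheme, whose exact definition is not given in this text. The function names, the toy sequences and the labels `typeA`/`typeB` are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Toy normalized similarity in [0, 1]: Jaccard index of the k-mer sets.

    ASSUMPTION: a placeholder for the normalized protein similarity used in
    the classification scheme; it is not the published scoring function.
    """
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not kmers_a or not kmers_b:
        return 0.0
    return len(kmers_a & kmers_b) / len(kmers_a | kmers_b)


def set_similarity(set_a, set_b, pair_sim=kmer_similarity) -> float:
    """Average of the pairwise similarity over all cross-set protein pairs."""
    pairs = list(product(set_a, set_b))
    if not pairs:
        return 0.0
    return sum(pair_sim(p, q) for p, q in pairs) / len(pairs)


def knn_classify(query, labeled_loci, k: int = 3) -> str:
    """Majority vote among the k labeled loci most similar to the query.

    labeled_loci is a list of (protein_set, label) tuples.
    """
    ranked = sorted(
        labeled_loci,
        key=lambda item: set_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(n_wrong: int, n_total: int) -> float:
    """Fraction of correctly classified items."""
    return 1.0 - n_wrong / n_total


if __name__ == "__main__":
    # Hypothetical toy loci; real input would be sets of effector proteins.
    reference = [
        (["AAAAAA"], "typeA"),
        (["AAAAAC"], "typeA"),
        (["CCCCCC"], "typeB"),
    ]
    print(knn_classify(["AAAAAA"], reference, k=3))
    # 4 misclassified loci out of 1942, as reported in the text
    print(round(accuracy(4, 1942), 3))
```

The accuracy helper reproduces the arithmetic behind the reported figure: 1 - 4/1942 is approximately 0.9979, which rounds to 0.998. Passing a different `pair_sim` callable (for example, an alignment-based score scaled to [0, 1]) leaves the averaging and voting logic unchanged.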
