CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats



W52–W57 Nucleic Acids Research, 2007, Vol. 35, Web Server issue
doi:10.1093/nar/gkm360

Ibtissem Grissa1,*, Gilles Vergnaud1,2 and Christine Pourcel1

1 Univ Paris-Sud, Institut de Génétique et Microbiologie, UMR 8621, Orsay, F-91405 and 2 Centre d’Etude du Bouchet, 5 rue Lavoisier, 91710 Vert le Petit, France

Received January 25, 2007; Revised April 6, 2007; Accepted April 25, 2007

*To whom correspondence should be addressed. Tel: 33 1 69 15 30 01; Fax: 33 1 69 15 66 78; Email:

© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Clustered regularly interspaced short palindromic repeats (CRISPRs) constitute a particular family of tandem repeats found in a wide range of prokaryotic genomes (half of eubacteria and almost all archaea). They consist of a succession of highly conserved regions (DR) varying in size from 23 to 47 bp, separated by similarly sized unique sequences (spacer) of usually viral origin. A CRISPR cluster is flanked on one side by an AT-rich sequence called the leader and assumed to be a transcriptional promoter. Recent studies suggest that this structure represents a putative RNA-interference-based immune system. Here we describe CRISPRFinder, a web service offering tools to (i) detect CRISPRs including the shortest ones (one or two motifs); (ii) define DRs and extract spacers; (iii) get the flanking sequences to determine the leader; (iv) blast spacers against the Genbank database and (v) check if the DR is found elsewhere in prokaryotic sequenced genomes. CRISPRFinder is freely accessible at http://crispr.u-psud.fr/Server/CRISPRfinder.php.

INTRODUCTION

Genomic structures corresponding to CRISPRs were first observed in 1987 in Escherichia coli (1) and were subsequently reported in other organisms under different names [TREP (2), SRSR (3,4), DRVs (5), LCTR (6), SPIDR (7)] until the CRISPR acronym was proposed by Jansen et al. (8). The direct repeat (DR) sequences generally carry a low level of palindromic symmetry; they are remarkably well conserved within a species (up to 248 exact copies in Verminephrobacter eiseniae EF01-2). However, one of the flanking DRs is frequently truncated or diverged (see Supplementary Data). The DR size varies from 24 to 47 bp, whereas the spacer sequence is generally within the range of 0.6–2.5 times the DR size. What makes spacers distinctive is that they apparently derive from conjugative plasmids or bacteriophages (2,9–11). A prokaryotic genome may harbour up to 16 CRISPR clusters with the same or a different DR.

In a genome, a single CRISPR is generally associated with a family of genes called cas, for CRISPR-associated (8,12), encoding proteins showing functional similarity with components of the eukaryotic RNA interference (RNAi) systems (13). In addition, it was demonstrated in two archaea, Archaeoglobus fulgidus (14) and Sulfolobus solfataricus (15), that the CRISPR locus is transcribed into small RNAs (smRNA), probably from one of the flanking regions, the leader, acting as a promoter. These observations and the viral origin of spacers have led to the hypothesis that the CRISPR-associated system (CASS) is a prokaryotic defence mechanism against genetic aggressions (10,13,16).

Within species, CRISPRs may be present in a subset of strains, where they sometimes show polymorphism. The DR and the order of the spacers are well conserved, but the number of motifs (DR + spacer) differs from strain to strain.
To better understand the mechanisms underlying CRISPR evolution, three evolutionary rules were proposed by Pourcel et al. (10) and confirmed by Lillestol et al. (15): (i) polarized acquisition of spacers near the leader sequence; (ii) random loss of motifs and (iii) shared ancestry when spacers are identical.

In silico analyses of CRISPRs started in 1995 (2), but no specific stand-alone CRISPR software tool was created. Several software tools were used by different authors to identify these particular repeats, but background usually had to be discarded manually, and some CRISPR clusters were generally missed or neglected, especially the shortest ones (fewer than three motifs). This is the case, for example, of Tandem Repeat Finder (17) when considering a motif (DR + spacer) as a degenerate repeat (10,18), or Locating Uniform poly-Nucleotide Areas (LUNA), a program for finding degenerate repeats in microbial genomes on a desktop computer. The repeats can be filtered using several parameters including length, distance and level of conservation. LUNA was used especially for finding CRISPRs in archaea (4,15). Another program, Patscan (19), a pattern-matching tool that searches for sequences fitting a user-supplied pattern, was applied to

METHODS AND IMPLEMENTATION

CRISPRFinder core routines were developed in Perl under Debian Linux. The input of the web tool is a genomic query sequence of length up to 67 Mb in ‘FASTA’ format. Possible locations of CRISPRs (consisting of at least one motif) are detected by finding maximal repeats. A maximal repeat (26) is a repeat that cannot be extended in either direction without incurring a mismatch. The total number of maximal repeats in a sequence of size n is linear (less than n), which is useful because they can then be computed in linear time using a suffix-tree-based algorithm.
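To make the definition of a maximal repeat concrete, the following Python sketch lists exact maximal repeat pairs with a naive quadratic scan. It is only an illustration under stated assumptions: the function name `maximal_repeat_pairs` and the `min_len` parameter are hypothetical, no mismatch is tolerated, and the scan is not the linear-time suffix-tree approach mentioned above nor any part of the CRISPRFinder code.

```python
def maximal_repeat_pairs(seq, min_len=5):
    """Return (i, j, length) for each exact maximal repeat pair in seq.

    Hypothetical helper for illustration only: a naive O(n^2) scan,
    not a suffix-tree or enhanced-suffix-array algorithm.
    Positions are 0-based; i is the first copy, j the second.
    """
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximal: the bases preceding the two copies must differ
            # (a copy starting at position 0 cannot be extended leftwards).
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of the
            # sequence, which makes the pair right-maximal by construction.
            length = 0
            while j + length < n and seq[i + length] == seq[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs


# Toy sequence: two copies of GATTACA separated by four bases.
print(maximal_repeat_pairs("CGATTACAGTTTGATTACAC", min_len=5))
# -> [(1, 12, 7)]
```

In this toy sequence the two copies start at positions 1 and 12, so the sequence between them has length j - (i + length) = 4. On genome-sized input a quadratic scan of this kind would be impractical, which is why linear-time index structures matter.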
A CRISPR pattern of two DRs and a spacer may be considered as a maximal repeat where the repeated sequences are separated by a sequence of approximately the same length. The operation of the program can be divided into four main steps, summarized in Figure 1: (Step 1) browsing the maximal repeats of length 23–55 bp interspaced by sequences of 25–60 bp; (Step 2) selecting the DR consensus according to a defined score that takes into account the number of occurrences of the candidate DR in the whole genome and favours internal mismatches between the DRs over mismatches in the first or last nucleotides; (Step 3) defining candidate CRISPRs after checking that they fit the CRISPR definition; (Step 4) eliminating residual tandem repeats.

In the first step, maximal repeats are found by the software Vmatch (http://www.vmatch.de/), the upgrade of REPuter (22–24). Vmatch is based on a comprehensive implementation of enhanced suffix arrays (27), which provides the power of suffix trees with lower space requirements. A one-nucleotide mismatch is allowed, so that minimal CRISPRs with a single nucleotide mutation between DRs can be found. The maximal repeats obtained are then grouped to define regions of possible CRISPRs, with a display of the consensus DR candidates related to each cluster. The second st (...truncated)