CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats
W52–W57 Nucleic Acids Research, 2007, Vol. 35, Web Server issue
doi:10.1093/nar/gkm360
Ibtissem Grissa1,*, Gilles Vergnaud1,2 and Christine Pourcel1
1Univ Paris-Sud, Institut de Génétique et Microbiologie, UMR 8621, Orsay, F-91405 and 2Centre d’Etude du
Bouchet, 5 rue Lavoisier, 91710 Vert le Petit, France
Received January 25, 2007; Revised April 6, 2007; Accepted April 25, 2007
*To whom correspondence should be addressed. Tel: 33 1 69 15 30 01; Fax: 33 1 69 15 66 78; Email:
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
ABSTRACT
Clustered regularly interspaced short palindromic
repeats (CRISPRs) constitute a particular family of
tandem repeats found in a wide range of prokaryotic
genomes (half of eubacteria and almost all archaea).
They consist of a succession of highly conserved
regions (DR) varying in size from 23 to 47 bp,
separated by similarly sized unique sequences
(spacer), usually of viral origin. A CRISPR cluster is
flanked on one side by an AT-rich sequence called
the leader and assumed to be a transcriptional
promoter. Recent studies suggest that this structure
represents a putative RNA-interference-based
immune system. Here we describe CRISPRFinder, a
web service offering tools to (i) detect CRISPRs
including the shortest ones (one or two motifs);
(ii) define DRs and extract spacers; (iii) get the
flanking sequences to determine the leader;
(iv) BLAST spacers against the GenBank database and
(v) check if the DR is found elsewhere in
prokaryotic sequenced genomes. CRISPRFinder is
freely accessible at http://crispr.u-psud.fr/Server/CRISPRfinder.php.
INTRODUCTION
Genomic structures corresponding to CRISPRs were
observed first in 1987 in Escherichia coli (1) and were
subsequently reported in other organisms under different
names [TREP (2), SRSR (3,4), DRVs (5), LCTR (6),
SPIDR (7)] until the CRISPR acronym was proposed by
Jansen et al. (8). The direct repeat sequences generally
carry a low level of palindromic symmetry; they are
remarkably well conserved within a species (up to 248
exact copies in Verminephrobacter eiseniae EF01-2).
However, one of the flanking DRs is frequently truncated
or diverged (see Supplementary Data). The DR size varies
from 24 to 47 bp whereas the spacer sequence is generally
within the range of 0.6–2.5 times the DR size. The originality
of spacers is that they apparently derive from conjugative
plasmids or bacteriophages (2,9–11). A prokaryotic
genome may harbour up to 16 CRISPR clusters with the
same or a different DR. In a genome, a single CRISPR is
generally associated with a family of genes called cas for
CRISPR-associated (8,12), encoding proteins showing
functional similarity with components of the eukaryotic
RNA interference (RNAi) systems (13). In addition, it was
demonstrated in two archaea, Archaeoglobus fulgidus (14)
and Sulfolobus solfataricus (15), that the CRISPR locus
is transcribed into small RNAs (smRNA) probably
from one of the flanking regions, the leader, acting as a
promoter. These observations and the viral origin
of spacers have led to the hypothesis that the CRISPR-associated system (CASS) is a prokaryotic defence
mechanism against genetic aggressions (10,13,16). Within
species, CRISPRs may be present in a subset of strains,
where they sometimes show polymorphism. The DR and
the order of the spacers are well conserved, but the
number of motifs (DR + spacer) differs from strain to
strain. To better understand the mechanisms underlying
the CRISPRs’ evolutionary scenario, three evolution rules
were proposed by Pourcel et al. (10) and confirmed by
Lillestol et al. (15): (i) polarized acquisition of spacers
near the leader sequence; (ii) random loss of motifs and
(iii) shared ancestry when spacers are identical.
In silico analyses of CRISPRs started in 1995 (2), but no
specific stand-alone CRISPR software tool was created.
Several programs were used by different authors to identify
these particular repeats, but background hits usually had to
be discarded manually, and some CRISPR clusters were
generally missed or neglected, especially the
shortest ones (fewer than three motifs). This is the case, for
example, of Tandem Repeat Finder (17) when considering
a motif (DR + spacer) as a degenerate repeat (10,18), or
Locating Uniform poly-Nucleotide Areas (LUNA),
a program for finding degenerate repeats in microbial
genomes on a desktop computer. The repeats can be
filtered using several parameters including length, distance
and level of conservation. LUNA was used especially for
finding CRISPRs in archaea (4,15). Another program,
Patscan (19), a pattern-matching tool that searches for
sequences fitting a user-supplied pattern, was applied to
METHODS AND IMPLEMENTATION
CRISPRFinder core routines were developed in Perl
under Debian Linux. The input of the web tool is a
genomic query sequence of length up to 67 Mb in
‘FASTA’ format. Possible locations of CRISPRs (consisting of at least one motif) are detected by finding
maximal repeats. A maximal repeat (26) is a repeat that
cannot be extended in either direction without incurring a
mismatch. The total number of maximal repeats in a
sequence of size n is linear in n (fewer than n), so
they can be computed in linear time using a
suffix-tree-based algorithm. A CRISPR pattern of two
DRs and a spacer may be considered as a maximal repeat
where the repeated sequences are separated by a sequence
of approximately the same length.
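The definition above can be illustrated on a toy sequence. The following is a minimal sketch only: it enumerates exact maximal repeat pairs by brute force in quadratic time, not with the linear-time suffix-tree approach mentioned in the text, and the function name, the toy DR and the toy spacer are all hypothetical.

```python
def maximal_repeat_pairs(seq, min_len):
    """Return (i, j, length) for every exact repeated pair that cannot be
    extended to the left or to the right without incurring a mismatch.
    Brute-force illustration, suitable for short toy sequences only."""
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip pairs that could still be extended to the left.
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right until the first mismatch or the end.
            length = 0
            while j + length < n and seq[i + length] == seq[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs


# Hypothetical toy motif: two copies of a 12 bp "DR" around a 12 bp "spacer".
toy_dr = "GATTACAGCCTG"
toy_spacer = "CAAGTCTCGTAC"
toy = "TTAT" + toy_dr + toy_spacer + toy_dr + "AATA"

pairs = maximal_repeat_pairs(toy, min_len=8)
for i, j, length in pairs:
    # The interspacing sequence lies between the end of the first copy
    # and the start of the second copy.
    print(i, j, length, "spacer length:", j - (i + length))
```

On this toy input the only pair reported is the two DR copies, and the interspacing sequence has the same length as the repeat, which is the pattern a minimal CRISPR is expected to show.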
The operation of the program can be divided into four
main steps summarized in Figure 1: (Step 1) browsing the
maximal repeats of length 23–55 bp interspaced by
sequences of 25–60 bp, (Step 2) selecting the DR consensus
according to a defined score that takes into account the
number of occurrences of the candidate DR in the whole
genome and favours internal mismatches between the
DRs over mismatches in the first or the last
nucleotides, (Step 3) defining candidate CRISPRs after
checking whether they fit the CRISPR definition, (Step 4) eliminating residual tandem repeats.
In the first step, maximal repeats are found by the
software Vmatch (http://www.vmatch.de/), the successor to
REPuter (22–24). Vmatch is based on a comprehensive
implementation of enhanced suffix arrays (27) which
provides the power of suffix trees with lower space
requirements. A one-nucleotide mismatch is allowed,
so that minimal CRISPRs with a single-nucleotide
mutation between DRs can be found. The maximal repeats
obtained are then grouped to define regions
of possible CRISPRs with a display of consensus DR
candidates related to each cluster.
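The two ideas in this paragraph, tolerating one mismatch between DR copies and grouping repeat pairs into candidate regions, can be sketched as follows. This is not Vmatch's algorithm or CRISPRFinder's actual grouping rule, neither of which is detailed here: the sketch assumes that a one-nucleotide mismatch means a Hamming distance of at most one between equal-length copies, and that pairs are grouped when their genomic spans overlap. All names are hypothetical.

```python
def hamming_at_most_one(a, b):
    """True when two equal-length sequences differ at no more than one
    position (assumed meaning of a one-nucleotide mismatch)."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= 1


def group_pairs(pairs):
    """Merge repeat pairs (start1, start2, length) into candidate regions.
    Each pair spans [start1, start2 + length); spans that overlap or touch
    are merged into one region (an assumed grouping rule)."""
    spans = sorted((s1, s2 + length) for s1, s2, length in pairs)
    regions = []
    for start, end in spans:
        if regions and start <= regions[-1][1]:
            # Overlaps the previous region: extend it if needed.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


# Three hypothetical repeat pairs: two chained pairs and one distant pair.
example_pairs = [(100, 160, 28), (160, 221, 28), (5000, 5060, 30)]
print(group_pairs(example_pairs))
```

With the example pairs, the two chained pairs collapse into a single region and the distant pair forms a second one, so each region can then be examined separately for its consensus DR candidates.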
The second st (...truncated)