TRedD—A database for tandem repeats over the edit distance
Database, Vol. 2010, Article ID baq003, doi:10.1093/database/baq003
Original article
Dina Sokol1,* and Firat Atagun2,*
1Department of Computer and Information Science, Brooklyn College of the City University of New York, 2900 Bedford Avenue,
Brooklyn, NY 11210 and 2Department of Computer Science, The Graduate Center of the City University of New York, 365 Fifth Avenue,
New York, NY 10016, USA
*Corresponding author: Tel: +1 718 951 5000 (ext. 2065); Fax: +1 718 951 4842. Email:
*Correspondence may also be addressed to Firat Atagun. Tel: +1 718 951 5657, Fax: +1 718 951 4842. Email:
Submitted 18 October 2009; Revised 13 January 2010; Accepted 11 February 2010
A ‘tandem repeat’ in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides.
Tandem repeats are common in the genomes of both eukaryotic and prokaryotic organisms. They are significant markers
for human identity testing, disease diagnosis, sequence homology and population studies. In this article, we describe a new
database, TRedD, which contains the tandem repeats found in the human genome. The database is publicly available
online, and the software for locating the repeats is also freely available. The definition of tandem repeats used by TRedD is
a new and innovative definition based upon the concept of ‘evolutive tandem repeats’. In addition, we have developed a
tool, called TandemGraph, to graphically depict the repeats occurring in a sequence. This tool can be coupled with any
repeat finding software, and it should greatly facilitate analysis of results.
Database URL: http://tandem.sci.brooklyn.cuny.edu/
Introduction
A ‘tandem repeat’ in DNA is a sequence of two or more
contiguous, approximate copies of a pattern of nucleotides.
Tandem repeats are common in the genomes of both
eukaryotic and prokaryotic organisms. They are significant
markers for human identity testing, disease diagnosis,
sequence homology and population studies.
DNA consisting of tandem repeats is also called ‘satellite
DNA’. Satellite DNA is usually classified into ‘satellites’
(spanning megabases of DNA), ‘minisatellites’ (repeat units
in the range 10–60 bp, spanning 1–20 kb) and ‘microsatellites’ (repeat units in the range 1–6 bp, spanning <150
bases). The minisatellites are also called ‘Variable Number
Tandem Repeats’ or VNTRs and the microsatellites are
often referred to as ‘Short Tandem Repeats’ or STRs.
Tandem repeats are responsible for over 30 inherited
diseases in humans. Expansions of simple DNA repeats
have been linked to hereditary disorders in humans, including fragile X syndrome, myotonic dystrophy, Huntington’s
disease, various spinocerebellar ataxias, Friedreich’s ataxia
and others (1). These diseases are sometimes called the
‘repeat expansion diseases’ since they are caused by long
and highly polymorphic tandem repeats (2, 3).
Tandem repeats in the human genome are the genetic markers used in DNA forensics (4). Since the number of adjacent repeated units varies from individual to individual, the
copy number of a tandem repeat can be used to identify an
individual and to establish family relationships such as parent or grandparent.
Tandem repeats are also used in population studies (5),
conservation biology (6) and in conjunction with multiple
sequence alignments (7, 8).
Tandem repeats are found in both coding and non-coding regions of DNA. Expansions of repeats found in
the protein-coding portions of genes can affect the function of the gene by causing synthesis of malfunctioning
proteins. Repeats in non-coding regions have been shown
to influence biological processes by altering gene expression,
transcription and translation.
© The Author(s) 2010. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium,
provided the original work is properly cited.
Preliminaries
Although it is possible to use the TRedD database as is, it
would be beneficial to understand the underlying
definition of approximate tandem repeats that is used by
the TRed software. In this section, we give a summary of
the definition and the concepts of the algorithm used in
TRed.
Definition
The definition of tandem repeats over the ‘edit distance’
uses the model of ‘evolutive tandem repeats’ (20). The
model assumes that each copy of the repeat, from left to
right, is derived from the previous copy through zero or
more mutations. Thus, each copy in the repeat is similar
to its predecessor and to its successor.
The edit distance between two strings is defined as the
minimum number of insertions, deletions and character
replacements necessary to transform one string into
another. Let $\mathrm{ed}(\cdot,\cdot)$ denote the edit distance between two
strings.
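As an illustration of the edit distance just defined, the following is a minimal sketch using the standard dynamic-programming recurrence. The function name `edit_distance` is chosen here for illustration only; this is not the TRed implementation, which the text above does not specify.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and character
    replacements needed to transform string a into string b."""
    # prev[j] is the distance between the prefix of a processed so far
    # and the prefix b[:j]. Start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # keep a match or replace
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # One deletion separates these two short nucleotide strings.
    print(edit_distance("ACGT", "AGT"))
```

The sketch keeps only two rows of the dynamic-programming table, so it uses memory proportional to the length of the second string.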
Definition 1 A word $r$ is a $K$-edit repeat if it can be partitioned into consecutive subwords, $r = v' w_1 w_2 \ldots w_\ell v''$, $\ell \geq 2$, such that
$$\mathrm{ed}(v', w_1') + \sum_{i=1}^{\ell-1} \mathrm{ed}(w_i, w_{i+1}) + \mathrm{ed}(w_\ell'', v'') \leq K,$$
where $w_1'$ is some suffix of $w_1$ and $w_\ell''$ is some prefix of $w_\ell$.
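To make Definition 1 concrete, the sketch below evaluates the total edit cost of one given partition of a word into a left fragment, two or more full copies, and a right fragment. It does not search for the best partition and is not the TRed algorithm; the names `partition_cost`, `v_left`, `copies` and `v_right` are hypothetical, and the example sequence is invented for illustration. A word is a K-edit repeat if at least one partition has cost at most K.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or replace
        prev = curr
    return prev[-1]


def partition_cost(v_left: str, copies: list[str], v_right: str) -> int:
    """Cost of the partition r = v_left + copies[0] + ... + copies[-1] + v_right
    under Definition 1 (hypothetical helper, for illustration only)."""
    if len(copies) < 2:
        raise ValueError("Definition 1 requires at least two full copies")
    first, last = copies[0], copies[-1]
    # The left fragment is compared with the best-matching suffix of the first copy.
    left = min(edit_distance(v_left, first[k:]) for k in range(len(first) + 1))
    # The right fragment is compared with the best-matching prefix of the last copy.
    right = min(edit_distance(last[:k], v_right) for k in range(len(last) + 1))
    # Each full copy is compared with its successor.
    middle = sum(edit_distance(x, y) for x, y in zip(copies, copies[1:]))
    return left + middle + right


if __name__ == "__main__":
    # Invented example: r = GT ACGT ACGT ACCT AC, with one replacement
    # between the second and third full copies.
    print(partition_cost("GT", ["ACGT", "ACGT", "ACCT"], "AC"))
```

Note that the sum compares each copy only with its neighbour, not with a fixed consensus pattern, which is what allows the pattern to drift gradually from left to right.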
A K-edit repeat is a sequence of ‘evolving’ copies of a
pattern such that there are at most K insertions, deletions
and mismatches, overall, between all consecutive copies of
the repeat. For example, the word r = (...truncated)