TRedD—A database for tandem repeats over the edit distance

Database, Vol. 2010, Article ID baq003, doi:10.1093/database/baq003

Original article

Dina Sokol1,* and Firat Atagun2,*

1Department of Computer and Information Science, Brooklyn College of the City University of New York, 2900 Bedford Avenue, Brooklyn, NY 11210 and 2Department of Computer Science, The Graduate Center of the City University of New York, 365 Fifth Avenue, New York, NY 10016, USA

*Corresponding author: Tel: +1 718 951 5000 (ext. 2065); Fax: +1 718 951 4842.
*Correspondence may also be addressed to Firat Atagun. Tel: +1 718 951 5657; Fax: +1 718 951 4842.

Submitted 18 October 2009; Revised 13 January 2010; Accepted 11 February 2010

A ‘tandem repeat’ in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats are common in the genomes of both eukaryotic and prokaryotic organisms. They are significant markers for human identity testing, disease diagnosis, sequence homology and population studies. In this article, we describe a new database, TRedD, which contains the tandem repeats found in the human genome. The database is publicly available online, and the software for locating the repeats is also freely available. The definition of tandem repeats used by TRedD is a new and innovative definition based upon the concept of ‘evolutive tandem repeats’.
In addition, we have developed a tool, called TandemGraph, to graphically depict the repeats occurring in a sequence. This tool can be coupled with any repeat finding software, and it should greatly facilitate analysis of results.

Database URL: http://tandem.sci.brooklyn.cuny.edu/

Introduction

A ‘tandem repeat’ in DNA is a sequence of two or more contiguous, approximate copies of a pattern of nucleotides. Tandem repeats are common in the genomes of both eukaryotic and prokaryotic organisms. They are significant markers for human identity testing, disease diagnosis, sequence homology and population studies.

DNA consisting of tandem repeats is also called ‘satellite DNA’. Satellite DNA is usually classified into ‘satellites’ (spanning megabases of DNA), ‘minisatellites’ (repeat units in the range 10–60 bp, spanning 1–20 kb) and ‘microsatellites’ (repeat units in the range 1–6 bp, spanning <150 bases). The minisatellites are also called ‘Variable Number Tandem Repeats’ or VNTRs and the microsatellites are often referred to as ‘Short Tandem Repeats’ or STRs.

Tandem repeats are responsible for over 30 inherited diseases in humans. Expansions of simple DNA repeats have been linked to hereditary disorders in humans, including fragile X syndrome, myotonic dystrophy, Huntington’s disease, various spinocerebellar ataxias, Friedreich’s ataxia and others (1). These diseases are sometimes called the ‘repeat expansion diseases’ since they are caused by long and highly polymorphic tandem repeats (2, 3).

The repeats in the human genome are the genetic markers used in DNA forensics (4).
Since the number of adjacent repeated units varies from individual to individual, the copy number of a tandem repeat can be used to identify an individual and to establish relations such as parent or grandparent. Tandem repeats are also used in population studies (5), conservation biology (6) and in conjunction with multiple sequence alignments (7, 8).

Tandem repeats are found in both coding and non-coding regions of DNA. Expansions of repeats found in the protein-coding portions of genes can affect the function of the gene by causing synthesis of malfunctioning proteins. Repeats in non-coding regions have been shown to affect biological processes by affecting gene expression, transcription and translation.

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Preliminaries

Although it is possible to use the TRedD database as is, it would be beneficial to understand the underlying definition of approximate tandem repeats that is used by the TRed software. In this section, we give a summary of the definition and the concepts of the algorithm used in TRed.

Definition

The definition of tandem repeats over the ‘edit distance’ uses the model of ‘evolutive tandem repeats’ (20). The model assumes that each copy of the repeat, from left to right, is derived from the previous copy through zero or more mutations. Thus, each copy in the repeat is similar to its predecessor and successor copy. The edit distance between two strings is defined as the minimum number of insertions, deletions and character replacements necessary to transform one string into another. Let $\mathrm{ed}(\cdot,\cdot)$ denote the edit distance between two strings.

Definition 1. A word $r$ is a $K$-edit repeat if it can be partitioned into consecutive subwords, $r = v' w_1 w_2 \ldots w_\ell v''$, $\ell \ge 2$, such that

$$\mathrm{ed}(v', w_1') + \sum_{i=1}^{\ell-1} \mathrm{ed}(w_i, w_{i+1}) + \mathrm{ed}(w_\ell'', v'') \le K,$$

where $w_1'$ is some suffix of $w_1$ and $w_\ell''$ is some prefix of $w_\ell$.

A $K$-edit repeat is a sequence of ‘evolving’ copies of a pattern such that there are at most $K$ insertions, deletions and mismatches, overall, between all consecutive copies of the repeat. For example, the word $r =$ (...truncated)
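The edit distance and the cost in Definition 1 can be sketched in Python as follows. This is only an illustration of the definition, not the algorithm used by TRed: it evaluates the cost of one partition that the caller supplies, and it does not search a sequence for repeats. The function names `edit_distance` and `partition_cost` and their parameters are hypothetical and do not come from the TRed software. Because the definition only requires that some suffix of $w_1$ and some prefix of $w_\ell$ exist, the sketch takes the minimum over all of them (including the empty string).

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or replace (cost 1)
            ))
        prev = curr
    return prev[-1]


def partition_cost(v_left: str, copies: Sequence[str], v_right: str) -> int:
    """Cost of the partition r = v' w_1 ... w_l v'' as in Definition 1.

    Illustrative helper (hypothetical name): the partition is given,
    not discovered.
    """
    if len(copies) < 2:
        raise ValueError("Definition 1 requires at least two copies")
    first, last = copies[0], copies[-1]
    # ed(v', w_1') for the best suffix w_1' of w_1.
    left = min(edit_distance(v_left, first[s:]) for s in range(len(first) + 1))
    # ed(w_l'', v'') for the best prefix w_l'' of w_l.
    right = min(edit_distance(last[:p], v_right) for p in range(len(last) + 1))
    # Sum of ed(w_i, w_{i+1}) over consecutive copies.
    middle = sum(edit_distance(x, y) for x, y in zip(copies, copies[1:]))
    return left + middle + right
```

Under these assumptions, a partition witnesses a $K$-edit repeat when `partition_cost(...) <= K`. For instance, `partition_cost("", ["ACGT", "ACGA", "ACGA"], "")` counts the single replacement between the first two copies.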