The UCSC Table Browser data retrieval tool
Donna Karolchik
Angela S. Hinrichs
Terrence S. Furey
Krishna M. Roskin
Charles W. Sugnet
David Haussler
W. James Kent
Center for Biomolecular Science and Engineering, University of California Santa Cruz (UCSC), School of Engineering, 1156 High Street, Santa Cruz, CA 95064-1077, USA
The University of California Santa Cruz (UCSC) Table Browser (http://genome.ucsc.edu/cgi-bin/hgText) provides text-based access to a large collection of genome assemblies and annotation data stored in the Genome Browser Database. A flexible alternative to the graphical Genome Browser, this tool offers an enhanced level of query support that includes restrictions based on field values, free-form SQL queries and combined queries on multiple tables. Output can be filtered to restrict the fields and lines returned, and may be organized into one of several formats, including a simple tab-delimited file that can be loaded into a spreadsheet or database as well as advanced formats that may be uploaded into the Genome Browser as custom annotation tracks. The Table Browser User's Guide located on the UCSC website provides instructions and detailed examples for constructing queries and configuring output.
The UCSC Table Browser data retrieval tool is built on top of
the Genome Browser Database, a set of MySQL relational
databases that each store sequence and annotation data for one
genome assembly (1). Tables within the databases may be
differentiated by whether the data are based on genomic
start-stop coordinates or are independent of position.
Positional tables contain data associated with specific
locations in the genome, such as mRNA alignments, gene
predictions, cross-species alignments and various other
annotations. Each of the annotation tracks displayed in the
graphical Genome Browser is based on one or more positional
tables. Data associated with custom annotation tracks active
within the user's Table Browser session are also available as
positional tables.
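As a minimal sketch of what a range-based query over a positional table might look like, the following Python example builds a small in-memory SQLite table and selects the records that overlap a region. The table name mrna_align, the column names chrom, chromStart and chromEnd, the half-open coordinate convention and all row values are assumptions made for illustration; SQLite stands in for MySQL only so that the example is self-contained.

```python
import sqlite3


def overlapping(conn, chrom, start, end):
    """Return rows whose [chromStart, chromEnd) interval overlaps [start, end)."""
    query = (
        "SELECT name, chromStart, chromEnd FROM mrna_align "
        "WHERE chrom = ? AND chromStart < ? AND chromEnd > ? "
        "ORDER BY chromStart"
    )
    return conn.execute(query, (chrom, end, start)).fetchall()


# Hypothetical positional table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mrna_align "
    "(chrom TEXT, chromStart INTEGER, chromEnd INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO mrna_align VALUES (?, ?, ?, ?)",
    [
        ("chr1", 100, 500, "mrnaA"),
        ("chr1", 450, 900, "mrnaB"),
        ("chr1", 2000, 2500, "mrnaC"),
        ("chr2", 100, 500, "mrnaD"),
    ],
)

hits = overlapping(conn, "chr1", 400, 1000)
for name, start, end in hits:
    print(name, start, end)
```

Two intervals overlap exactly when each one starts before the other ends, which is why the query compares chromStart with the end of the region and chromEnd with its start.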
Non-positional tables contain data not tied to genomic
location, for example a table that correlates a Genethon
marker name with a Marshfield marker name. Some
non-positional tables relate internal numeric mRNA IDs to
extended information such as author, tissue or keyword.
Other meta tables contain information about the structure of
the database itself or describe external files containing
sequence data.
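The way a non-positional table can relate an internal numeric ID to extended information may be sketched as a join between a main table and small lookup tables. In the Python example below, the table names mrna, author and tissue, their columns, and every value are hypothetical; the actual schema is described in the database documentation and may differ.

```python
import sqlite3

# Hypothetical schema: the mrna table stores numeric IDs that point
# into separate lookup tables holding the descriptive text.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mrna (id INTEGER PRIMARY KEY, acc TEXT,
                       author INTEGER, tissue INTEGER);
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tissue (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO author VALUES (1, 'Doe,J.');
    INSERT INTO tissue VALUES (1, 'brain');
    INSERT INTO mrna VALUES (10, 'ACC0001', 1, 1);
    """
)


def describe(conn, accession):
    """Resolve the numeric author and tissue IDs of one mRNA to text."""
    query = (
        "SELECT mrna.acc, author.name, tissue.name FROM mrna "
        "JOIN author ON mrna.author = author.id "
        "JOIN tissue ON mrna.tissue = tissue.id "
        "WHERE mrna.acc = ?"
    )
    return conn.execute(query, (accession,)).fetchone()


print(describe(conn, "ACC0001"))
```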
Because of the large size of the data set stored in each
database, particular attention has been paid to maintaining
adequate interactive performance. The databases contain
optimizations to support range-based queries from the
Table Browser and Genome Browser. Smaller tables are
indexed on a few critical fields and the data are presorted prior
to loading into the database. With larger tables, the data are
separated by chromosome into smaller tables, and a binning
scheme is implemented on the larger chromosome tables.
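A hierarchical binning scheme of the kind mentioned above can be sketched in Python as follows. Each item is assigned to the smallest bin that fully contains it, and a range query then only needs to inspect the few bins that can overlap the range. The constants (five levels, 128 kb finest bins, a factor of 8 between levels, coordinates below 512 Mb) and the function names are assumptions for illustration; the text does not specify the parameters used in the database.

```python
# Assumed layout: bin numbers of the first bin at each level, finest first.
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # finest bins span 2**17 = 128 kb
BIN_NEXT_SHIFT = 3    # each coarser level spans 8 times more


def bin_from_range(start, end):
    """Return the smallest bin that fully contains the half-open [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} does not fit the binning scheme")


def overlapping_bins(start, end):
    """Return every bin, at every level, that can hold items overlapping [start, end)."""
    bins = []
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins


print(bin_from_range(0, 1000))
print(overlapping_bins(135000, 136000))
```

In a database, the bin number would be stored in an indexed column, so that a range query can first restrict itself to the bins returned by overlapping_bins and then apply the exact start and end comparison to the remaining rows.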
The document http://genome.ucsc.edu/goldenPath/gbdDescriptions.html
contains a detailed description of the
database tables and fields, which are dumped weekly into
downloadable tab-delimited files.
In addition to the inclusion of the latest human and mouse
assemblies, the Genome Browser Database has expanded in
the past year to include rat, worm and a collection of species
targeted by the NISC Comparative Sequencing Program (2),
with plans to add support for several additional genomes in the
coming year.
Recently, the UCSC Genome Bioinformatics group has
placed considerable emphasis on comparative genome
analysis. The group has been active in the analysis of evolutionary
conservation and divergence among species (3,4),
phylogenetic analysis of rates of substitution (5) and multiple
species alignments. This research has resulted in the addition
of several new types of annotation data to the Genome
Browser Database.
The axtChain program written by Jim Kent produces
chained BLASTZ alignments between two species (6). This
alignment tool uses a gap scoring system that allows longer
gaps than traditional affine gap scoring systems and can also
tolerate gaps in both species simultaneously. Further
processing of the chained alignments with the chainNet program
outputs an alignment net that shows the best chain for every
part of the genome.
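To illustrate why the choice of gap scoring affects how long a gap an alignment can absorb, the toy Python comparison below contrasts an affine gap cost, which grows linearly with gap length, with a cost that grows more slowly. This is not the scoring function used by axtChain, which the text does not give; the logarithmic form and all numeric parameters are invented for illustration.

```python
import math


def affine_gap_cost(length, gap_open=400, gap_extend=30):
    """Classic affine cost: a fixed opening penalty plus a per-base penalty."""
    return gap_open + gap_extend * length


def sublinear_gap_cost(length, gap_open=400, scale=300):
    """Hypothetical cost that grows with the logarithm of the gap length."""
    return gap_open + scale * math.log1p(length)


for length in (10, 1000, 100000):
    print(length, affine_gap_cost(length), round(sublinear_gap_cost(length)))
```

Under the affine cost a 100 kb gap is penalized so heavily that two flanking aligned blocks would rarely be joined, whereas under a slowly growing cost the same gap remains affordable.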
UCSC has also been collaborating closely with the Penn
State University Bioinformatics Group to produce three-way
multiple species alignments using Webb Miller's multiz
program, which takes BLASTZ and axtBest alignments as
input (7,8).
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Many research scientists are familiar with the UCSC
Genome Browser (9), the graphical interface to the Genome
Browser Database that displays requested portions of genome
assemblies together with a series of aligned annotation
tracks. Despite its ease of use, situations exist in which a
graphical browser may not be the optimal tool for working
with genomic data. A user might wish to view the raw data or
examine the relationships between the tables underlying the
browser. It is often desirable to filter the display output with
greater restrictions than are offered by the Genome Browser,
or to output the data in a text-based format that can be
imported into other programs.
The UCSC Table Browser, which may be accessed
directly at http://genome.ucsc.edu/cgi-bin/hgText or through
the Tables link on the UCSC Genome Bioinformatics home
page (http://genome.ucsc.edu), provides a powerful and
flexible alternative for querying and manipulating the
annotation tables within the Genome Browser Database. Using
Table Browser form-based or free-form queries, one may
quickly and easily extract subsets of the database, in many
cases eliminating the need to set up a local copy of the MySQL
database. By configuring the tool's output options, the user can
generate a custom annotation track that may be automatically
added to the graphical browser session, or create a file in one
of several output formats that can be used as input into other
programs. The Table Browser can also display basic statistics
calculated over a selected subset of data.
BASIC DATA QUERIES
In its most basic form, the Table Browser can be used to
retrieve a specific subset of records from a table in a selected
genome assembly. The user specifies a position of interest
within the assembly (or the keyword "genome" to access data
from the entire assembly), selects a table, and then chooses the
"Get all fields" option. The Table Browser displays the query
results in a tab-delimited text format that can be easily
downloaded and imported into text editors, spreadsheets and
other databases, or may be further processed by the user's own
scripts.
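The basic query described above can be mimicked offline with a short Python sketch: interpret a position (or the keyword "genome"), keep the matching records, and write all fields as tab-delimited text. The position syntax chrom:start-end, the field names, the half-open overlap test and the sample records are assumptions for illustration and are not taken from the Table Browser itself.

```python
import csv
import io
import re

FIELDS = ["chrom", "chromStart", "chromEnd", "name"]  # hypothetical fields


def parse_position(text):
    """Return None for the whole assembly, else a (chrom, start, end) tuple."""
    text = text.strip()
    if text.lower() == "genome":
        return None
    match = re.fullmatch(r"(\w+):(\d+)-(\d+)", text.replace(",", ""))
    if match is None:
        raise ValueError(f"unrecognized position: {text!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def select_rows(rows, position):
    """Keep the rows that overlap the position, or all rows for the genome."""
    if position is None:
        return list(rows)
    chrom, start, end = position
    return [
        row for row in rows
        if row["chrom"] == chrom
        and row["chromStart"] < end
        and row["chromEnd"] > start
    ]


def to_tab_delimited(rows):
    """Serialize the rows with a header line, one record per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


records = [
    {"chrom": "chr7", "chromStart": 1200, "chromEnd": 1800, "name": "mrnaA"},
    {"chrom": "chr7", "chromStart": 5000, "chromEnd": 5600, "name": "mrnaB"},
]
print(to_tab_delimited(select_rows(records, parse_position("chr7:1,000-2,000"))))
```

The resulting text can be read back with csv.DictReader using the same delimiter, or loaded directly into a spreadsheet or database.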
For example, a user who is examining alternative splicing in
the human genome might be interested in downloading the
indices of all mRNA sequences that align to a chromosomal
region containing a particular gene. One would set the
Table Br (...truncated)