SEURAT: Visual analytics for the integrated analysis of microarray data
Gribov et al. BMC Medical Genomics 2010, 3:21
http://www.biomedcentral.com/1755-8794/3/21
Open Access
SOFTWARE
Alexander Gribov†1, Martin Sill†2, Sonja Lück3, Frank Rücker3, Konstanze Döhner3, Lars Bullinger3, Axel Benner2 and
Antony Unwin*1
Abstract
Background: In translational cancer research, gene expression data are collected together with clinical data and
genomic data arising from other chip-based high-throughput technologies. Software tools for the joint analysis of such
high-dimensional data sets together with clinical data are required.
Results: We have developed an open source software tool which provides interactive visualization capability for the
integrated analysis of high-dimensional gene expression data together with associated clinical data, array CGH data
and SNP array data. The different data types are organized by a comprehensive data manager. Interactive tools are
provided for all graphics: heatmaps, dendrograms, barcharts, histograms, eventcharts and a chromosome browser,
which displays genetic variations along the genome. All graphics are dynamic and fully linked so that any object
selected in a graphic will be highlighted in all other graphics. For exploratory data analysis the software provides
unsupervised data analytics like clustering, seriation algorithms and biclustering algorithms.
Conclusions: The SEURAT software meets the growing needs of researchers to perform joint analysis of gene
expression, genomic and clinical data.
Background
The rapid development of microarray technologies in
recent years has led to the possibility of acquiring a large
spectrum of different molecular data types. In translational cancer research, gene expression data are usually
collected together with additional clinical information
and genomic data from other high-throughput technologies such as microarray-based comparative genomic
hybridization (array CGH) or SNP (single nucleotide
polymorphism) arrays. The availability of these related,
mostly high-dimensional data sets calls for software tools
which can analyze them all together in an integrated fashion. Currently there is a lack of such applications that
enable exploratory analysis of integrated data sets. Most
visualization and clustering tools are limited in their ability to handle gene expression, genomic and clinical data
together. To our knowledge only a few software tools are
able to perform an integrated analysis.
* Correspondence:
1 Department of Computer Oriented Statistics and Data Analysis, University of
Augsburg, Universitätsstr. 14, 86159 Augsburg, Germany
† Contributed equally
Full list of author information is available at the end of the article
The VAMP software [1] is able to visualize genomic
gain and loss information together with gene expression
data. The focus of VAMP is on the comparison of the
genomic information between tumors and thus all data
types are displayed along the physical position in the
genome. It is not possible to reorder the gene expression
data according to the expression patterns and clustering
algorithms can only be applied to cluster different
tumors. A single graphic allows additional clinical data to be
displayed by a simple color code, but this representation is limited to categorical variables. In addition, the
graphics are not linked, so that each graphic has to be
interpreted separately.
Other tools able to visualize gene expression data
together with genetic variations and other molecular data
types like RNAi data and methylation data are the Integrative Genomic Viewer (IGV) [2] developed by the
Broad Institute and the Integrated Genome Browser [3].
These tools organize the different data types in the form
of tracks within a browser window similar to the well-known
UCSC Genome Browser. The different data types
are displayed one below the other along the physical positions of the genome. This visualization allows the user to
© 2010 Gribov et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
examine relations between different molecular data at
specific known genomic locations, but it cannot
reveal new trans-regulatory relations. Furthermore, as
the number of subjects and molecular data
types increases, comparing the many tracks becomes complicated. IGV additionally allows clinical data to be aligned
using color codes. For continuous data, and
especially for time-to-event data such as survival times, such a
representation is not sufficient.
Besides these open source software solutions, some
proprietary software tools are able to perform an integrated analysis, e.g., the Genomic Workbench (Agilent
Technologies, Santa Clara, California) or Acuity (Enterprise Microarray Informatics). However, although they
can handle the different data types, their visualizations are limited to stand-alone graphics that are not linked to other displays
such as clustering results or summary statistics of clinical
variables. In order to reveal new biologically meaningful
relations possibly hidden inside the different data sets, we
follow the philosophy of exploratory data analysis [4].
Our approach to this problem was to develop open
source software capable of performing in-depth exploratory analyses with the help of interactive graphics. In
contrast to other software tools that usually aim to visualize the information of the different data types within a
single graphic, we display each data type in its own
graphic and link them using interactive graphics. Each
graphic corresponds to the usual visualization of the corresponding data type and can easily be interpreted. Combining these dynamic graphics by linking, so that objects
selected are highlighted in all other graphics, and providing unsupervised statistical methods enables users to perform very effective exploratory analyses. The proposed
software does not compete with usual software
approaches that offer inferential statistics, but provides a
complementary analytical approach. The advantage of
our exploratory software for the analysis of high-dimensional integrated data sets is demonstrated by an
analysis of data collected from acute myeloid leukemia
(AML) patients.
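The linking idea described above, in which objects selected in one graphic are highlighted in all others, can be illustrated with a minimal Java sketch. This is not SEURAT's actual code: the class names SelectionModel, LinkedView and LinkedSelectionDemo, as well as the sample identifiers, are hypothetical and chosen only for illustration. The sketch assumes that every graphic registers with one shared selection model and computes its own highlighting from the broadcast selection.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical shared selection state; all linked graphics observe one instance. */
class SelectionModel {
    /** Callback implemented by every graphic that wants to be linked. */
    interface Listener {
        void selectionChanged(Set<String> selectedIds);
    }

    private final Set<String> selected = new LinkedHashSet<>();
    private final List<Listener> listeners = new ArrayList<>();

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    /** Replaces the current selection and notifies every registered graphic. */
    void select(Set<String> ids) {
        selected.clear();
        selected.addAll(ids);
        Set<String> snapshot = Collections.unmodifiableSet(new LinkedHashSet<>(selected));
        for (Listener listener : listeners) {
            listener.selectionChanged(snapshot);
        }
    }
}

/** Stand-in for one graphic (e.g. a heatmap or barchart) showing some samples. */
class LinkedView implements SelectionModel.Listener {
    private final String name;
    private final Set<String> displayedIds;
    private Set<String> highlighted = Collections.emptySet();

    LinkedView(String name, Set<String> displayedIds) {
        this.name = name;
        this.displayedIds = new LinkedHashSet<>(displayedIds);
    }

    /** Highlights only those selected objects that this graphic actually displays. */
    @Override
    public void selectionChanged(Set<String> selectedIds) {
        Set<String> hits = new LinkedHashSet<>(selectedIds);
        hits.retainAll(displayedIds);
        highlighted = hits;
    }

    Set<String> highlighted() {
        return highlighted;
    }

    String name() {
        return name;
    }
}

class LinkedSelectionDemo {
    public static void main(String[] args) {
        SelectionModel model = new SelectionModel();
        LinkedView heatmap = new LinkedView("heatmap", Set.of("p1", "p2", "p3"));
        LinkedView barchart = new LinkedView("barchart", Set.of("p2", "p3", "p4"));
        model.addListener(heatmap);
        model.addListener(barchart);

        // A selection made anywhere is broadcast to every linked graphic.
        model.select(Set.of("p2", "p4"));
        System.out.println(heatmap.name() + " highlights " + heatmap.highlighted());
        System.out.println(barchart.name() + " highlights " + barchart.highlighted());
    }
}
```

Keeping the selection in a single shared model means that a new type of graphic only has to implement the listener callback to take part in the linking. The sketch uses Set.of and therefore assumes Java 9 or later.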
Implementation
To ensure portability and platform independence,
SEURAT has been written in Java. Most of the GUI elements are based on Java Swing packages so that
SEURAT has a uniform look and feel independent of the
underlying platform. The software establishes a connection to the R statistical software [5] via Rserve [6]. Rserve
is a TCP/IP server which allows other programs to communicate with R. This connecti (...truncated)