SEURAT: Visual analytics for the integrated analysis of microarray data
Alexander Gribov (0), Martin Sill (2), Sonja Lück (1), Frank Rücker (1), Konstanze Döhner (1), Lars Bullinger (1), Axel Benner (2), Antony Unwin (0)
0 Department of Computer Oriented Statistics and Data Analysis, University of Augsburg, Universitätsstr. 14, 86159 Augsburg, Germany
1 Department of Internal Medicine III, University Hospital of Ulm, Albert-Einstein-Allee 23, D-89081 Ulm, Germany
2 Division of Biostatistics, German Cancer Research Center, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
Background: In translational cancer research, gene expression data are collected together with clinical data and genomic data arising from other chip-based high-throughput technologies. Software tools for the joint analysis of such high-dimensional data sets together with clinical data are required.
Results: We have developed an open source software tool which provides interactive visualization capability for the integrated analysis of high-dimensional gene expression data together with associated clinical data, array CGH data and SNP array data. The different data types are organized by a comprehensive data manager. Interactive tools are provided for all graphics: heatmaps, dendrograms, barcharts, histograms, eventcharts and a chromosome browser, which displays genetic variations along the genome. All graphics are dynamic and fully linked so that any object selected in a graphic will be highlighted in all other graphics. For exploratory data analysis the software provides unsupervised methods such as clustering, seriation and biclustering algorithms.
Conclusions: The SEURAT software meets the growing needs of researchers to perform joint analysis of gene expression, genomic and clinical data.
Background
The rapid development of microarray technologies in
recent years has led to the possibility of acquiring a large
spectrum of different molecular data types. In
translational cancer research, gene expression data are usually
collected together with additional clinical information
and genomic data from other high-throughput
technologies such as microarray-based comparative genomic
hybridization (array CGH) or SNP (single nucleotide
polymorphism) arrays. The availability of these related,
mostly high-dimensional data sets calls for software tools
which can analyze them all together in an integrated
fashion. Currently, applications that enable exploratory
analysis of such integrated data sets are lacking: most
visualization and clustering tools are limited in their
ability to handle gene expression, genomic and clinical data
together. To our knowledge, only a few software tools are
able to perform an integrated analysis.
The VAMP software [1] is able to visualize genomic
gain and loss information together with gene expression
data. The focus of VAMP is on the comparison of the
genomic information between tumors and thus all data
types are displayed along the physical position in the
genome. It is not possible to reorder the gene expression
data according to the expression patterns, and clustering
algorithms can only be applied to cluster different
tumors. A single graphic allows the display of additional
clinical data by a simple color code, and this
representation is limited to categorical variables. In addition, the
graphics are not linked, so that each graphic has to be
interpreted separately.
Other tools able to visualize gene expression data
together with genetic variations and other molecular data
types, such as RNAi data and methylation data, are the
Integrative Genomics Viewer (IGV) [2] developed by the
Broad Institute and the Integrated Genome Browser [3].
These tools organize the different data types in the form
of tracks within a browser window, similar to the
well-known UCSC Genome Browser. The different data types
are displayed one below the other along the physical
positions of the genome. This visualization allows the user to
examine relations between different molecular data at
specific known genomic locations, but it cannot
reveal new trans-regulatory relations. Furthermore, with
an increasing number of subjects and molecular data
types, the comparison of the many tracks becomes
complicated. IGV additionally offers the possibility of aligning
clinical data using color codes. For continuous data, and
especially for time-to-event data such as survival times, such a
representation is not sufficient.
Besides these open source software solutions, some
proprietary software tools are able to perform an
integrated analysis, e.g., the Genomic Workbench (Agilent
Technologies, Santa Clara, California) or Acuity
(Enterprise Microarray Informatics). However, although they
can handle the different data types, their visualizations are
limited to stand-alone graphics that are not linked to other displays
such as clustering results or summary statistics of clinical
variables. In order to reveal new biologically meaningful
relations possibly hidden inside the different data sets, we
follow the philosophy of exploratory data analysis [4].
Our approach to this problem was to develop open
source software capable of performing in-depth
exploratory analyses with the help of interactive graphics. In
contrast to other software tools that usually aim to
visualize the information of the different data types within a
single graphic, we display each data type in its own
graphic and link them using interactive graphics. Each
graphic corresponds to the usual visualization of the
corresponding data type and can easily be interpreted.
Combining these dynamic graphics by linking, so that objects
selected are highlighted in all other graphics, and
providing unsupervised statistical methods enables users to
perform very effective exploratory analyses. The proposed
software does not compete with usual software
approaches that offer inferential statistics, but provides a
complementary analytical approach. The advantage of
our exploratory software for the analysis of
high-dimensional integrated data sets is demonstrated by an
analysis of data collected from acute myeloid leukemia
(AML) patients.
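The linking principle described above can be illustrated with a minimal Java sketch. It is not taken from the SEURAT source code: all class and method names (SelectionModel, LinkedView, userSelects) are hypothetical, and the views only record what they would highlight instead of drawing anything. The assumption is that every graphic registers with one shared selection model, which broadcasts each new selection to all registered graphics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical names throughout; this is not SEURAT's actual class design.

/** Callback implemented by every graphic that mirrors the shared selection. */
interface SelectionListener {
    void selectionChanged(Set<String> selectedIds);
}

/** One shared selection; every linked graphic registers with the same instance. */
class SelectionModel {
    private final List<SelectionListener> listeners = new ArrayList<>();
    private Set<String> selected = Collections.emptySet();

    void addListener(SelectionListener listener) {
        listeners.add(listener);
    }

    /** Replaces the current selection and notifies all registered graphics. */
    void select(Set<String> ids) {
        selected = Collections.unmodifiableSet(new LinkedHashSet<>(ids));
        for (SelectionListener listener : listeners) {
            listener.selectionChanged(selected);
        }
    }
}

/** Stand-in for a heatmap, barchart, etc.; it only records what it would highlight. */
class LinkedView implements SelectionListener {
    private final String name;
    private final SelectionModel model;
    private Set<String> highlighted = Collections.emptySet();

    LinkedView(String name, SelectionModel model) {
        this.name = name;
        this.model = model;
        model.addListener(this);
    }

    /** Simulates a user selecting objects (e.g. with the mouse) in this graphic. */
    void userSelects(Set<String> ids) {
        model.select(ids);
    }

    @Override
    public void selectionChanged(Set<String> selectedIds) {
        highlighted = selectedIds; // a real view would repaint here
    }

    Set<String> highlighted() {
        return highlighted;
    }

    String name() {
        return name;
    }
}

class LinkedSelectionDemo {
    public static void main(String[] args) {
        SelectionModel model = new SelectionModel();
        LinkedView heatmap = new LinkedView("heatmap", model);
        LinkedView barchart = new LinkedView("barchart", model);

        // Selecting two (made-up) sample identifiers in the heatmap
        // also updates the barchart, because both share one model.
        heatmap.userSelects(Set.of("patient_03", "patient_17"));
        System.out.println(barchart.name() + " highlights "
                + barchart.highlighted().size() + " objects");
    }
}
```

In this design a graphic never needs to know which other graphics exist; adding a further display only requires registering it with the shared model.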
Implementation
To ensure portability and platform independence,
SEURAT has been written in Java. Most of the GUI
elements are based on Java Swing packages so that
SEURAT has a uniform look and feel independent of the
underlying platform. The software establishes a
connection to the R statistical software [5] via Rserve [6]. Rserve
is a TCP/IP server which allows other programs to
communicate with R. This connection potentially provides
access to all functions implemented in R and
Bioconductor [7]. For clustering and seriation algorithms SEURAT
uses the facilities of the R-packages amap [8], seriation
[9] and biclust [10]. To use SEURAT, the user must have R, the
relevant R packages, and the Java Runtime Environment
(JRE) 1.6 installed on their computer. The
softwar (...truncated)