SEURAT: Visual analytics for the integrated analysis of microarray data

http://www.biomedcentral.com/content/pdf/1755-8794-3-21.pdf

Alexander Gribov (0), Martin Sill (2), Sonja Lück (1), Frank Rücker (1), Konstanze Döhner (1), Lars Bullinger (1), Axel Benner (2), Antony Unwin (0)

(0) Department of Computer Oriented Statistics and Data Analysis, University of Augsburg, Universitätsstr. 14, 86159 Augsburg, Germany
(1) Department of Internal Medicine III, University Hospital of Ulm, Albert-Einstein-Allee 23, D-89081 Ulm, Germany
(2) Division of Biostatistics, German Cancer Research Center, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany

Background: In translational cancer research, gene expression data are collected together with clinical data and genomic data arising from other chip-based high-throughput technologies. Software tools for the joint analysis of such high-dimensional data sets together with clinical data are required.

Results: We have developed an open source software tool which provides interactive visualization capability for the integrated analysis of high-dimensional gene expression data together with associated clinical data, array CGH data and SNP array data. The different data types are organized by a comprehensive data manager. Interactive tools are provided for all graphics: heatmaps, dendrograms, barcharts, histograms, eventcharts and a chromosome browser, which displays genetic variations along the genome. All graphics are dynamic and fully linked, so that any object selected in one graphic is highlighted in all other graphics. For exploratory data analysis the software provides unsupervised data analytics such as clustering, seriation algorithms and biclustering algorithms.

Conclusions: The SEURAT software meets the growing needs of researchers to perform joint analysis of gene expression, genomic and clinical data.

Background

The rapid development of microarray technologies in recent years has made it possible to acquire a large spectrum of different molecular data types. In translational cancer research, gene expression data are usually collected together with additional clinical information and genomic data from other high-throughput technologies such as microarray-based comparative genomic hybridization (array CGH) or SNP (single nucleotide polymorphism) arrays. The availability of these related, mostly high-dimensional data sets calls for software tools that can analyze them all together in an integrated fashion. Currently there is a lack of applications that enable exploratory analysis of integrated data sets. Most visualization and clustering tools are limited in their ability to handle gene expression, genomic and clinical data together.

To our knowledge, only a few software tools are able to perform an integrated analysis. The VAMP software [1] is able to visualize genomic gain and loss information together with gene expression data. The focus of VAMP is on the comparison of genomic information between tumors, and thus all data types are displayed along the physical position in the genome. The gene expression data cannot be reordered according to expression patterns, and clustering algorithms can be applied only to cluster different tumors. A single graphic allows the display of additional clinical data by a simple color code, and this representation is limited to categorical variables. In addition, the graphics are not linked, so each graphic has to be interpreted separately.

Other tools able to visualize gene expression data together with genetic variations and other molecular data types, such as RNAi data and methylation data, are the Integrative Genomics Viewer (IGV) [2] developed by the Broad Institute and the Integrated Genome Browser [3].
These tools organize the different data types in the form of tracks within a browser window, similar to the well-known UCSC Genome Browser. The different data types are displayed one below the other along the physical positions of the genome. This visualization allows the user to examine relations between different molecular data at specific known genomic locations, but it cannot reveal new trans-regulatory relations. Furthermore, as the number of subjects and molecular data types increases, comparing the many tracks becomes complicated. IGV additionally offers the possibility of aligning clinical data using color codes. For continuous data, and especially for time-to-event data such as survival times, such a representation is not sufficient.

Besides these open source software solutions, some proprietary software tools are able to perform an integrated analysis, e.g., the Genomic Workbench (Agilent Technologies, Santa Clara, California) or Acuity (Enterprise Microarray Informatics). However, although they can handle the different data types, their visualizations are limited to stand-alone graphics that are not linked to other displays such as clustering results or summary statistics of clinical variables.

In order to reveal new biologically meaningful relations possibly hidden inside the different data sets, we follow the philosophy of exploratory data analysis [4]. Our approach to this problem was to develop open source software capable of performing in-depth exploratory analyses with the help of interactive graphics. In contrast to other software tools, which usually aim to visualize the information of the different data types within a single graphic, we display each data type in its own graphic and link the graphics interactively. Each graphic corresponds to the usual visualization of the corresponding data type and can easily be interpreted.
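
The idea of linking separate graphics through a shared selection can be illustrated with a small observer-style sketch. This is not SEURAT's actual implementation (which is written in Java); the class names `SelectionModel`, `HeatmapView` and `BarchartView`, the sample IDs and the clinical categories are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Set


class SelectionModel:
    """Shared set of selected sample IDs; notifies every registered view."""

    def __init__(self) -> None:
        self._selected: Set[str] = set()
        self._listeners: List[Callable[[Set[str]], None]] = []

    def register(self, listener: Callable[[Set[str]], None]) -> None:
        self._listeners.append(listener)

    def select(self, sample_ids: Set[str]) -> None:
        # Replace the current selection and broadcast a copy to all views.
        self._selected = set(sample_ids)
        for listener in self._listeners:
            listener(set(self._selected))


class HeatmapView:
    """Stand-in for a heatmap whose columns are samples (hypothetical)."""

    def __init__(self, columns: List[str], model: SelectionModel) -> None:
        self.columns = columns
        self.highlighted_columns: List[int] = []
        model.register(self.on_selection)

    def on_selection(self, selected: Set[str]) -> None:
        # Highlight the column indices of the selected samples.
        self.highlighted_columns = [
            i for i, sample in enumerate(self.columns) if sample in selected
        ]


class BarchartView:
    """Stand-in for a barchart of one categorical clinical variable."""

    def __init__(self, category_of: Dict[str, str], model: SelectionModel) -> None:
        self.category_of = category_of
        self.highlighted: Dict[str, int] = {}
        model.register(self.on_selection)

    def on_selection(self, selected: Set[str]) -> None:
        # Count the selected samples that fall into each bar (category).
        counts: Dict[str, int] = {}
        for sample in selected:
            if sample in self.category_of:
                category = self.category_of[sample]
                counts[category] = counts.get(category, 0) + 1
        self.highlighted = counts


# Invented example data: four samples and one made-up clinical variable.
model = SelectionModel()
heatmap = HeatmapView(["P1", "P2", "P3", "P4"], model)
barchart = BarchartView(
    {"P1": "normal", "P2": "abnormal", "P3": "normal", "P4": "abnormal"}, model
)

# Selecting samples once updates both views.
model.select({"P1", "P3"})
print(heatmap.highlighted_columns)  # [0, 2]
print(barchart.highlighted)         # {'normal': 2}
```

Because each view holds only a reference to the shared selection, adding a further graphic (for example a histogram) would not require changes to the existing views.
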
Combining these dynamic graphics by linking, so that selected objects are highlighted in all other graphics, and providing unsupervised statistical methods enables users to perform very effective exploratory analyses. The proposed software does not compete with conventional software that offers inferential statistics; rather, it provides a complementary analytical approach. The advantage of our exploratory software for the analysis of high-dimensional integrated data sets is demonstrated by an analysis of data collected from acute myeloid leukemia (AML) patients.

Implementation

To ensure portability and platform independence, SEURAT has been written in Java. Most of the GUI elements are based on Java Swing packages, so SEURAT has a uniform look and feel independent of the underlying platform. The software establishes a connection to the R statistical software [5] via Rserve [6]. Rserve is a TCP/IP server which allows other programs to communicate with R. This connection potentially provides access to all functions implemented in R and Bioconductor [7]. For clustering and seriation algorithms SEURAT uses the facilities of the R packages amap [8], seriation [9] and biclust [10]. To use SEURAT, the user must install R, the relevant R packages, and the Java Runtime Environment (JRE) 1.6. The softwar (...truncated)