ProbFAST: Probabilistic Functional Analysis System Tool
Israel T Silva (0, 1, 2), Ricardo ZN Vêncio (0, 2), Thiago YK Oliveira (0, 1, 2), Greice A Molfetta (0, 1, 2), Wilson A Silva Jr (0, 1, 2)
0 Department of Genetics, Faculty of Medicine, University of Sao Paulo, Ribeirao Preto, Brazil
1 National Institute of Science and Technology in Stem Cell and Cell Therapy, Center for Cell Therapy and Regional Blood Center, Ribeirao Preto, Brazil
2 Department of Genetics, Faculty of Medicine, University of Sao Paulo, Ribeirao Preto, Brazil
Background: The post-genomic era has brought new challenges regarding the understanding of the organization and function of the human genome. Many of these challenges are centered on the meaning of differential gene regulation under distinct biological conditions and can be addressed by analyzing the Multiple Differential Expression (MDE) of genes associated with normal and abnormal biological processes. Currently, MDE analyses are limited to the usual methods of differential expression, which were initially designed for paired analysis.
Results: We proposed a web platform named ProbFAST for MDE analysis, which uses Bayesian inference to identify key genes that are intuitively prioritized by means of probabilities. A simulation study showed that our method performs better than other approaches, and when applied to public expression data it proved flexible in retrieving relevant genes biologically associated with normal and abnormal biological processes.
Conclusions: ProbFAST is a freely accessible web-based application that enables MDE analysis on a global scale. It offers an efficient methodological approach for MDE analysis of a set of genes that are turned on and off, related to functional information, during the evolution of a tumor or tissue differentiation. The ProbFAST server can be accessed at http://gdm.fmrp.usp.br/probfast.
Background
Transcriptome analysis of a tissue or cell type has been
widely used since the development of methodological
approaches for the large-scale study of gene expression
such as SAGE [1], MPSS [2] and microarrays [3].
Next-generation sequencing technology has been adapted to
transcriptome analysis, and its ability to accurately
measure mRNA signals is expected to have an unprecedented
impact on gene expression analysis [4,5]. Thus, it is accepted
that high-throughput data represent the starting point for
furthering our understanding of the molecular disorders
associated with the physiopathology of a given
phenotype.
The most classical application of gene expression
analysis focuses on the identification of genes
differentially expressed between two biological conditions. At
this stage, a large number of statistical tests are available
for the precise identification of candidate genes [6,7]. However,
the network of biological processes involved in the evolution
of a tumor or in tissue differentiation is extremely complex
and requires the development of mathematical models
for the simultaneous analysis of a set of genes in two or
more biological conditions. Analyses of this nature are
currently performed using standard methods designed
for paired analyses. Thus, methods are needed for analyzing
the expression of a gene across multiple conditions. We
refer to this approach in the current study as Multiple
Differential Expression (MDE).
An example of the application of the MDE approach is
illustrated by the following question: which genes
show an increasing level of expression in three libraries
(A, B and C) representing the stages (evolution) of a
tumor? To answer this question, the usual procedure
analyzes pairs of libraries separately and takes
conjunctions or disjunctions of the relations found, e.g. A > B
AND B > C. This analysis is traditionally used to
select any gene g with an expression profile such as Ag >
Bg > Cg. In this type of paired analysis, the main problem
is the sensitivity and specificity of the statistical tests used to
detect which genes are differentially expressed [8]. These
statistical measures are closely related to the concepts of
type I and type II errors, and the errors are compounded when
more than two biological conditions are analyzed
simultaneously. To address this shortfall, we introduced a
Bayesian model that generalizes the
pairwise comparisons in order to perform MDE analysis.
It is a new probabilistic method for targeted gene
selection in two or more classes through an intuitive approach
in which the user formulates a question and a
probability is attached to it. In summary, all genes consistent with
the formulated question are ordered on the
basis of the probability that the question is true.
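The idea of ranking genes by the probability that a question is true can be illustrated with a minimal sketch. It is not the ProbFAST model itself. It assumes, purely for illustration, that each gene's tag proportion in each library has an independent Beta posterior under a uniform Beta(1, 1) prior, and it estimates the probability of the ordering A > B > C by Monte Carlo sampling. The function names, gene names and counts are hypothetical.

```python
import random

def prob_ordered(counts, totals, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(p_1 > p_2 > ... > p_k | data) for one gene.

    counts[i] is the gene's tag count in library i and totals[i] is the
    library size. Each proportion gets an independent Beta posterior under
    a uniform Beta(1, 1) prior (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        draws = [rng.betavariate(1 + c, 1 + n - c) for c, n in zip(counts, totals)]
        # The question is "true" for this draw if the proportions are strictly decreasing.
        if all(a > b for a, b in zip(draws, draws[1:])):
            hits += 1
    return hits / n_draws

def rank_genes(table, totals, **kwargs):
    """Return (gene, probability) pairs sorted by decreasing probability."""
    scored = [(gene, prob_ordered(counts, totals, **kwargs))
              for gene, counts in table.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical tag counts in libraries A, B and C (not real data).
totals = [50000, 50000, 50000]
table = {
    "geneX": [120, 60, 20],  # clear A > B > C trend
    "geneY": [40, 38, 36],   # weak trend
    "geneZ": [10, 50, 90],   # opposite trend
}
ranking = rank_genes(table, totals)
for gene, p in ranking:
    print(f"{gene}\t{p:.3f}")
```

In this sketch the whole ordering is scored at once for each gene, so there is a single probability to rank by instead of a conjunction of separate pairwise test results.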
We present a web-based system named Probabilistic
Functional Analysis System Tool (ProbFAST) that
permits MDE analysis on a global scale. This tool
differs from others [8-11] by permitting the investigator
to analyze global gene expression in different
biological conditions using private and/or public data,
integrating the results with functional information
from Gene Ontology [12], KEGG [13] and BioCarta
[14]. Within this context, the tool becomes useful for the
identification of genes related to biological processes that are
active during cell differentiation and growth, as well
as during organogenesis. ProbFAST is designed primarily
for sequencing-based data, including data from
next-generation sequencing technology.
Implementation
Design functionality
ProbFAST uses a client-server
architecture [Additional file 1: Supplemental Figure 1]. The
backend consists of a set of MySQL [15] relational tables that
store functional information extracted from the KEGG,
BioCarta and Gene Ontology repositories. Furthermore,
all expression data in the Gene Expression Omnibus
(GEO) [16] generated by counting-based techniques are
stored, including 1,800 SAGE and MPSS libraries from
approximately 40 species. All databases are updated
monthly, ensuring access to the most recent
information. The server side is composed of three main interfaces
that enable remote use with convenient data uploading
and result visualization features.
The analysis starts with a user-friendly interface for
entering the project name and the parameters for
preprocessing and uploading libraries (Figure 1). In the
upload process, two options are available: 1) import data
from GEO, where a search interface displays a list of
expression profile experiments filtered by organism and
keywords, and 2) upload a new
experiment that is not included in the GEO database. For
the second option, the user needs to submit a file in a predefined
format (detailed information on the file format is available on
the help page). The file may be uploaded compressed in
gz, zip, or rar format. The gene identifiers supported by
ProbFAST include NCBI ID, gene symbol, tag sequence
or UniGene accession.
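As a rough illustration of how a compressed library file might be read before analysis, the sketch below parses a plain, gz or zip upload into a table of counts. The actual ProbFAST file format is described only on its help page, so the two-column, tab-separated layout (identifier, count) is an assumption, and `read_counts` is a hypothetical helper. The rar format is not handled here because the Python standard library has no reader for it.

```python
import gzip
import io
import zipfile

def read_counts(data: bytes, filename: str) -> dict:
    """Parse an uploaded library into {identifier: count}.

    Assumes a hypothetical tab-separated layout: identifier<TAB>count,
    with optional comment lines starting with '#'.
    """
    if filename.endswith(".gz"):
        text = gzip.decompress(data).decode("utf-8")
    elif filename.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Read the first member of the archive.
            text = archive.read(archive.namelist()[0]).decode("utf-8")
    else:
        text = data.decode("utf-8")

    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        identifier, value = line.split("\t")
        # Sum counts if an identifier appears more than once.
        counts[identifier] = counts.get(identifier, 0) + int(value)
    return counts

# Hypothetical library with a tag sequence and a gene symbol as identifiers.
raw = b"# identifier\tcount\nAAACCCGGGT\t12\nTP53\t7\n"
plain = read_counts(raw, "library.txt")
gzipped = read_counts(gzip.compress(raw), "library.txt.gz")

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("library.txt", raw)
zipped = read_counts(buffer.getvalue(), "library.zip")

print(plain)
```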
After the submission, users must formulate question(s)
by a comprehensive frame box and define the parameters
for enrich (...truncated)