Automatic Peak Selection by a Benjamini-Hochberg-Based Algorithm
Citation: Abbas A, Kong X-B, Liu Z, Jing B-Y, Gao X (
Ahmed Abbas
Xin-Bing Kong
Zhi Liu
Bing-Yi Jing
Xin Gao
Anna Tramontano, University of Rome, Italy
1 Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia, 2 Department of Statistics, Fudan University, Shanghai, China, 3 Department of Mathematics, Faculty of Science and Technology, University of Macau, Taipa, Macau, 4 Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong
A common issue in bioinformatics is that computational methods often generate a large number of predictions sorted according to certain confidence scores. A key problem is then determining how many predictions must be selected to include most of the true predictions while maintaining reasonably high precision. In nuclear magnetic resonance (NMR)-based protein structure determination, for instance, computational peak picking methods are becoming increasingly common, although expert knowledge remains the method of choice for determining how many peaks among thousands of candidate peaks should be taken into consideration to capture the true peaks. Here, we propose a Benjamini-Hochberg (B-H)-based approach that automatically selects the number of peaks. We formulate the peak selection problem as a multiple testing problem. Given a candidate peak list sorted by either volumes or intensities, we first convert the peaks into p-values and then apply the B-H-based algorithm to automatically select the number of peaks. The proposed approach is tested on state-of-the-art peak picking methods, including WaVPeak [1] and PICKY [2]. Compared with the traditional fixed-number-based approach, our approach returns significantly more true peaks. For instance, by combining WaVPeak or PICKY with the proposed method, the missing peak rates are on average reduced by 20% and 26%, respectively, in a benchmark set of 32 spectra extracted from eight proteins. The consensus of the B-H-selected peaks from both WaVPeak and PICKY achieves 88% recall and 83% precision, which significantly outperforms each individual method and the consensus method without the B-H algorithm. The proposed method can be used as a standard procedure for any peak picking method and can be straightforwardly applied to other prediction selection problems in bioinformatics. The source code, documentation and example data of the proposed method are available at http://sfb.kaust.edu.sa/pages/software.aspx.
Funding: This work was supported by Award No. GRP-CF-2011-19-P-Gao-Huang, a GMSV-OCRF award from King Abdullah University of Science and Technology,
and Hong Kong Research Grants Council grants HKUST6019/10P and HKUST6019/12P. The funders had no role in study design, data collection and analysis,
decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Many computational methods in bioinformatics generate a large
number of candidate solutions to a problem, some of which are
true predictions and some of which are false. Such predictions are
usually sorted according to certain confidence scores. For instance,
ab initio protein structure prediction methods sample tens of
thousands of three-dimensional models. The energy values are
calculated for each model based on a given energy function, where
lower values likely indicate better models. Another example is the
protein function annotation problem in which the amino acid
sequence or the domain architecture of a protein is given and the
Gene Ontology (GO) terms selected from among some 30,000 are
used to annotate the function.
In nuclear magnetic resonance (NMR)-based protein structure
determination, thousands of peaks are routinely predicted from the
input spectra in which there are usually tens to hundreds of true
signals. The peaks are sorted according to either their intensities or
estimated volumes. Both means of sorting, based on computational
methods, have common properties. First, a large number of
predictions are generated. Second, the predictions are scored by
the scoring functions of the methods. However, the scoring
functions are not powerful enough to distinguish true predictions
from the false ones. Third, it is important to discover most of the
true predictions while maintaining a reasonably low false positive
rate. Therefore, it is crucial to know how many predictions should
be selected in such scenarios.
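The selection question raised above can be illustrated with the standard Benjamini-Hochberg step-up rule, which the abstract names as the basis of the proposed approach. The following is only a minimal sketch under stated assumptions: the p-values are taken as already computed (the conversion from peak volumes or intensities to p-values is not shown here), and the function name `benjamini_hochberg_count` and the level `alpha = 0.05` are hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import List


def benjamini_hochberg_count(p_values: List[float], alpha: float = 0.05) -> int:
    """Return how many of the smallest p-values the B-H step-up rule selects.

    Assumption: each candidate prediction already has a p-value, where a
    smaller p-value means stronger evidence that the prediction is true.
    """
    sorted_p = sorted(p_values)
    m = len(sorted_p)
    selected = 0
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / m,
    # then select all predictions ranked 1..k.
    for rank, p in enumerate(sorted_p, start=1):
        if p <= alpha * rank / m:
            selected = rank
    return selected


# Hypothetical p-values for ten candidate predictions.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg_count(example))  # selects the 2 smallest p-values
```

Note that the rule keeps the largest rank that passes its threshold, so a prediction that fails its own threshold can still be selected when a lower-ranked one passes. The number of selected predictions is therefore determined by the data rather than fixed in advance.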
Peak picking is one of the key problems in the NMR protein
structure determination process [3–5]. The problem is defined as
follows: given any NMR spectrum or a set of spectra, select the
true signals, i.e., peaks, while filtering the false ones. Typically, true
peaks are assumed to have Gaussian-like shapes and high
intensities so that they can be easily differentiated from false ones.
However, there are two main factors that make the peak picking
problem difficult. On the one hand, depending on the quality of
the protein sample, the property of the target protein and local
dynamics, there can be a number of weak peaks, i.e., peaks with
low intensities or volumes. That is, if we sort the predicted peaks
by volumes or intensities, there is no clear cutoff threshold to
distinguish true peaks from false ones. These peaks are difficult to
identify even by manual processes. This is why computational
methods are useful. On the other hand, due to the various sources
of noise in NMR spectra, such as water bands and artifacts, false
peaks can have high intensities or volumes. The group of sorted
peaks therefore comprises a mixture of true peaks and false
ones, where most of the true peaks tend to be ranked higher with a
few strong, false peaks also included. It is extremely difficult, if not
impossible, to select only the true peaks and eliminate all the false
ones. In NMR structure determination, a missing true peak may
cause all the follow-up procedures to fail, whereas a false peak can
still be eliminated later [6–9]. Therefore, an ideal method should
identify almost all the true peaks while maintaining reasonably
high precision.
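The trade-off described above, between capturing almost all true peaks and keeping precision reasonably high, can be made concrete with a small calculation. The sketch below assumes a ranked candidate list in which each entry is already labelled true or false (for example, against a manually curated reference list); the function name `precision_recall_at_k` and the example labels are hypothetical and serve only as illustration.

```python
from typing import List, Tuple


def precision_recall_at_k(is_true: List[bool], k: int, total_true: int) -> Tuple[float, float]:
    """Precision and recall when the top-k ranked candidates are selected.

    is_true    -- labels of the candidates, ordered from highest to lowest score
    k          -- number of top-ranked candidates to select
    total_true -- number of true peaks in the reference list
    """
    selected = is_true[:k]
    hits = sum(selected)  # True counts as 1
    precision = hits / len(selected) if selected else 0.0
    recall = hits / total_true if total_true else 0.0
    return precision, recall


# Hypothetical ranking: true peaks tend to rank high, with one strong false peak.
ranked = [True, True, False, True, False, False, True, False]
print(precision_recall_at_k(ranked, k=4, total_true=4))  # (0.75, 0.75)
print(precision_recall_at_k(ranked, k=8, total_true=4))  # (0.5, 1.0)
```

In this toy ranking, a larger cutoff recovers the weak true peak at rank seven at the cost of lower precision, which is why the choice of cutoff matters.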
The peak picking problem has been studied for more than two
decades. A variety of computational methods have been proposed
[1,2,10–19]. The existing methods can be classified into two
categories according to the de-noising method. Included in the first
category are hard threshold-based approaches. For instance,
PICKY [2] assumes that the noise is white Gaussian and estimates
the noise level in small regions that do not contain signals. The
data points that have lower intensities than the estimated noise
level are eliminated from the spectra. Singular value
decomposition is applied to the connected components of the remainder of
the spectra to yield one-dimensional lineshapes. The peaks are
identified in each lineshap (...truncated)