Automatic Peak Selection by a Benjamini-Hochberg-Based Algorithm

https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0053112&type=printable

Citation: Abbas A, Kong X-B, Liu Z, Jing B-Y, Gao X. Automatic Peak Selection by a Benjamini-Hochberg-Based Algorithm.

Ahmed Abbas, Xin-Bing Kong, Zhi Liu, Bing-Yi Jing, Xin Gao

Editor: Anna Tramontano, University of Rome, Italy

1 Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
2 Department of Statistics, Fudan University, Shanghai, China
3 Department of Mathematics, Faculty of Science and Technology, University of Macau, Taipa, Macau
4 Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong

A common issue in bioinformatics is that computational methods often generate a large number of predictions sorted according to certain confidence scores. A key problem is then determining how many predictions must be selected to include most of the true predictions while maintaining reasonably high precision. In nuclear magnetic resonance (NMR)-based protein structure determination, for instance, computational peak picking methods are becoming more and more common, although expert knowledge remains the method of choice for determining how many of the thousands of candidate peaks should be considered in order to capture the true peaks. Here, we propose a Benjamini-Hochberg (B-H)-based approach that automatically selects the number of peaks. We formulate the peak selection problem as a multiple testing problem. Given a candidate peak list sorted by either volumes or intensities, we first convert the peaks into p-values and then apply the B-H-based algorithm to automatically select the number of peaks. The proposed approach is tested on state-of-the-art peak picking methods, including WaVPeak [1] and PICKY [2]. Compared with the traditional fixed number-based approach, our approach returns significantly more true peaks.
For instance, by combining WaVPeak or PICKY with the proposed method, the missing peak rates are on average reduced by 20% and 26%, respectively, in a benchmark set of 32 spectra extracted from eight proteins. The consensus of the B-H-selected peaks from both WaVPeak and PICKY achieves 88% recall and 83% precision, which significantly outperforms each individual method and the consensus method without the B-H algorithm. The proposed method can be used as a standard procedure for any peak picking method and can be applied straightforwardly to some other prediction selection problems in bioinformatics. The source code, documentation and example data of the proposed method are available at http://sfb.kaust.edu.sa/pages/software.aspx.

Funding: This work was supported by Award No. GRP-CF-2011-19-P-Gao-Huang, a GMSV-OCRF award from King Abdullah University of Science and Technology, and Hong Kong Research Grants Council grants HKUST6019/10P and HKUST6019/12P. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Many computational bioinformatics methods generate a large number of candidate solutions to a problem, among which are both true and false predictions. Such predictions are usually sorted according to certain confidence scores. For instance, ab initio protein structure prediction methods sample tens of thousands of three-dimensional models. An energy value is calculated for each model based on a given energy function, where lower values likely indicate better models. Another example is the protein function annotation problem, in which the amino acid sequence or the domain architecture of a protein is given and Gene Ontology (GO) terms, selected from among some 30,000, are used to annotate its function.
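The selection step summarized in the abstract, in which sorted candidate peaks are converted to p-values and passed to the B-H procedure, can be illustrated with a minimal Python sketch of the standard B-H step-up rule. The function name `bh_select`, the significance level `alpha`, and the example p-values are assumptions made for illustration. How the paper converts volumes or intensities into p-values is not specified in this part of the text, so the p-values are simply taken as given.

```python
from typing import Sequence


def bh_select(p_values: Sequence[float], alpha: float = 0.05) -> int:
    """Return how many of the smallest p-values the B-H step-up rule keeps.

    Illustrative sketch only: `alpha` and the input p-values are assumptions,
    not parameters or data taken from the paper.
    """
    ranked = sorted(p_values)
    m = len(ranked)
    keep = 0
    for i, p in enumerate(ranked, start=1):
        # Step-up rule: remember the largest rank i with p_(i) <= i * alpha / m.
        if p <= i * alpha / m:
            keep = i
    return keep


# Hypothetical p-values for eight candidate peaks (not data from the paper).
p_values = [0.001, 0.004, 0.012, 0.02, 0.03, 0.2, 0.45, 0.9]
n_selected = bh_select(p_values, alpha=0.05)
print(n_selected)
```

Here `n_selected` is 5: the five smallest p-values satisfy the step-up condition, so the top five candidates would be kept and the rest discarded.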
In nuclear magnetic resonance (NMR)-based protein structure determination, thousands of peaks are routinely predicted from the input spectra, in which there are usually tens to hundreds of true signals. The peaks are sorted according to either their intensities or estimated volumes. Such prediction problems share three properties. First, a large number of predictions are generated. Second, the predictions are scored by the scoring functions of the methods, but these scoring functions are not powerful enough to distinguish true predictions from false ones. Third, it is important to discover most of the true predictions while maintaining a reasonably low false positive rate. Therefore, it is crucial to know how many predictions should be selected in such scenarios.

Peak picking is one of the key problems in the NMR protein structure determination process [3-5]. The problem is defined as follows: given any NMR spectrum or a set of spectra, select the true signals, i.e., peaks, while filtering out the false ones. Typically, true peaks are assumed to have Gaussian-like shapes and high intensities so that they can be easily differentiated from false ones. However, two main factors make the peak picking problem difficult. On the one hand, depending on the quality of the protein sample, the properties of the target protein and local dynamics, there can be a number of weak peaks, i.e., peaks with low intensities or volumes. That is, if we sort the predicted peaks by volumes or intensities, there is no clear cutoff threshold to distinguish true peaks from false ones. These weak peaks are difficult to identify even manually, which is why computational methods are useful. On the other hand, due to the various sources of noise in NMR spectra, such as water bands and artifacts, false peaks can have high intensities or volumes.
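To make the cutoff question concrete, the following sketch computes precision and recall when only the top k entries of a ranked list are kept. The labels, the function name `precision_recall_at_k`, and the cutoffs are hypothetical. In practice, the true or false status of each peak is exactly what is unknown.

```python
from typing import List, Tuple


def precision_recall_at_k(is_true: List[bool], k: int) -> Tuple[float, float]:
    """Precision and recall when the top-k entries of a ranked list are kept."""
    k = min(k, len(is_true))          # cannot keep more entries than exist
    total_true = sum(is_true)         # number of true peaks in the whole list
    hits = sum(is_true[:k])           # true peaks among the kept entries
    precision = hits / k if k > 0 else 0.0
    recall = hits / total_true if total_true > 0 else 0.0
    return precision, recall


# Hypothetical ranked list: True marks a real peak, False a false one.
ranked = [True, True, False, True, True, False, False, True, False, False]
for k in (4, 8):
    p, r = precision_recall_at_k(ranked, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

With this toy list, k = 4 misses two of the five true peaks, whereas k = 8 recovers all five at the cost of three false ones. This is the trade-off that a fixed-number cutoff has to guess at.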
The sorted peak list therefore comprises a mixture of true and false peaks, in which most of the true peaks tend to be ranked higher but a few strong false peaks are also included. It is extremely difficult, if not impossible, to select only the true peaks and eliminate all the false ones. In NMR structure determination, a missing true peak may cause all the follow-up procedures to fail, whereas a false peak can still be eliminated later [6-9]. Therefore, an ideal method should identify almost all the true peaks while maintaining reasonably high precision.

The peak picking problem has been studied for more than two decades, and a variety of computational methods have been proposed [1,2,10-19]. The existing methods can be classified into two categories according to the de-noising method. The first category comprises hard threshold-based approaches. For instance, PICKY [2] assumes that the noise is white Gaussian and estimates the noise level in small regions that do not contain signals. The data points that have lower intensities than the estimated noise level are eliminated from the spectra. Singular value decomposition is applied to the connected components of the remainder of the spectra to yield one-dimensional lineshapes. The peaks are identified in each lineshap (...truncated)