MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure

PLOS ONE, Nov 2019

A significant approach for the discovery of biological regulatory rules of genes, protein and their inheritance relationships is the extraction of meaningful patterns from biological sequence data. The existing algorithms of sequence pattern discovery, like MSPM and FBSB, suffice their low efficiency and accuracy. In order to deal with this issue, this paper presents a new algorithm for biological sequence pattern mining abbreviated MpBsmi based on the data index structure. The MpBsmi algorithm employs a sequence position table abbreviated ST and a sequence database index structure named DB-Index for data storing, mining and pattern expansion. The ST and DB-Index of single items are firstly obtained through scanning sequence database once. Then a new algorithm for fast support counting is developed to mine the table ST to identify the frequent single items. Based on a connection strategy, the frequent patterns are expanded and the expanded table ST is updated by scanning the DB-Index. The fast support counting algorithm is used for obtaining the frequent expansion patterns. Finally, a new pruning technique is developed for extended pattern to avoid the generation of unnecessarily large number of candidate patterns. The experiments results on multiple classical protein sequences from the Pfam database validate the performance of the proposed algorithm including the accuracy, stability and scalability. It is showed that the proposed algorithm has achieved the better space efficiency, stability and scalability comparing with MSPM, FBSB which are the two main algorithms for biological sequence mining.

A PDF file should load here. If you do not see its contents the file may be temporarily unavailable at the journal website or you do not have a PDF plug-in installed and enabled in your browser.

Alternatively, you can download the file locally and open with any standalone PDF reader:

https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0195601&type=printable

MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure

April MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure Weina Li 0 1 2 Jiadong Ren 0 1 2 0 Data Availability Statement: Pfam 31.0. Http:// pfam.xfam.org/family/browse 1 College of Information Science and Engineering, Yanshan University , Qinhuangdao, Hebei P.R.China, 066004 , 2 The Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province , Qinhuangdao City , P.R. China, 066004 , 3 The Key Laboratory for Software Engineering of Hebei Province , Qinhuangdao City , P.R. China, 066004 2 Editor: Xiangtao Li, Northeast Normal University , CHINA A significant approach for the discovery of biological regulatory rules of genes, protein and their inheritance relationships is the extraction of meaningful patterns from biological sequence data. The existing algorithms of sequence pattern discovery, like MSPM and FBSB, suffice their low efficiency and accuracy. In order to deal with this issue, this paper presents a new algorithm for biological sequence pattern mining abbreviated MpBsmi based on the data index structure. The MpBsmi algorithm employs a sequence position table abbreviated ST and a sequence database index structure named DB-Index for data storing, mining and pattern expansion. The ST and DB-Index of single items are firstly obtained through scanning sequence database once. Then a new algorithm for fast support counting is developed to mine the table ST to identify the frequent single items. Based on a connection strategy, the frequent patterns are expanded and the expanded table ST is updated by scanning the DB-Index. The fast support counting algorithm is used for obtaining the frequent expansion patterns. Finally, a new pruning technique is developed for extended pattern to avoid the generation of unnecessarily large number of candidate patterns. The experiments results on multiple classical protein sequences from the Pfam database validate the performance of the proposed algorithm including the accuracy, stability and scalability. It is showed that the proposed algorithm has achieved the better space efficiency, stability and scalability comparing with MSPM, FBSB which are the two main algorithms for biological sequence mining. Introduction Biological sequence is an important component of bioinformatics data, generally including three categories: DNA sequence, RNA sequence and protein sequence [1]. Since the human genome project was completed in 2003, we have seen an explosive growth in bioinformatics data. By April 2017, there are 200877884 sequences in the GenBank [2] database. Since the release in 1982, the base number in GenBank has doubled by about every 18 months. Sequential pattern mining is an important method used for discovering frequent patterns and association rules in the data mining field [3], providing an effective way to find important rules of biological sequences. Biological sequence patterns can be used to predict human diseases and provide evidences for artificial nucleotides, artificial proteins and so on. Biological sequence pattern mining has become a hot research direction in recent years [4]. Frequent pattern mining is an important part of sequential pattern mining. Frequent pattern mining algorithms [5] contain two main categories, the Apriori based algorithm and the FP-growth based algorithm. The Apriori algorithm proposed by Agrawal et al [6] is the most commonly used in the association rule discovery. The frequent pattern mining has appeared in a variety of applications such as sequence pattern mining [7], structural mining [8], classification association mining [9], frequent pattern based on clustering [10] and so on. A number of sequential pattern mining algorithms have been proposed during the past years, such as incremental mining [11], top-k sequence pattern mining [12], maximum sequence pattern mining [13], constraint sequence pattern [14], weighted sequence pattern mining [15], closed sequence pattern mining [16±17]. To improve the efficiency and accuracy Oza K et al. [18] proposed an algorithm for regular expression constraint, weight constraint and length constraint to solve the problems of user interests, optimization for support threshold and accuracy. Xue F et al. [19] improved the PrefixSpan algorithm [20], and then proposed PrefixSpan-x to reduce unnecessary memory usage. Kemmar A et al. [21] proposed a top-k sequence pattern algorithm based on prefix projection and global constraints. This global constraint can take into account the quantity, item relation and regular expression and so on easily. The classical sequence pattern algorithm is the footstone of biological sequence mining. The current sequence pattern algorithms are mainly improved from efficiency and precision. With the improvement of computer hardware performance, the algorithms with high memory utilization and changing time with space emerge. Lin et al. [22] proposed a fast sequence pattern algorith (...truncated)


This is a preview of a remote PDF: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0195601&type=printable

Weina Li, Jiadong Ren. MpBsmi: A new algorithm for the recognition of continuous biological sequence pattern based on index structure, PLOS ONE, 2018, Volume 13, Issue 4, DOI: 10.1371/journal.pone.0195601