Keyword Search over Probabilistic XML Documents Based on Node Classification
Yue Zhao, Ye Yuan, and Guoren Wang
College of Information Science and Engineering, Northeastern University, Liaoning, Shenyang 110819, China
Received 22 August 2014; Revised 31 October 2014; Accepted 31 October 2014
Academic Editor: Amaury Lendasse
Copyright © 2015 Yue Zhao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
This paper describes a keyword search method for probabilistic XML data based on ELM (extreme learning machine). Keyword search over a probabilistic XML document differs from keyword search over a traditional XML document in that possible world semantics must be taken into account. A probabilistic XML document can be seen as a set of nodes consisting of ordinary nodes and distributional nodes. ELM performs well in text classification applications. XML is typical semistructured data, and its labels are self-describing, so the label and context of a node can be treated as the text data of that node. ELM offers significant advantages such as fast learning speed, ease of implementation, and effective node classification. Once the nodes have been classified by ELM, set intersection can compute SLCA results quickly over the resulting node sets. In this paper, we adopt ELM to classify nodes and compute probabilities. We propose two algorithms, based on ELM and on a probability threshold, to improve overall performance. The experimental results verify the benefits of our methods according to various evaluation metrics.
1. Introduction
Traditional databases manage only deterministic information, but many applications, such as information extraction, information integration, and web data mining, involve uncertain data. Because the XML data model is flexible, it allows a natural representation of uncertain data. Many probabilistic XML models have been designed and analyzed [1–4]. This paper adopts a popular probabilistic XML model [5], which is discussed in [6]. In this model, a probabilistic XML document (called a p-document) is a labeled tree with two types of nodes: ordinary nodes and distributional nodes. An ordinary node represents actual data, and a distributional node represents the probability distribution of its child nodes. There are two types of distributional nodes, IND and MUX. The children of an IND node are independent of each other, while the children of a MUX node are mutually exclusive; that is, at most one child can exist in a random instance document (a possible world). A real number from (0, 1] is attached to each edge in the XML tree, indicating the conditional probability that the child node appears given that its parent node exists. It follows from the definition of a MUX node that the existence probabilities of its children sum to at most 1.
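As an illustration only, the model described above can be sketched in Python. The class `PNode` and the helpers `existence_probability` and `mux_is_valid` are hypothetical names introduced here, not part of the cited model or of the authors' implementation. The sketch assumes that the probability of a node appearing in a random possible world is the product of the conditional probabilities on the path from the root, and that edges leading into distributional nodes carry probability 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class PNode:
    """Hypothetical p-document node; not the authors' data structure."""
    label: str
    kind: str = "ORD"      # "ORD" (ordinary), "IND", or "MUX"
    prob: float = 1.0      # conditional probability given that the parent exists
    children: List["PNode"] = field(default_factory=list, repr=False)
    parent: Optional["PNode"] = field(default=None, repr=False)

    def add(self, child: "PNode") -> "PNode":
        child.parent = self
        self.children.append(child)
        return child


def existence_probability(node: PNode) -> float:
    """Multiply the edge probabilities on the path from the root to `node`."""
    p = 1.0
    current: Optional[PNode] = node
    while current is not None:
        p *= current.prob
        current = current.parent
    return p


def mux_is_valid(node: PNode, eps: float = 1e-9) -> bool:
    """Children of a MUX node are mutually exclusive, so their probabilities sum to at most 1."""
    if node.kind != "MUX":
        return True
    return sum(child.prob for child in node.children) <= 1.0 + eps


# Toy document (made-up values): a book whose title is one of two
# alternatives (MUX); the first alternative may independently carry a note (IND).
root = PNode("book")
mux = root.add(PNode("MUX", kind="MUX"))
title_a = mux.add(PNode("title: XML", prob=0.6))
title_b = mux.add(PNode("title: XQuery", prob=0.3))
ind = title_a.add(PNode("IND", kind="IND"))
note = ind.add(PNode("note", prob=0.5))
```

In this toy document the MUX children sum to 0.9, so with probability 0.1 neither title appears, and the note exists only in worlds where the first title does.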
Keyword search has been widely applied to XML data and is considered an effective information discovery method for querying it. Users do not need to know the underlying data structures or a complex query language beforehand, so keyword search is an easy way for ordinary users to retrieve information. Keyword search on XML data differs from keyword search on text data: the result is a subtree rooted at a common ancestor node rather than a whole text document. Over the past years, several definitions of the common ancestor node have been proposed, such as LCA (lowest common ancestor), SLCA (smallest LCA), and ELCA (exclusive LCA). These definitions are used to capture the user's query intention. SLCA and ELCA are subsets of LCA obtained by adding restrictions. In many cases, the size of the result set determines the accuracy of the query. This paper selects SLCA nodes as the roots of result subtrees because the SLCA node set is the smallest among all LCA-based definitions.
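To make the SLCA notion concrete, the following Python sketch computes SLCA nodes on a deterministic XML tree by brute force. It is an illustration under stated assumptions, not the algorithm proposed in this paper: nodes are identified by Dewey codes (tuples of integers, an assumed encoding), the function names `lca`, `is_ancestor`, and `slca` are hypothetical, and probabilities are ignored. It uses the standard definition that an SLCA node contains all keywords in its subtree and has no descendant that also does.

```python
from itertools import product
from typing import List, Sequence, Tuple

Dewey = Tuple[int, ...]


def lca(deweys: Sequence[Dewey]) -> Dewey:
    """Lowest common ancestor = longest common prefix of the Dewey codes."""
    prefix = []
    for parts in zip(*deweys):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)


def is_ancestor(a: Dewey, b: Dewey) -> bool:
    """True if `a` is a proper ancestor of `b`."""
    return len(a) < len(b) and b[:len(a)] == a


def slca(keyword_lists: Sequence[Sequence[Dewey]]) -> List[Dewey]:
    """Brute force: take the LCA of every combination of one match per
    keyword, then keep only the LCAs that have no descendant LCA."""
    candidates = {lca(combo) for combo in product(*keyword_lists)}
    return sorted(
        c for c in candidates
        if not any(is_ancestor(c, other) for other in candidates)
    )


# Toy data (made up): two books under root (1,).
# Keyword "XML" matches the titles at (1,1,1) and (1,2,1);
# keyword "Wang" matches the author at (1,1,2) and a node at (1,3).
xml_matches = [(1, 1, 1), (1, 2, 1)]
wang_matches = [(1, 1, 2), (1, 3)]
result = slca([xml_matches, wang_matches])
```

Here the root (1,) is an LCA of several match combinations, but it is discarded because its descendant (1, 1) already contains both keywords. The brute-force enumeration is exponential in the number of keywords and is meant only to show the definition.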
It is known that neural networks and SVM (support vector machine) have played dominant roles among the numerous computational intelligence techniques. However, they face three challenging issues: slow learning speed, trivial human intervention, and poor computational scalability. ELM [7, 8], an emergent technology, works for generalized single-hidden-layer feedforward networks (SLFNs). ELM [9–12] performs well in classification applications and can be used to classify nodes before querying XML data. Classification is considered an important cognitive computation task [13–16]. An XML data tree can be seen as the set of all its nodes, including the root node (only one), connected nodes, and leaf nodes. A connected node has exactly one parent node and one or more child nodes. Keywords usually appear in leaf nodes or in the parent node of a leaf node. So, the classific (...truncated)