Clustering XML documents by patterns

Knowledge and Information Systems, Jan 2015

Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML’s semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework’s distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match other XML clustering approaches in terms of accuracy, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.



Maciej Piernik, Dariusz Brzezinski, Tadeusz Morzy

Applications of XML have inspired the development of hundreds of domain-specific languages, including, in information technology, SOAP (a message exchange protocol) [41]; in healthcare, CDA (part of the HL7 standard) [13]; in bioinformatics, PDBML (Protein Data Bank XML) [42]; and in mathematics, MathML (a mathematical notation language) [40]. Such a rapid expansion of this standard has led to the point where huge amounts of XML are being generated every day.
These data constitute a potentially important source of business and scientific knowledge, which, due to its size, requires automated processing. In order to automatically extract knowledge from such amounts of data, XML mining methods need to be employed [15]. One of the most important XML mining tasks is XML clustering, which partitions a dataset into groups of presumably similar documents. The key to successful clustering lies in the definition of a similarity measure. In the context of XML, there are three ways of measuring similarity between documents: omitting the structure (metadata, hierarchy) of documents and treating them as ordinary text documents [23,35], omitting the content of documents and relying solely on structure [6,24,26,28], or considering both structure and content [37,44,45]. However, it is believed that structural information contained in XML documents cannot be ignored and that algorithms dedicated to processing text documents are inappropriate for XML document clustering [10]. In this paper, we will focus on structural approaches.

1.1 Shortcomings of existing approaches

Given a dataset D of n objects, the purpose of clustering is to divide D into k groups of objects (clusters), such that objects within clusters are more similar to one another than to objects from different clusters. The most common categorization of clustering methods divides them into partitional and hierarchical approaches. Partitional clustering starts with an initial partitioning and iteratively improves it until reaching an algorithm-specific stop condition. Hierarchical clustering methods iteratively split or merge clusters and produce a hierarchy, called a dendrogram, which reflects the order in which clusters were split or merged. The most common hierarchical method is the agglomerative hierarchical clustering algorithm (AHC), which merges the two most similar clusters at each iteration.
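The agglomerative procedure described above can be illustrated with a minimal sketch. This is not the implementation used in the paper: the function names `closest_pair_distance` and `agglomerative`, the stopping rule (stop once k clusters remain), and the choice of the closest-pair distance between clusters are all assumptions made for illustration.

```python
from itertools import combinations


def closest_pair_distance(cluster_a, cluster_b, dist):
    """Distance between two clusters, taken as the distance of their closest pair.

    Assumption for this sketch: other cluster distances could be swapped in here.
    """
    return min(dist(x, y) for x in cluster_a for y in cluster_b)


def agglomerative(objects, k, dist):
    """Merge the two closest clusters repeatedly until only k clusters remain.

    Returns the final clusters and the list of merges in the order they
    happened, which is the information a dendrogram records.
    """
    clusters = [[obj] for obj in objects]  # start with one cluster per object
    merges = []
    while len(clusters) > k:
        # Find the indices (i < j) of the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: closest_pair_distance(clusters[ij[0]], clusters[ij[1]], dist),
        )
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected by the deletion
    return clusters, merges
```

For example, calling `agglomerative([1, 2, 3, 10, 11], 2, lambda x, y: abs(x - y))` groups the values into `[1, 2, 3]` and `[10, 11]`. Note that every iteration evaluates distances between pairs of objects, which is the kind of pairwise comparison that pattern-based approaches aim to avoid.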
The three most popular merging strategies are single link, complete link, and average link, in which the distance between clusters is computed as the closest, furthest, and average distance between objects, respectively.

In structural XML clustering, both partitional and hierarchical methods are used. The characteristic feature of structural clustering algorithms lies in the definition of the similarity measure used to group documents. One of the basic approaches to calculating document similarity is the tag-only approach, which measures the number of common tags between each pair of documents. However, for documents that differ mostly in structure rather than in tag counts, this approach gives a poor estimate of similarity. This shortcoming is illustrated in Example 1. For ease of presentation, we will discuss examples using the single link agglomerative hierarchical clustering algorithm; however, using complete link, average link, or a partitional method, such as k-means, would yield an identical result in terms of clustering quality.

Example 1 Let us consider a dataset consisti (...truncated)
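The shortcoming of the tag-only approach mentioned above can be made concrete with a small sketch. It is not the paper's Example 1, which is truncated in this preview. The two documents, the function names, and the reading of "number of common tags" as a multiset overlap of tag counts are assumptions for illustration. The root-to-leaf path comparison is included only as a simple structure-aware contrast.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text):
    """Count how many times each tag occurs in a document, ignoring structure."""
    return Counter(element.tag for element in ET.fromstring(xml_text).iter())


def common_tags(xml_a, xml_b):
    """Tag-only similarity: size of the multiset intersection of the tag counts."""
    return sum((tag_counts(xml_a) & tag_counts(xml_b)).values())


def root_to_leaf_paths(xml_text):
    """Collect the set of root-to-leaf tag paths, which does reflect structure."""
    paths = set()

    def walk(element, prefix):
        path = prefix + "/" + element.tag
        children = list(element)
        if not children:
            paths.add(path)  # a leaf closes one complete path
        for child in children:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


# Hypothetical documents: same tags, different nesting.
doc_a = "<lib><book><title/><author/></book></lib>"
doc_b = "<lib><author><book/><title/></author></lib>"
```

Here `common_tags(doc_a, doc_b)` equals `common_tags(doc_a, doc_a)`, so the tag-only measure cannot tell the two documents apart, while `root_to_leaf_paths(doc_a)` and `root_to_leaf_paths(doc_b)` share no path at all.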


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007%2Fs10115-015-0820-0.pdf

Maciej Piernik, Dariusz Brzezinski, Tadeusz Morzy. Clustering XML documents by patterns, Knowledge and Information Systems, 2016, pp. 185-212, Volume 46, Issue 1, DOI: 10.1007/s10115-015-0820-0