Clustering XML documents by patterns
Maciej Piernik
Dariusz Brzezinski
Tadeusz Morzy
Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML's semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms that use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework's distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match other XML clustering approaches in terms of accuracy, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
applications of XML have inspired the development of hundreds of domain-specific
languages, including information technology: SOAP (message exchange protocol) [41],
healthcare: CDA (part of the HL7 standard) [13], bioinformatics: PDBML (Protein Data Bank
XML) [42], and mathematics: MathML (mathematical notation language) [40].
Such a rapid expansion of this standard has led to the point where huge amounts of XML are
being generated every day. These data constitute a potentially important source of business
and scientific knowledge and, due to their size, require automated processing. In order to
automatically extract knowledge from such amounts of data, XML mining methods need to
be employed [15]. One of the most important XML mining tasks is XML clustering, which
partitions a dataset into groups of presumably similar documents. The key to successful
clustering lies in the definition of a similarity measure.
In the context of XML, there are three ways of measuring similarity between documents:
omitting the structure (metadata, hierarchy) of documents and treating them as ordinary text
documents [23,35], omitting the content of documents and relying solely on structure [6,
24,26,28], or considering both structure and content [37,44,45]. However, it is believed
that structural information contained in XML documents cannot be ignored and algorithms
dedicated to processing text documents are inappropriate for XML document clustering [10].
In this paper, we will focus on structural approaches.
1.1 Shortcomings of existing approaches
Given a dataset D of n objects, the purpose of clustering is to divide D into k groups of objects
(clusters), such that objects within clusters are more similar to one another than to objects from
different clusters. The most common categorization of clustering methods divides them into
partitional and hierarchical approaches. Partitional clustering starts with an initial partitioning
and iteratively improves it until reaching an algorithm-specific stop condition. Hierarchical
clustering methods iteratively split/merge clusters and produce a hierarchy, called a
dendrogram, which reflects the order in which clusters were split/merged. The most common
hierarchical method is the agglomerative hierarchical clustering algorithm (AHC), which merges
the two most similar clusters at each iteration. The three most popular merging strategies are as
follows: single link, complete link, and average link, in which the distance between clusters
is computed as the closest, furthest, and average distance between objects, respectively.
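The three merging strategies can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `cluster_distance` and `ahc`, the naive quadratic search for the closest pair, and the one-dimensional toy data are all assumptions made for illustration.

```python
from itertools import combinations


def cluster_distance(a, b, dist, linkage="single"):
    """Distance between clusters a and b (lists of objects) under a linkage."""
    pairwise = [dist(x, y) for x in a for y in b]
    if linkage == "single":      # closest pair of objects
        return min(pairwise)
    if linkage == "complete":    # furthest pair of objects
        return max(pairwise)
    if linkage == "average":     # mean over all pairs of objects
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def ahc(objects, dist, k, linkage="single"):
    """Naive agglomerative clustering: merge the two closest clusters
    until only k clusters remain (hypothetical helper, not the paper's code)."""
    clusters = [[o] for o in objects]
    while len(clusters) > k:
        # combinations yields index pairs (i, j) with i < j
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], dist, linkage
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


def abs_diff(x, y):
    """Toy distance on numbers, used only for this illustration."""
    return abs(x - y)


points = [0.0, 1.0, 1.5, 10.0, 11.0]
result = ahc(points, abs_diff, k=2, linkage="single")
```

With a document distance in place of `abs_diff`, the same loop would apply to XML documents; only the `linkage` argument changes between the three strategies.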
In structural XML clustering, both partitional and hierarchical methods are used. The
characteristic feature of structural clustering algorithms lies in the definition of the similarity
measure used to group documents. One of the basic approaches to calculating document
similarity is the tag-only approach, which measures the number of common tags between
each pair of documents. However, for documents which differ mostly in structure rather than
tag counts, this approach gives a poor estimate of similarity between documents. This
shortcoming is illustrated in Example 1. For ease of presentation, we will discuss examples
using the single link agglomerative hierarchical clustering algorithm; however, using
complete link, average link, or a partitional method, such as k-means, would yield an identical
result in terms of clustering quality.
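As a rough illustration of why a tag-only measure can mislead, the following Python sketch counts the distinct tag names shared by two documents. It is only one possible reading of the tag-only approach (counting distinct names rather than occurrences is an assumption), and the helper names and the two toy documents are hypothetical rather than taken from the paper.

```python
import xml.etree.ElementTree as ET


def tag_set(xml_text):
    """Set of distinct tag names occurring anywhere in the document."""
    root = ET.fromstring(xml_text)
    return {element.tag for element in root.iter()}


def common_tags(doc_a, doc_b):
    """Tag-only similarity: number of distinct tag names shared by both documents."""
    return len(tag_set(doc_a) & tag_set(doc_b))


# Two toy documents with the same tags but a different nesting order.
doc1 = "<a><b><c/></b></a>"
doc2 = "<a><c><b/></c></a>"

same_doc_score = common_tags(doc1, doc1)
cross_doc_score = common_tags(doc1, doc2)
```

Here `doc2` receives the same score against `doc1` as `doc1` does against itself, so the measure cannot see that `b` and `c` are nested in the opposite order.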
Example 1 Let us consider a dataset consisti (...truncated)