Visualization of medical rule-based knowledge bases
JOURNAL OF MEDICAL INFORMATICS & TECHNOLOGIES Vol. 24/2015, ISSN 1642-6037
data mining, medical knowledge bases,
cluster visualization, hierarchical clustering, treemaps
Agnieszka NOWAK-BRZEZIŃSKA¹, Tomasz RYBOTYCKI¹
This work discusses clustering as a method of extracting knowledge from real-world
data. The authors propose a hierarchical clustering method and a visualization technique for
knowledge base representation in the context of medical knowledge bases, for which data mining
techniques are successfully employed and may resolve different problems. The authors also
analyze the impact of different clustering parameters on the result of searching through such a structure.
Particular attention is given to the problem of cluster visualization. The authors review selected
two-dimensional approaches, stating their advantages and drawbacks in the context of representing
complex cluster structures.
1. INTRODUCTION
In the domain of Decision Support Systems and Data Mining, the last decade brought
a significant development of new algorithms, tools and applications. Knowledge bases
(KBs) are constantly increasing in volume, so the knowledge stored as a set of rules or
patterns is getting progressively more complex and much harder to interpret or analyze. Recent
advances in the field of artificial intelligence have led to the emergence of expert systems,
computational tools designed to capture and make available the knowledge of domain experts.
The number of medical expert systems is growing and, thanks to progress in key areas such as
knowledge acquisition, model-based reasoning and system integration for clinical environments,
their efficiency is improving every day. It is essential for physicians to understand the
current state of such research, as well as the remaining theoretical and logistic barriers, before
the full potential of these systems can be used and new patterns can be discovered. Among many
other methods, doctors can use the visualization and analysis of medical data to
extract new and potentially hidden knowledge, both common and unusual. The extraction
and discovery of knowledge hidden in data have become particularly important in recent
years, especially given the constantly growing amount of information
stored in databases and data warehouses. The data is collected because it can potentially be the
source of previously unknown and useful correlations, anomalies and trends [4]. However, the
discovered patterns, expressed in the form of an analytical model, may possess a complicated
structure, which hinders further analysis. The excessive amount
of available information is not the only factor that makes research difficult. A more important factor is the
complicated structure of the data, both in terms of high dimensionality and of the data types used. In this
1
Institute of Computer Science, University of Silesia, 39 Bedzińska Str., 41-200 Sosnowiec, Poland
CLASSIFICATION
paper a specific type of knowledge representation, rules (denoted as Horn clauses), is
considered. Unfortunately, if we use (possibly different) tools for automatic acquisition
and/or extraction of rules, their number grows rapidly. For modern problems, a KB can
contain hundreds or thousands of rules. For such KBs, the number of possible inference
paths is enormous. In such cases a knowledge engineer cannot be sure that all possible
rule interactions are legal and lead to the expected results. The large size of a KB causes problems
with inference efficiency and with the interpretation of inference results. Even for a domain expert it is
difficult to analyze the presented knowledge if the number of elements to analyze is too large.
In such cases clustering the rules and visualizing the resultant structure can be helpful.
That is why the authors propose a method of reorganizing the KB from a set of
unrelated rules into groups of similar rules (using cluster analysis methods). Besides the information
about the rules in each cluster, a visualization of the clusters is generated. Such a representation
of a KB, especially in specific areas (like medicine), seems to be very helpful for an expert
exploring the given domain.
The paper consists of 6 sections. Section 1 presents the motivation behind the authors'
scientific goals. The idea of cluster analysis for rules
in a KB is described in Section 2. Section 3 presents methods of visualizing
a hierarchical data structure. Section 4 describes the software created by the
authors to group the data and represent it graphically. The experiments
and the analysis of their results are covered in Section 5. Section 6 contains the summary.
2. HIERARCHICAL CLUSTERING ALGORITHM
Hierarchical clustering (or hierarchical cluster analysis) is one of many methods of cluster
analysis. It seeks to build a hierarchical structure of clusters. The most basic hierarchical clustering
algorithms merge only two clusters (or divide only one) during a single iteration step, and because of
that the resultant structure is tree-like. There are two types of hierarchical
clustering algorithms:
- agglomerative hierarchical clustering algorithms or AGNES (from agglomerative nesting),
- divisive hierarchical clustering algorithms or DIANA (from divisive analysis).
In divisive hierarchical clustering algorithms, all objects are initially members of
one default group. During every iteration step this basic group is divided into smaller groups
until the stop condition is met. These methods are used less often than agglomerative methods,
because finding an effective way to divide a cluster is a nontrivial task [6].
Agglomerative hierarchical clustering (AHC) algorithms take a different approach. At the beginning each object
is considered a cluster itself (or one may say that each object is placed within a cluster that
consists only of that object). During each iteration step clusters are merged with other clusters.
It can be said that these two types are the reverse of one another [5].
In this paper the following version of the classic (basic) agglomerative hierarchical clustering algorithm [6] was used.
1) Place each object in a separate cluster.
2) Build a similarity matrix for every pair of clusters.
3) Using the similarity matrix, find the most similar pair of clusters and merge them.
4) Update the similarity matrix.
5) If the stop condition is met, end the procedure.
6) Otherwise, repeat from step 3.
7) Return the structure built this way.
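The steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: rules are represented as sets of (attribute, value) pairs, similarity is measured with the Jaccard coefficient, clusters are compared by single linkage, and the stop condition is a target number of clusters (the hypothetical parameter `target_clusters`).

```python
from itertools import combinations


def jaccard(a, b):
    """Assumed similarity of two rules given as frozensets of (attribute, value) pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_similarity(c1, c2, objects, sim):
    """Single linkage (an assumption): similarity of the most similar pair of members."""
    return max(sim(objects[i], objects[j]) for i in c1 for j in c2)


def agglomerate(objects, sim=jaccard, target_clusters=1):
    """Return (clusters, trees): flat clusters of object indices and their merge trees."""
    target = max(1, target_clusters)
    # Step 1: place each object in a separate cluster (clusters hold object indices).
    clusters = {i: (i,) for i in range(len(objects))}
    trees = {i: i for i in clusters}  # merge history as nested tuples
    # Step 2: build the similarity matrix for every pair of clusters.
    matrix = {
        (a, b): cluster_similarity(clusters[a], clusters[b], objects, sim)
        for a, b in combinations(clusters, 2)
    }
    next_id = len(objects)
    # Steps 5 and 6: repeat until the (assumed) stop condition is met.
    while len(clusters) > target:
        # Step 3: find the most similar pair of clusters and merge them.
        a, b = max(matrix, key=matrix.get)
        merged = clusters.pop(a) + clusters.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        # Step 4: update the matrix: drop stale entries, add the new cluster.
        matrix = {p: s for p, s in matrix.items() if a not in p and b not in p}
        for other in clusters:
            matrix[(other, next_id)] = cluster_similarity(
                clusters[other], merged, objects, sim
            )
        clusters[next_id] = merged
        next_id += 1
    # Step 7: return the structure built this way.
    return list(clusters.values()), list(trees.values())


# Hypothetical rule premises, invented for this example only.
rules = [
    frozenset({("fever", "yes"), ("cough", "yes")}),
    frozenset({("fever", "yes"), ("cough", "yes"), ("age", "old")}),
    frozenset({("rash", "yes")}),
    frozenset({("rash", "yes"), ("itch", "yes")}),
]
flat_clusters, _ = agglomerate(rules, target_clusters=2)
_, full_tree = agglomerate(rules, target_clusters=1)
```

Because exactly two clusters are merged in each iteration, the nested tuples in `full_tree` form a binary tree, which is the tree-like structure mentioned earlier. Replacing `jaccard` with another function changes the similarity measure without touching the rest of the procedure.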
One of the greatest advantages of these kinds of algorithms is that they are independent of
how the similarity of objects is defined. There are many methods of specifying the resemblance (or
distance) of objects of different types [6]. In some cases complex objects consisting of numerical
and symbolic data are analyzed and it’s im (...truncated)