Refining a Taxonomy by Using Annotated Suffix Trees and Wikipedia Resources
Ann. Data. Sci. (2015) 2(1):61–82
DOI 10.1007/s40745-015-0032-1
Ekaterina Chernyak1 · Boris Mirkin1
Received: 29 September 2014 / Revised: 19 March 2015 / Accepted: 21 March 2015 /
Published online: 2 April 2015
© Springer-Verlag Berlin Heidelberg 2015
Abstract A step-by-step approach to taxonomy construction is presented. In the first
step, the upper-layer frame of the taxonomy is built manually according to educational
materials. In the subsequent steps, the frame is refined at a chosen topic using the Wikipedia
category tree and articles, both cleaned of noise. Our main tool is a naturally
defined string-to-text relevance score based on annotated suffix trees. The relevance
scoring is used in several tasks: (1) cleaning the Wikipedia tree or page set of noise;
(2) allocating Wikipedia categories to taxonomy topics; (3) deciding whether an allocated
category should be included as a child of the taxonomy topic, etc. The resulting
fragment of the taxonomy consists of three parts: the manually set upper-layer topic, the
adopted part of the Wikipedia category tree, and Wikipedia articles as leaves. Every
leaf is assigned a set of so-called descriptors; these are phrases explaining aspects of
the leaf topic. The method is illustrated by its application to two domains in the area
of Mathematics: (a) “Probability theory and mathematical statistics” and (b) “Numerical
mathematics” (both in Russian).
Keywords Taxonomy refinement · String-to-text relevance · Utilizing Wikipedia · Suffix tree
1 Higher School of Economics, National Research University, Moscow, Russian Federation

1 Introduction: Motivation and Background
A taxonomy of concepts in a knowledge domain, or hierarchical ontology, is a popular
computational instrument for representing, maintaining and using domain
knowledge [1–3]. A taxonomy is a rooted tree formalizing a hierarchy of subjects
in an applied domain. Such a tree corresponds to a generalizing relation between
the subjects, usually in the form “A is a B” or “A is part of B”. Automation of
taxonomy building is important for further progress in many areas of data analysis
and knowledge engineering, including computational text processing and improving
information retrieval [1,4,5]. In the authors’ work, domain taxonomies are used to
meaningfully map research results to them in order to explore research profiles [6],
annotate research papers [7], or measure the level of research results [8].
A definitive taxonomy of the domain of computer science is maintained by the
Association for Computing Machinery; the latest version of the ACM Computing
Classification System can be found at [9]. This classification is well balanced in that (a)
its nodes have approximately equal numbers of children, and (b) its branches have
approximately equal numbers of layers. However, there are not many domains for
which sound taxonomies are available. For example, when we decided to shift our
efforts from the computer science domain to mathematics, for the analysis of synopses
of courses in mathematics and related subjects in a Russian university, we discovered
a rather disappointing picture.
In Russian, the only publicly available taxonomy of mathematics and related
domains is the classification for the government-sponsored Abstracting Journal of
Mathematics [10], developed back in 1999. It is somewhat outdated and unbalanced.
For example, it lacks such topics as “Discrete mathematics”, “Formal concept
analysis” and “Mathematical economics”. It has 157 concepts rooted at the topic
“Differential equations” and only four topics rooted at “Game theory”. Therefore we
thought that we could develop a reasonable taxonomy of mathematics if we used the
instructional materials of the Russian Higher Attestation Commission (HAC). The HAC is
a governmental body that supervises the national system of PhD and ScD theses [11].
Its classifications are regularly updated and made publicly available as “passports of
specialties”; the list of specialties is revised once every decade or two. For the case of
Mathematics, the HAC classification is illustrated in Table 1. As one can see, it covers
just two layers of the mathematics domain, so one cannot use it in the analysis of a
university curriculum: more layers are needed to reach an adequate degree of
granularity of mathematical concepts.
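The two-layer structure of the HAC classification can be made concrete with a minimal sketch of a taxonomy as a rooted tree. The class name `Topic` and the helper `count_layers` are illustrative assumptions rather than anything defined in the paper; the topic names are those listed in Table 1.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """A taxonomy node: a subject together with its more specific child subjects."""
    name: str
    children: list = field(default_factory=list)  # list of Topic objects

def count_layers(topic):
    """Number of layers in the subtree rooted at `topic` (a single leaf counts as one)."""
    return 1 + max((count_layers(child) for child in topic.children), default=0)

# Upper-layer frame taken from Table 1 (HAC divisions of Mathematics).
hac_frame = Topic("Mathematics", [
    Topic("Real-valued, complex valued and functional analysis"),
    Topic("Differential equations and dynamic systems"),
    Topic("Mathematical problems in physics"),
    Topic("Geometry and topology"),
    Topic("Probability theory and mathematical statistics"),
    Topic("Mathematical logics, algebra and number theory"),
    Topic("Numerical mathematics"),
    Topic("Discrete mathematics and mathematical cybernetics"),
])

print(count_layers(hac_frame))   # layers in the frame
print(len(hac_frame.children))   # divisions directly under the root
```

In this representation, refining a leaf amounts to appending new `Topic` objects to its `children` list, which increases the number of layers along that branch only.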
This defines the problem we are going to address as a problem in taxonomy refinement.
We start with a manually set upper part of the taxonomy, a taxonomy frame
including the root subject, and then automatically refine the leaves of the taxonomy one
by one. Therefore, given a leaf subject, we need a method that would find appropriately
refined concepts and use them to grow the taxonomy. The problem of refinement of
taxonomy subjects has received some attention in the literature. A big question arising
before any refinement starts concerns the sources for generating refined topics. A naive
approach is to take a search engine such as Google and run a specially designed query
involving the leaf concept under consideration, “A”, such as “A consists of…” or “A
is a …” [12]. Such a query would lead to a set of concepts that can be considered
as potential subtopics for topic A. This works well if the ontology is represented by
means of a formal language, such as OWL, by introducing new logical relations [13].
Yet in a less formal context the approach leads to somewhat dubious and messy results.
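The query-based approach can be illustrated by a small sketch that only builds the query strings for a given leaf concept. The function name `build_queries` and the use of double quotes for exact-phrase matching are assumptions made for illustration; no search engine is actually called, and the patterns are the two mentioned above.

```python
# Query templates following the two patterns quoted in the text.
QUERY_PATTERNS = ('"{topic} consists of"', '"{topic} is a"')

def build_queries(topic):
    """Return exact-phrase query strings for one leaf concept."""
    return [pattern.format(topic=topic) for pattern in QUERY_PATTERNS]

for query in build_queries("Numerical mathematics"):
    print(query)
```

The candidate subtopics would then have to be extracted from whatever text follows these phrases in the retrieved documents, which is where the noise mentioned above enters.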
Table 1 The set of main mathematics divisions according to [11]. One can easily see differences from the
divisions in the classification of Mathematics subjects developed by the American Mathematical Society
[26]. For example, the field of computer science is represented here by Numerical mathematics, and
Combinatorics by Discrete mathematics and mathematical cybernetics
Mathematics
1  Real-valued, complex valued and functional analysis
2  Differential equations and dynamic systems
3  Mathematical problems in physics
4  Geometry and topology
5  Probability theory and mathematical statistics
6  Mathematical logics, algebra and number theory
7  Numerical mathematics
8  Discrete mathematics and mathematical cybernetics
The next idea is to use a manually designed universal taxonomy such as Wikipedia, so
that the choice of topics comes from a well-defined hierarchical structure openly
available on the Internet. Indeed, the idea of using Wikipedia as a major source of topics
for taxonomy building is becoming increasingly popular [12,14–16]. Wikipedia covers many
specific knowledge domains and offers many data types, such as unstructured texts,
images, category trees, revision history, redirect pages and links, etc. There are
several features making Wikipedia a unique and highly convenient tool for taxonomy
building [17]:
– Wik (...truncated)