Classification of Multivariate Time Series and Structured Data Using Constructive Induction
MOHAMMED WALEED KADOUS, CLAUDE SAMMUT
Eamonn Keogh
School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
We present a method of constructive induction aimed at learning tasks involving multivariate time series data. Using metafeatures, the scope of attribute-value learning is expanded to domains with instances that have some kind of recurring substructure, such as strokes in handwriting recognition, or local maxima in time series data. The types of substructures are defined by the user, but are extracted automatically and are used to construct attributes. Metafeatures are applied to two real domains: sign language recognition and ECG classification. Using metafeatures we are able to generate classifiers that are either comprehensible or accurate, producing results that are comparable to hand-crafted preprocessing and comparable to human experts.
1. Introduction
There are many domains that do not easily fit into the static attribute-value model so
common in machine learning. These include multivariate time series, optical character
recognition, sequence recognition, basket analysis and web logs. Consequently, researchers
hoping to apply attribute-value learners to these domains have few choices: apply
hand-crafted preprocessing, write a learner specifically designed for the domain, or use a learner
with a more powerful representation, such as relational learning or graph-based induction.
However, each of these has problems. Hand-crafted preprocessing is frequently used, but
is time-consuming and requires in-depth domain knowledge. Writing a custom learner is
possible, but is labour-intensive. Relational learning techniques tend to be very sensitive
to noise and to the particular clausal representation selected. They are typically unable to
process large data sets in a reasonable time frame, and/or require the user to set limits on
the search such as refinement rules (Cohen, 1995).
In this paper, we use a generic constructive induction technique to allow for domains
where instances exhibit recurring substructures. For example, with Chinese character
recognition, the recurring substructure is a stroke. The user defines the recurring substructures
(termed metafeatures), but subsequent steps are automated. Further, our experimental
results show that a small set of generic metafeatures may be applicable to many temporal
domains. These substructures are extracted, and a novel clustering algorithm is used to
construct synthetic attributes based on the presence or absence of certain substructures.
Standard learners can then be applied.
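The pipeline described above (extract substructures, then build attributes from their presence or absence) can be illustrated with a minimal Python sketch. It makes several assumptions that are not from the paper: the metafeature is a local maximum described by its time and height, the function names `local_maxima` and `has_event_in_region` are hypothetical, and the regions of interest are chosen by hand here, whereas the approach in this paper constructs them with a clustering algorithm.

```python
def local_maxima(series):
    """Extract (time, height) events where a value exceeds both neighbours."""
    return [
        (t, series[t])
        for t in range(1, len(series) - 1)
        if series[t - 1] < series[t] > series[t + 1]
    ]


def has_event_in_region(events, t_range, h_range):
    """Synthetic boolean attribute: does any event fall inside the region?"""
    (t_lo, t_hi), (h_lo, h_hi) = t_range, h_range
    return any(t_lo <= t <= t_hi and h_lo <= h <= h_hi for t, h in events)


# Two toy single-channel instances (illustrative data only).
instances = {
    "a": [0, 3, 1, 0, 1, 0],
    "b": [0, 1, 0, 0, 5, 0],
}

# Hand-chosen regions standing in for the regions a clustering step would find.
regions = {
    "early_peak": ((0, 2), (0, 10)),  # any peak in the first three time steps
    "tall_peak": ((0, 5), (4, 10)),   # a peak of height 4 or more, at any time
}

# One row of boolean attributes per instance, ready for a propositional learner.
table = {
    name: {
        attr: has_event_in_region(local_maxima(series), *region)
        for attr, region in regions.items()
    }
    for name, series in instances.items()
}
```

Each instance, whatever its length, is reduced to a fixed-width row of attributes, which is what allows a standard attribute-value learner to be applied afterwards.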
The learnt concepts are expressed using the same substructures identified by the user.
Since these substructures are frequently the same concepts humans use themselves in
classifying instances, this results in readable descriptions. To our knowledge, there are very few
other systems that build classifiers for multivariate time series that are comprehensible.
There are several novel aspects to the approach. Recurring substructures are processed to
construct a set of features using a specially designed clustering technique. Temporal events,
global properties of the time series and specified attributes can all be combined within the
well-understood propositional framework. The results of propositional learning are
postprocessed to generate more human-readable descriptions. The net effect of employing these
techniques is a system that can easily and simply be applied to new temporal classification
problem domains with little or no modification.
This paper begins by motivating work in this field, and providing two examples, before
giving an overview of both related fields and directly related work. A theoretical and practical
definition of the problem is then presented. An overview of the approach taken using a
pedagogical domain is given, followed by a discussion of the implementation of TClass,
our temporal classification learner, with several extensions to the basic idea. Experimental
results using TClass on several domains are presented, followed by a conclusion and some
suggestions for future work.
Prevalence and importance of temporal classification domains
There are many real domains that are temporal in nature. An examination of the UCI
repository (Blake & Merz, 1998) reveals that there are at least six domains that were
originally temporal classification tasks but were propositionalised to make it possible to
use attribute-value learners.1
Examples of other problems that are temporal classification tasks include: gesture
recognition, printed character recognition, speaker identification and/or authentication,
classification of medical time series such as electrocardiographs and electroencephalograms, robot
sensor data analysis and more.
Given the importance and prevalence of these temporal classification problems, this work
aims to build a tool that can construct classifiers for such domains and be applied to them
out of the box, in much the same way that a toolkit like Weka (Witten & Frank, 1999) is
applied to propositional problems.
Consider two application domains of classification of multivariate time series (we will term
this temporal classification for convenience). These two examples will provide us with some
insights into the nature of the problem.
2.2.1. Tech support. This pedagogical domain is meant as an extremely simple example of
temporal classification. A computer company called SoftCorp provides telephone technical
support. These phone calls are recorded for analysis. SoftCorp discovers that the handling
of these phone calls has a huge impact on future buying patterns of its customers, so based
on the recordings of tech support responses, they hope to find the critical difference between
happy and angry customers.
An engineer suggests that the volume level of the conversation is an indication of
frustration level. SoftCorp divides each phone call into 30-second segments and works out the
average volume in each segment. If it is a high-volume conversation, it is marked as H,
while if it is at a reasonable volume level, it is labelled as L. On some subset of their
data they determine whether the tech support calls resulted in happy or angry customers by
independent means. Note that conversations are not of fixed length; some conversations are
short, others take a bit longer.
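As a rough illustration of this discretisation step, the following Python sketch turns a sequence of volume readings into a string of H and L labels, one per 30-second segment. The function name `label_segments`, the one-second sampling period and the 0.5 volume threshold are all assumptions made for the example; the text does not specify how SoftCorp samples volume or where it draws the line between high and reasonable volume.

```python
from statistics import mean


def label_segments(volumes, sample_period_s=1.0, segment_s=30.0, threshold=0.5):
    """Return one 'H' or 'L' label per segment of a call.

    volumes: volume readings taken every sample_period_s seconds (assumed).
    threshold: average volume above which a segment counts as high (assumed).
    """
    per_segment = int(segment_s / sample_period_s)
    labels = []
    for start in range(0, len(volumes), per_segment):
        # The final segment may be shorter than 30 seconds.
        segment = volumes[start:start + per_segment]
        labels.append("H" if mean(segment) > threshold else "L")
    return "".join(labels)


# A 70-second call: quiet for 30 s, then loud for the remaining 40 s.
call = [0.2] * 30 + [0.9] * 30 + [0.8] * 10
print(label_segments(call))  # prints "LHH"
```

Because calls differ in duration, the resulting label strings differ in length, which is exactly what prevents them from being fed directly to a fixed-width attribute-value learner.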
Six examples of recorded phone conversations are shown in Table 1. SoftCorp would like
to employ machine learning to find rules to predict whether, at the end of a conversation, a
customer is likely to be happy or angry.
2.2.2. Recognition of signs from a sign language. Consider the task of recognising signs
from a sign language2 using instrumented gloves. The glove provides the information shown
in Table 2.
Each instance is labelled with its class, and all of the values are sampled approximately
23 times a second. Each training instance consists of a sequence of measurements. Training
instances di (...truncated)