Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors
Mach Learn
Kai Ming Ting · Takashi Washio · Jonathan R. Wells · Sunil Aryal

Corresponding author: Jonathan R. Wells

The Institute of Scientific and Industrial Research, Osaka University, Ibaraki, Japan
School of Engineering and Information Technology, Federation University, Churchill, Australia
Abstract Conventional wisdom in machine learning says that all algorithms are expected to follow the trajectory of a learning curve, often colloquially referred to as 'more data the better'. We call this 'the gravity of learning curve', and it is assumed that no learning algorithms are 'gravity-defiant'. Contrary to the conventional wisdom, this paper provides a theoretical analysis and empirical evidence that nearest neighbour anomaly detectors are gravity-defiant algorithms.

Editor: Joao Gama.
Keywords Learning curve · Anomaly detection · Nearest neighbour · Computational geometry · AUC
1 Introduction
In the machine learning context, a learning curve describes the rate of task-specific performance improvement of a learning algorithm as the training set size increases. A typical learning curve is shown in Fig. 1. The testing error, as a measure of the learning algorithm's performance, decreases at a fast rate when the training sets are small; the rate of decrease then slows gradually until it reaches a plateau as the training sets grow large.

Fig. 1 A typical learning curve versus a gravity-defiant learning curve (y-axis: testing error; x-axis: number of training instances)
Conventional wisdom in machine learning says that all algorithms are expected to follow
the trajectory of a learning curve, though the actual rate of performance improvement may
differ from one algorithm to another. We call this ‘the gravity of learning curve’, and it is
assumed that no learning algorithms are ‘gravity-defiant’.
Recent research (Liu et al. 2008; Zhou et al. 2012; Sugiyama and Borgwardt 2013; Wells et al. 2014; Bandaragoda et al. 2014; Pang et al. 2015) has provided an indication that some algorithms may defy the gravity of learning curve, i.e., these algorithms can learn a better performing model from a small training set than from a large one. However, no concrete evidence of the 'gravity-defiant' behaviour has been provided in the literature, let alone the reason why these algorithms behave this way.
'Gravity-defiant' algorithms have a key advantage: they produce a well-performing model from a training set significantly smaller than that required by 'gravity-compliant' algorithms. They therefore yield significant savings in time and memory space that the conventional wisdom thought impossible.
This paper focuses on nearest neighbour-based anomaly detectors because they have been shown to be one of the most effective classes of anomaly detectors (Breunig et al. 2000; Sugiyama and Borgwardt 2013; Wells et al. 2014; Bandaragoda et al. 2014; Pang et al. 2015).
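For concreteness, the basic idea behind such detectors is that an instance's anomaly score is its distance to its k-th nearest neighbour in a reference sample, so instances lying far from all sampled instances receive high scores. The following is a minimal sketch of this idea using only the standard library, not the exact formulation of any detector cited above (the function name, the toy data, and the choice of k are illustrative):

```python
import math

def knn_anomaly_score(x, sample, k=1):
    """Score x by the distance to its k-th nearest neighbour in `sample`.

    A large score means x is far from the reference data, i.e. more anomalous.
    """
    dists = sorted(math.dist(x, s) for s in sample)
    return dists[k - 1]

# Toy data: a tight cluster of normal points and one far-away anomaly.
normal = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
anomaly = (5.0, 5.0)

print(knn_anomaly_score((0.05, 0.05), normal))  # small: close to the cluster
print(knn_anomaly_score(anomaly, normal))       # large: far from the cluster
```

Ranking all test instances by this score and sweeping a threshold over it is what the AUC in the analysis below measures.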
This paper makes the following contributions:
1. Provide a theoretical analysis of nearest neighbour-based anomaly detection algorithms which reveals that their behaviours defy the gravity of learning curve. As far as we know, this is the first analysis of learning curve behaviour in machine learning research that is based on computational geometry.
2. The theoretical analysis provides an insight into the behaviour of the nearest neighbour anomaly detector. In sharp contrast to the conventional wisdom of 'more data the better', the analysis reveals that sample size has three impacts which have not been considered before. First, increasing the sample size increases the likelihood of anomaly contamination in the sample; any inclusion of anomalies in the sample increases the false negative rate and thus lowers the AUC. Second, the optimal sample size depends on the data distribution. As long as the data distribution is not sufficiently represented by the current sample, increasing the sample size will improve the AUC. The optimal size is the number of instances that best represents the geometry of normal instances and anomalies; this gives the optimal separation between normal instances and anomalies, encapsulated as the average nearest neighbour distance to anomalies. Third, increasing the sample size decreases the average nearest neighbour distance to anomalies. Increasing beyond the optimal sample size reduces the separation between normal instances and anomalies below the optimal. This leads to decreased AUC and gives rise to the gravity-defiant behaviour.
3. Present empirical evidence of the gravity-defiant behaviour using three nearest neighbour-based anomaly detectors in the unsupervised learning context.
In addition, this paper uncovers two features of nearest neighbour anomaly detectors:
A. Some nearest neighbour anomaly detectors can achieve high detection accuracy with a significantly smaller sample size than others.
B. A (...truncated)