Robust Learning Algorithm Based on Iterative Least Median of Squares
Andrzej Rusiecki
Outliers and gross errors in training data sets can seriously degrade the performance of traditional supervised learning algorithms for feedforward neural networks. For this reason, several learning methods that are to some extent robust to outliers have been proposed. In this paper we present a new robust learning algorithm based on the iterative Least Median of Squares that outperforms some existing solutions in accuracy or speed. We demonstrate how to minimise the new non-differentiable performance function by a deterministic approximate method. Results of simulations and a comparison with other learning methods are presented. Improved robustness of our novel algorithm is shown for data sets with varying degrees of outliers.
1 Introduction
Feedforward artificial neural networks (FNN) have been successfully applied in areas such
as function approximation, pattern recognition, and signal and image processing. Because
FNNs are universal approximators [9,10], they can potentially be used for any type of
problem that requires modelling of unknown input-output dependencies. Such networks build
their models based on training sets consisting of exemplary input-output patterns. The main
advantage of this approach is its simplicity, since no prior knowledge about the modelled
system is required. These networks are usually trained to minimise an error function defined
to measure the distance between the current and desired output. During the training process,
FNNs try to fit the training data as closely as possible. Unfortunately, the performance of this
type of learning scheme relies strongly on the quality of the training data [8,11,15]. When the
data are corrupted with large noise or outliers, the network is trained on erroneous examples
and tries to model a system different from the desired one. This is because the most popular
backpropagation (BP) learning algorithm and many of its variants use the mean squared error
(MSE) function. This strategy, based on the least mean squares method, is optimal only for
clean data or data with normally distributed errors.
Outliers may be defined as observations deviating strongly from the majority of the data.
Unfortunately, in routine data, the proportion of outliers can range from 1 to 10% [8], or in
certain cases even more. They may be caused by measurement errors, human mistakes such
as copying errors or misplaced decimal points, long-tailed noise resulting in a different sample
distribution, measurements of members of wrong populations, rounding errors and many
other reasons.
When we deal with a multidimensional data set, finding even one outlying observation
involves computationally expensive methods. When more outliers are present, the
situation becomes much more complicated.
In this paper, we present a new learning algorithm that is robust to various degrees of
outlying data in training sets. The novel algorithm takes advantage of the idea of the least
median of squares estimator. It is applied iteratively to remove outliers from the training data,
but it also provides satisfactory performance when the network is trained on a clean data
set.
2 Network Training with Outliers
Learning algorithms for feedforward networks that are based on the minimisation of some
kind of criterion function use backpropagation to calculate the performance gradient with
respect to the network weights (and biases, which may also be considered as additional weights).
To introduce the network performance function, let us consider, without loss of generality, a
simple three-layer feedforward neural network with one hidden layer. We assume that the
training set consists of N pairs:
$\{(x_1, t_1), (x_2, t_2), \ldots, (x_N, t_N)\}$, where $x_i \in \mathbb{R}^p$ denotes the p-dimensional i-th input vector
and $t_i \in \mathbb{R}^q$ the corresponding q-dimensional network target. For the given input vector
$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$, the output of the j-th of l neurons of the hidden layer may be
calculated as:

$$z_{ij} = f_1\left(\sum_{k=1}^{p} w_{jk} x_{ik} - b_j\right) = f_1(\mathit{inp}_{ij}), \quad \text{for } j = 1, 2, \ldots, l, \tag{1}$$
where $f_1(\cdot)$ is the activation function of the hidden layer, $\mathit{inp}_{ij}$ is the sum of its weighted
inputs, $w_{jk}$ is the weight between the k-th net input and the j-th neuron, and $b_j$ is the bias of the
j-th neuron. For such a network, the output $y_i = (y_{i1}, y_{i2}, \ldots, y_{iq})^T$ is given as:

$$y_{iv} = f_2\left(\sum_{j=1}^{l} w_{vj} z_{ij} - b_v\right) = f_2(\mathit{inp}_{iv}), \quad \text{for } v = 1, 2, \ldots, q. \tag{2}$$
Here $f_2(\cdot)$ denotes the output layer activation function, $w_{vj}$ is the weight between the
v-th neuron of the output layer and the j-th neuron of the hidden layer, and $b_v$ is the bias
of the v-th neuron of the output layer. When $f_1$ and $f_2$ are similar, these equations can be
simplified. However, for function approximation or regression tasks, the most common
approach is to use the sigmoid activation function in the hidden layers and a linear activation
in the output layer.
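The forward pass described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `forward`, `W1`, `b1`, `W2` and `b2` are hypothetical and do not come from the paper, a logistic sigmoid is assumed for $f_1$, a linear activation for $f_2$, and the biases are assumed to be subtracted from the weighted sums.

```python
import numpy as np


def sigmoid(a):
    """Logistic sigmoid, assumed here as the hidden activation f1."""
    return 1.0 / (1.0 + np.exp(-a))


def forward(X, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer network for a batch of inputs.

    X  : (N, p) input vectors x_i
    W1 : (l, p) hidden weights w_jk,  b1 : (l,) hidden biases b_j
    W2 : (q, l) output weights w_vj,  b2 : (q,) output biases b_v
    Returns hidden outputs Z (N, l) and network outputs Y (N, q).
    """
    inp_hidden = X @ W1.T - b1   # inp_ij = sum_k w_jk x_ik - b_j (sign assumed)
    Z = sigmoid(inp_hidden)      # z_ij = f1(inp_ij)
    inp_out = Z @ W2.T - b2      # inp_iv = sum_j w_vj z_ij - b_v (sign assumed)
    Y = inp_out                  # linear output activation f2
    return Z, Y


# Small random example: N = 5 patterns, p = 3 inputs, l = 4 hidden, q = 2 outputs.
rng = np.random.default_rng(0)
N, p, l, q = 5, 3, 4, 2
X = rng.normal(size=(N, p))
W1, b1 = rng.normal(size=(l, p)), rng.normal(size=l)
W2, b2 = rng.normal(size=(q, l)), rng.normal(size=q)
Z, Y = forward(X, W1, b1, W2, b2)
print(Z.shape, Y.shape)
```

Processing the whole training set as one matrix keeps the code close to the per-pattern equations while avoiding explicit loops over i, j and v.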
For the residuals $r_i$ written as:

$$r_i = \sum_{v=1}^{q} |(y_{iv} - t_{iv})|, \tag{3}$$

the performance function may be defined as:

$$E = \frac{1}{N} \sum_{i=1}^{N} \rho(r_i), \tag{4}$$

where $\rho(r_i)$ is a symmetric and continuous loss function [8], $r_i$ is the error for the i-th
training pattern (3), and N is the number of elements in the training set. The most popular loss
function is of quadratic form:

$$\rho(r_i) = r_i^2. \tag{5}$$

For the quadratic loss function we obtain the minimised error equal to the MSE:

$$E_{mse} = \frac{1}{N} \sum_{i=1}^{N} r_i^2. \tag{6}$$
The influence function [8,14] was introduced to measure the impact of data errors on the
training process. It may be defined as the derivative of the loss function with respect to
the residuals:

$$\psi(r_i) = \frac{\partial \rho(r_i)}{\partial r_i}. \tag{7}$$

If we assume the MSE performance function, with the quadratic loss $\rho(r_i) = r_i^2$, then the influence function becomes linear:

$$\psi(r_i) = 2 r_i, \tag{8}$$

which means that the larger the error, the more it affects the training process. Since large errors are
often caused by outliers, this property can be very harmful. This is why various
robust learning algorithms based on robust estimators have been proposed [1,2,14,20].
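The effect of a single gross error on the quadratic criterion can be checked numerically. The sketch below is illustrative only: the function names and the toy targets are hypothetical, the quadratic loss is assumed to be $\rho(r) = r^2$ so that its derivative is $2r$, and the median of squared residuals is shown merely as the quantity on which the least median of squares idea from the introduction is built, not as the algorithm proposed in the paper.

```python
import numpy as np


def residuals(Y, T):
    """r_i = sum over outputs v of |y_iv - t_iv| for each pattern i."""
    return np.abs(Y - T).sum(axis=1)


def performance(r, rho):
    """E = (1/N) * sum_i rho(r_i) for a given loss function rho."""
    return np.mean(rho(r))


def quadratic_loss(r):
    """Quadratic loss, assumed here as rho(r) = r^2."""
    return r ** 2


def quadratic_influence(r):
    """Derivative of the assumed quadratic loss: psi(r) = 2 r."""
    return 2.0 * r


# Toy data (hypothetical): five patterns, one output each.
T = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # clean targets
Y = np.array([[1.1], [1.9], [3.2], [3.9], [5.0]])   # network outputs
T_out = T.copy()
T_out[4, 0] = 50.0                                  # one gross error in the targets

r_clean = residuals(Y, T)
r_out = residuals(Y, T_out)

# Mean of squared residuals: dominated by the single outlier.
print("MSE, clean targets:    ", performance(r_clean, quadratic_loss))
print("MSE, one outlier:      ", performance(r_out, quadratic_loss))

# Influence of each pattern under the quadratic loss grows linearly with r_i.
print("influence per pattern: ", quadratic_influence(r_out))

# Median of squared residuals: barely affected by the single outlier.
print("median of squares:     ", np.median(quadratic_loss(r_out)))
```

With one corrupted target out of five, the mean of the squared residuals is determined almost entirely by the corrupted pattern, while the median of the squared residuals stays at the level of the clean patterns.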
3 Robust Learning Algorithms
In the field of robust statistics [8,11], many methods to deal with the problem of outliers have
been proposed. They are designed to act properly when the true underlying model deviates
from the assumptions, such as a normal error distribution. There are robust methods that detect
and remove outlying data before the model is built, but most of them, including robust
estimators, should be efficient and reliable even if outliers appear. At the same time, they should
perform well for observations that are very close to the assumed model.
The simplest way to make a traditional neural network learning algorithm more robust to
outliers is to replace the quadratic error with another symmetric and continuous loss function,
resulting in a nonlinear influence function. Such nonlinearity should reduce the influence of
large errors. Robust loss functions can be based on robust estimators with a proven ability
to tolerate different amounts of outlying data. Replacing the MSE performance function with
a new robust function results in a robust learning method with reduced impact of outliers.
Several such algorithms desti (...truncated)