A method to improve the performance of multilayer perceptron by utilizing various activation functions in the last hidden layer and the least squares method
Krzysztof Halawa
This article presents a fast and simple method for modifying multilayer perceptrons that allows a considerable single-step reduction of the cost function, which in this case is the mean of squared errors. The method consists chiefly in changing the neuron activation functions in the last hidden layer and in a single application of the least squares method. No changes are made to the neuron weights in any hidden layer. An essential strength of the method is that it can be used to improve networks trained earlier, so the learning process need not be restarted from the very beginning.
1 Introduction
Many problems are encountered when training a multilayer perceptron (MLP). Local minima
of the cost function cause serious difficulty, and the results are strongly influenced by the
initial values of the weights, so the learning process is often run repeatedly with various
initial weights. Usually, gradient algorithms such as the Levenberg-Marquardt, conjugate
gradient and variable metric methods are used to train MLPs. The papers [1,2], among others,
describe learning with the recursive variant of the least squares method (LSM). There are also
algorithms that alternate non-linear optimization techniques with the LSM. The staggered
training of MLP [3] is one of them. Initially, the LSM is used to optimize the weights of the
output layer. The weights of the remaining layers are then subjected to non-linear
optimization. These two steps are repeated alternately. When the determination of the weights
is completed, the weights are pruned in order to reduce the number of connections and to
improve the generalization capability of the network. The difficulties of MLP learning have
led numerous researchers to search for other network structures in which all parameters that
are changed during learning can be optimized in a single step using the LSM. For instance,
networks with orthogonal activation functions, where changes are made only to linear output
neurons, were proposed in [4-7]. A significant drawback of these networks is that the number
of weights grows exponentially with the number of inputs. In [8] it was proven that for an MLP
with one hidden layer and neurons with sigmoidal activation functions, the integrated squared
error is of the order $O(1/n)$, where $n$ is the number of neurons in the hidden layer. It was
assumed that the function approximated by the network has a bound on the first moment of the
magnitude distribution of its Fourier transform. Barron [8] also demonstrated that for
networks with radial basis functions (RBFs), the integrated squared approximation error
cannot be less than $O(1/n^{2/q})$, where $q$ is the input dimension.
Despite its difficult learning process, the MLP is one of the most popular network structures
because it has essential advantages, one of which is its suitability for multi-dimensional
data. There are numerous applications in which MLPs have hundreds of inputs and large numbers
of neurons. From time to time, an operating network needs to be improved without restarting
the whole learning and pruning process from the beginning. This article proposes such a
method.
Neurons in the hidden layers of an MLP have sigmoidal activation functions. An example of such
a function is $f(x) = \tanh(x)$. In order to reduce the number of numerical calculations, the
hyperbolic tangent can be replaced by the bipolar function $f(x) = \frac{2}{1+\exp(-2x)} - 1$
or the binary function $f(x) = \frac{1}{1+\exp(-2x)}$. Many microcontrollers have a low
processing capacity. In devices controlled by such microcontrollers, look-up tables with the
values needed to interpolate the activation function, or piecewise polynomial models [9], may
be used. Numerous papers have examined various shapes of activation functions, with the
parameters of these functions selected iteratively by non-linear optimization algorithms.
In [10] the activation function $f(x) = a\,\frac{1-\exp(-bx)}{1+\exp(-bx)}$ was used, where
$a$ and $b$ were parameters selected with gradient algorithms. In [11], Catmull-Rom spline
curves were applied.
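The activation functions mentioned above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the table range and the table step are assumptions made here and do not come from the cited papers.

```python
import math

# Note: math.exp(-2 * x) raises OverflowError for x below roughly -350;
# inputs are assumed to be of moderate magnitude in this sketch.

def bipolar(x):
    """Bipolar sigmoid 2 / (1 + exp(-2x)) - 1, algebraically equal to tanh(x)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def binary(x):
    """Binary sigmoid 1 / (1 + exp(-2x)), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-2.0 * x))

def parametric(x, a, b):
    """Two-parameter form a * (1 - exp(-b x)) / (1 + exp(-b x))."""
    e = math.exp(-b * x)
    return a * (1.0 - e) / (1.0 + e)

# Hypothetical look-up table: tanh sampled on [-4, 4] with step 0.25.
X_MIN, X_MAX, STEP = -4.0, 4.0, 0.25
TABLE = [math.tanh(X_MIN + i * STEP) for i in range(int((X_MAX - X_MIN) / STEP) + 1)]

def tanh_lut(x):
    """Approximate tanh(x) by linear interpolation in TABLE, saturating outside the range."""
    if x <= X_MIN:
        return TABLE[0]
    if x >= X_MAX:
        return TABLE[-1]
    pos = (x - X_MIN) / STEP
    i = int(pos)
    frac = pos - i
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])
```

With $a = 1$ and $b = 2$ the parametric form coincides with the hyperbolic tangent, so a gradient search over $a$ and $b$ can start from the standard sigmoidal shape.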
In [12], it was proven that a perceptron with at least one hidden layer is a universal
approximator provided that the activation functions of the neurons are squashing functions.
A function $f: \mathbb{R} \to [0, 1]$ is a squashing function if it is non-decreasing,
$\lim_{x \to \infty} f(x) = 1$ and $\lim_{x \to -\infty} f(x) = 0$. If instead
$\lim_{x \to -\infty} f(x) = -1$, the MLP is also a universal approximator.
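The bipolar case can be related to the squashing-function case by an affine rescaling. The short derivation below is a standard argument and is not taken from [12]. It assumes a non-decreasing $f$ with limits $-1$ and $1$ and an output neuron with a bias term; the symbols $g$, $c_0$, $c_k$ and $s_k$ are notation introduced here for illustration only.

```latex
\begin{align*}
g(x) &= \frac{f(x) + 1}{2}
  \quad\Longrightarrow\quad
  \lim_{x \to \infty} g(x) = 1, \qquad \lim_{x \to -\infty} g(x) = 0, \\
c_0 + \sum_{k} c_k f(s_k)
  &= \Bigl(c_0 - \sum_{k} c_k\Bigr) + \sum_{k} 2 c_k\, g(s_k).
\end{align*}
```

Under these assumptions $g$ is a squashing function, and every network built from $f$ can be rewritten as a network built from $g$ with rescaled output weights and a shifted bias.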
The rest of this article is structured as follows. Section 2 outlines the proposed method.
Section 3 provides the results of numerical experiments, in which the results obtained with
the proposed method are compared with those of an MLP with sigmoid activation functions in
all hidden layers. A summary is placed at the end of the article.
2 Proposed method
If the proposed method is used, the activation functions of the neurons in the last hidden layer
may take various shapes. To select the shapes of all these activation functions, it is only
necessary to solve the set of normal equations once. In this article, it is assumed that the
MLP whose performance is to be improved has neurons with sigmoid activation functions in the
hidden layers, whereas the neurons in the output layer have the linear activation function
$f_{out}(x) = x$. In practice, the output layer very often consists of neurons with a linear
activation function.
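Solving the normal equations once is the generic way of fitting any model that is linear in its unknown coefficients. The following is a minimal sketch in plain Python, assuming a design matrix `phi` with one row per learning pair and a target list `d`. How the rows are formed from the last hidden layer is not shown here, and the function name and the toy data are hypothetical.

```python
def solve_normal_equations(phi, d):
    """Return w minimizing sum_i (phi[i] . w - d[i])**2 via (Phi^T Phi) w = Phi^T d."""
    n, m = len(phi), len(phi[0])
    # Build the m x m matrix A = Phi^T Phi and the vector b = Phi^T d.
    a = [[sum(phi[i][r] * phi[i][c] for i in range(n)) for c in range(m)] for r in range(m)]
    b = [sum(phi[i][r] * d[i] for i in range(n)) for r in range(m)]
    # Gaussian elimination with partial pivoting on the system A w = b.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("Phi^T Phi is (numerically) singular")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution on the resulting upper-triangular system.
    w = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(a[r][c] * w[c] for c in range(r + 1, m))
        w[r] = (b[r] - s) / a[r][r]
    return w

# Hypothetical toy data generated from d = 2*x1 - 3*x2 + 0.5 (last column is a constant 1).
phi = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0], [2.0, 1.0, 1.0], [1.0, 3.0, 1.0]]
d = [2.0 * row[0] - 3.0 * row[1] + 0.5 for row in phi]
w = solve_normal_equations(phi, d)
```

Forming $\Phi^T \Phi$ explicitly squares the condition number of the problem, so for ill-conditioned data a QR- or SVD-based solver such as `numpy.linalg.lstsq` is usually the safer choice.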
Fig. 1 The structure of the k-th neuron in the last hidden layer; $\Sigma$ denotes the adder, $r$ is the number of neuron inputs, and $v_{k,1}, v_{k,2}, \ldots, v_{k,r}$ are the weights assigned to the connections with the previous layer

For the sake of concise notation, the formulae are given for a MISO (Multiple Input,
Single Output) MLP. A similar procedure is used for networks with several outputs. It is
assumed that the minimized cost function is

$$E = \frac{1}{N} \sum_{i=1}^{N} \left( F(u_i) - d_i \right)^2, \qquad (1)$$

where $F(u_i)$ is the network output when the network inputs are equal to the elements of the
vector $u_i = [u_{1,i}, u_{2,i}, \ldots, u_{q,i}]^T$, $q$ is the number of network inputs,
$d_i$ is the desired network output associated with $u_i$, and $N$ is the number of pairs
$\{u_i, d_i\}$ in the learning set. Formula (1) is commonly used as a cost function.
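As a concrete reading of formula (1), the cost can be computed as below. This is a minimal sketch in which `toy_network` is a hypothetical stand-in for a trained single-output network, and the learning set is made up for illustration.

```python
def mean_squared_error(network, inputs, targets):
    """Cost (1): E = (1/N) * sum_i (F(u_i) - d_i)**2 for a single-output network F."""
    if len(inputs) != len(targets) or not inputs:
        raise ValueError("inputs and targets must be non-empty and of equal length")
    return sum((network(u) - d) ** 2 for u, d in zip(inputs, targets)) / len(inputs)

# Hypothetical stand-in for a trained MISO network with q = 2 inputs.
def toy_network(u):
    return 0.5 * u[0] - 0.25 * u[1]

U = [(1.0, 2.0), (0.0, 4.0), (2.0, 0.0)]   # input vectors u_i
D = [0.0, -1.0, 2.0]                       # desired outputs d_i
E = mean_squared_error(toy_network, U, D)
```

Here the network outputs are 0, -1 and 1, so only the third pair contributes an error and $E = 1/3$.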
Let $f(x)$ denote the activation function in the last hidden layer prior to the application of
the proposed method. Let $g_1(x), g_2(x), \ldots, g_m(x)$ be the consecutive functions of the
series

$$f(2^{-h} x),\ f(2^{-h+1} x),\ \ldots,\ f(x),\ \ldots,\ f(2^{h-1} x),\ f(2^{h} x),$$

where $h$ is a positive integer and $m = 2h + 1$. In applications where a small number of
arithmetical operations is important, the author of this article recommends choosing $h < 2$.
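The family $g_1, \ldots, g_m$ can be generated programmatically. The sketch below assumes that the series consists of the original activation function evaluated at the input scaled by $2^{-h}, 2^{-h+1}, \ldots, 2^{h}$, which is consistent with $m = 2h + 1$. The helper name `make_basis` is hypothetical.

```python
import math

def make_basis(f, h):
    """Return [g_1, ..., g_m], where g_j(x) = f(2**(j - 1 - h) * x) and m = 2*h + 1."""
    if h < 1:
        raise ValueError("h must be a positive integer")
    # The default argument s binds each scale to its own lambda.
    return [lambda x, s=2.0 ** e: f(s * x) for e in range(-h, h + 1)]

basis = make_basis(math.tanh, h=1)   # scales 1/2, 1 and 2
values = [g(0.5) for g in basis]     # tanh(0.25), tanh(0.5), tanh(1.0)
```

For $h = 1$ each neuron evaluates three scaled copies of the activation function, and for $h = 2$ it evaluates five, which is why small values of $h$ keep the arithmetic cost low.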
Increasing $h$ may result in a significant decrease of the cost function. After $h$ has been
selected, the activation functions of all neurons in the last hidden layer are changed into
the functions

$$f_k(x) = w_{k,1} g_1(x) + w_{k,2} g_2(x) + \cdots + w_{k,m} g_m(x),$$
where fk (x ) (...truncated)