A method to improve the performance of multilayer perceptron by utilizing various activation functions in the last hidden layer and the least squares method

https://link.springer.com/content/pdf/10.1007%2Fs11063-011-9199-4.pdf

Krzysztof Halawa

This article presents a fast and uncomplicated method of modifying multilayer perceptrons that allows a considerable single-step reduction of the cost function, which in this case is the mean of squared errors. The method consists in, but is not limited to, changing the neuron activation functions in the last hidden layer and applying the least squares method once. No changes are made to neuron weights in any hidden layer. An essential strength of the method is that it can be used to improve the operation of previously trained networks, so the learning process need not be restarted from the very beginning.

Many problems are encountered when training a multilayer perceptron (MLP). Local minima of the cost function cause serious difficulty. The results are greatly influenced by the initial values of the weights, so the learning process is often run repeatedly with various initial weights. Usually, gradient algorithms such as the Levenberg-Marquardt, conjugate gradient, and variable metric methods are used to train MLPs. The papers [1,2], among others, describe training with the recursive least squares method (LSM).

There are algorithms that alternate non-linear optimization techniques with the LSM. The staggered training of MLP [3] is one of them. Initially, LSM is used to optimize the weights of the output layer. The weights of the remaining layers are then subjected to non-linear optimization. These two steps are repeated alternately. When determination of the weights is completed, the weights are pruned in order to reduce the number of connections and to improve the generalization capability of the network.

The MLP learning problems convinced numerous researchers to search for other network structures in which all the parameters that change during learning can be optimized in a single step using LSM.
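The first step of staggered training, as described above, fits the output-layer weights by least squares while the hidden layers stay fixed. A minimal NumPy sketch of that single step follows. The toy data, the layer sizes, and the helper names `add_bias`, `fit_output_layer`, and `mse` are illustrative assumptions and are not taken from [3].

```python
import numpy as np


def add_bias(hidden_out):
    """Append a column of ones so the last weight acts as the output bias."""
    return np.hstack([hidden_out, np.ones((hidden_out.shape[0], 1))])


def fit_output_layer(hidden_out, targets):
    """Least-squares fit of the linear output weights; hidden layers stay fixed."""
    weights, *_ = np.linalg.lstsq(add_bias(hidden_out), targets, rcond=None)
    return weights


def mse(hidden_out, weights, targets):
    """Mean of squared errors of the linear output neuron."""
    return float(np.mean((add_bias(hidden_out) @ weights - targets) ** 2))


# Toy setup (assumed for illustration): 200 samples, 3 inputs, 8 hidden neurons.
rng = np.random.default_rng(0)
inputs = rng.uniform(-1.0, 1.0, size=(200, 3))
hidden_w = rng.normal(size=(3, 8))      # fixed hidden-layer weights
hidden_b = rng.normal(size=8)           # fixed hidden-layer biases
hidden_out = np.tanh(inputs @ hidden_w + hidden_b)
targets = np.sin(inputs.sum(axis=1))    # toy desired outputs

w_random = rng.normal(size=9)           # arbitrary starting output weights
w_lsm = fit_output_layer(hidden_out, targets)

mse_before = mse(hidden_out, w_random, targets)
mse_after = mse(hidden_out, w_lsm, targets)
print(f"MSE with random output weights: {mse_before:.4f}")
print(f"MSE after one least-squares fit: {mse_after:.4f}")
```

Because the network output is linear in the output-layer weights, this fit is a single linear solve and no iteration is needed. The non-linear optimization of the remaining layers is not shown.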
Examples of such single-step structures are the networks with orthogonal activation functions proposed in [4–7], in which only the weights of the linear output neurons are changed. A significant drawback of these networks is that the number of weights grows exponentially as the number of inputs increases. In [8] it was proven that for an MLP with one hidden layer of neurons with sigmoidal activation functions, the integrated squared error is of the order $O(1/n)$, where $n$ is the number of neurons in the hidden layer. It was assumed that the function approximated by the network has a bound on the first moment of the magnitude distribution of its Fourier transform. Barron [8] also demonstrated that for networks with radial basis functions (RBFs), the integrated squared approximation error cannot be less than $O\left(1/n^{2/q}\right)$, where $q$ is the input dimension.

Despite its difficult learning process, the MLP is one of the most popular network structures because it has essential advantages, one of which is its suitability for multi-dimensional data. There are numerous applications where MLPs have hundreds of inputs and large numbers of neurons. From time to time, an operating network needs to be improved without restarting the whole learning and pruning processes from the beginning. This article proposes such a method.

Neurons in the hidden layers of an MLP have activation functions of sigmoidal shape. An example of such a function is $f(x) = \tanh(x)$. In order to reduce the number of numerical calculations, the hyperbolic tangent can be replaced by the bipolar function $f(x) = \frac{2}{1+\exp(-2x)} - 1$ or the binary function $f(x) = \frac{1}{1+\exp(-2x)}$. Many microcontrollers have a low processing capacity. In devices controlled by such microcontrollers, look-up tables with the values needed to interpolate the activation function, or piecewise polynomial models [9], may be used. Numerous papers have been published in which various shapes of activation functions were examined.
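As an illustration of the look-up-table approach mentioned above for devices with low processing capacity, the following sketch approximates tanh by linear interpolation between stored samples. The table size, the input range, and the function names `build_table` and `tanh_lut` are assumptions chosen for the example and are not taken from [9] or from this article.

```python
import math


def build_table(x_max=4.0, n_points=65):
    """Sample tanh on a uniform grid over [0, x_max]."""
    step = x_max / (n_points - 1)
    return step, [math.tanh(i * step) for i in range(n_points)]


def tanh_lut(x, step, table):
    """Approximate tanh(x) by linear interpolation between table entries."""
    sign = -1.0 if x < 0 else 1.0       # tanh is odd, so only x >= 0 is stored
    pos = abs(x) / step                 # fractional index into the table
    idx = int(pos)
    if idx >= len(table) - 1:           # beyond the grid: saturate
        return sign * table[-1]
    frac = pos - idx
    return sign * (table[idx] + frac * (table[idx + 1] - table[idx]))


step, table = build_table()

# Largest deviation from math.tanh on a dense grid over [-6, 6].
max_err = max(
    abs(tanh_lut(x / 100.0, step, table) - math.tanh(x / 100.0))
    for x in range(-600, 601)
)
print(f"maximum absolute error: {max_err:.2e}")
```

Exploiting the odd symmetry of tanh halves the table, and inputs beyond `x_max` saturate at the last stored value. A finer grid or a larger `x_max` trades memory for accuracy.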
In the published studies of activation function shapes, iterative non-linear optimization algorithms were used to select the parameters of these functions. In [10] the activation function $f(x) = a\,\frac{1-\exp(-bx)}{1+\exp(-bx)}$ was used, where $a$ and $b$ were parameters selected with gradient algorithms. In [11], Catmull-Rom spline curves were applied. In [12], it was proven that a perceptron with at least one hidden layer is a universal approximator provided that the activation functions of the neurons are squashing functions. A function $f: \mathbb{R} \to [0, 1]$ is a squashing function if it is non-decreasing, $\lim_{x \to \infty} f(x) = 1$ and $\lim_{x \to -\infty} f(x) = 0$. If $\lim_{x \to -\infty} f(x) = -1$, then the MLP is also a universal approximator.

The remainder of this article is structured as follows. Section 2 outlines the proposed method. Section 3 provides the results of numerical experiments, in which the results obtained with the proposed method are compared with those of an MLP with sigmoid activation functions in all hidden layers. A summary is placed at the end of the article.

2 Proposed method

If the proposed method is used, the activation functions of the neurons in the last hidden layer may take various shapes. To select the shapes of all these activation functions, it is only necessary to solve the set of normal equations once. In this article, it is assumed that the MLP whose performance is to be improved has neurons with sigmoid activation functions in the hidden layers, whereas the neurons in the output layer have the linear activation function $f_{out}(x) = x$. Output layers very often consist of such linear neurons. For the sake of concise notation, the formulae are given for a MISO (Multiple Input, Single Output) MLP. A similar procedure is used for networks with several outputs. It is assumed that the minimized cost function is

$$E = \frac{1}{N} \sum_{i=1}^{N} \left( F(u_i) - d_i \right)^2, \qquad (1)$$

where $F(u_i)$ is the network output value when the network inputs are equal to the elements of the vector $u_i = [u_{1,i}, u_{2,i}, \ldots, u_{q,i}]^T$, $q$ is the number of network inputs, $d_i$ is the desired value of the network output associated with $u_i$, and $N$ is the number of pairs $\{u_i, d_i\}$ in the learning set. Formula (1) is commonly used as a cost function.

Fig. 1 The structure of the k-th neuron in the last hidden layer; $\Sigma$ denotes the adder, $r$ is the number of neuron inputs, and $v_{k,1}, v_{k,2}, \ldots, v_{k,r}$ are the weights assigned to the connections with the previous layer

Let $f(x)$ denote the activation function in the last hidden layer prior to application of the proposed method. Let $g_1(x), g_2(x), \ldots, g_m(x)$ be the consecutive functions of the series $f(2^{-h}x), f(2^{-h+1}x), \ldots, f(x), \ldots, f(2^{h-1}x), f(2^{h}x)$, where $h$ is a positive integer and $m = 2h + 1$. In applications where a small number of arithmetic operations is important, the author of this article recommends assuming $h < 2$. Increasing $h$ may result in a significant decrease of the cost function. After $h$ is selected, the activation functions of all neurons in the last hidden layer are changed into the functions $f_k(x) = w_{k,1} g_1(x) + w_{k,2} g_2(x) + \cdots + w_{k,m} g_m(x)$, where $f_k(x)$ (...truncated)