Adaptive Structural Optimisation of Neural Networks

N. P. Suraweera, D. N. Ranasinghe

Abstract: Structural design of an artificial neural network (ANN) is a very important phase in the construction of such a network. The number of hidden layers and hidden nodes has a significant impact on the performance of a neural network, yet it is typically decided in an ad hoc manner. In this paper, the structure of a neural network is adaptively optimised by determining the number of hidden layers and hidden nodes that give the optimal performance in a given problem domain. Two optimisation approaches have been developed based on the Particle Swarm Optimisation (PSO) algorithm, an evolutionary algorithm that uses a cooperative approach. These approaches have been applied to two well-known case studies in the classification domain, namely the Iris data classification and the Ionosphere data classification. The results obtained, and comparisons with past research work, clearly show that this method of optimisation is by far the best approach for adaptive structural optimisation of ANNs.


INTRODUCTION
Artificial Neural Networks (ANNs), which have been inspired by biological neural networks, are used especially to imitate many qualities seen in human beings, such as identifying objects and patterns, making decisions based on prior experience and accumulated knowledge, and predicting future events based on past happenings. The fact that the human brain is very efficient in carrying out these actions is mainly attributable to its complex and intricate, but very effective, neural network structure. Besides the learning algorithm of a specific neural network, constructing an effective neural network structure is perhaps the single most challenging aspect of designing an ANN. This is due to the strong dependence of a neural network's performance on its structure. Until recently the structure of a neural network was defined by intuition or based on empirical suggestions. As far as the number of hidden layers is concerned, a theoretical result by Hornik et al., stated in [2] as '..a feed forward neural network with one layer is enough to approximate any continuous non linear function arbitrarily well on compact interval, provided that a sufficient hidden neurons are available', may have had an influence on this way of thinking.
In recent years, the Particle Swarm Optimisation (PSO) algorithm, which is a simple, easy to implement but highly effective evolutionary algorithm, has also been used for the purpose of ANN evolution. To the best of our knowledge, PSO has not thus far been used to evolve a full neural network structure, i.e., both the number of hidden layers and the number of nodes in each hidden layer, presumably due to the theoretical result mentioned earlier. However, in our research we show that it is indeed possible to arrive at an adaptively optimized number of hidden layers for the neural network, which also yields improved classification results. As such, this research has sought an optimal structure for an ANN by applying the PSO algorithm to a network used in a particular problem domain.
The paper is organized as follows: section II gives a brief overview of feed-forward neural networks and Particle Swarm Optimisation, section III covers related work, section IV discusses the design and implementation aspects, section V presents the results, and section VI gives the conclusion and the future work that can be carried out on the optimisation approaches.

OVERVIEW OF ANN AND PSO
The ANNs considered within this research are Multilayer Feed-Forward Neural Networks and the given sample problems are solved through supervised learning using back propagation.

Importance of the Architecture of an ANN
The architectural/topological design of the ANN has become one of the most important tasks in ANN research and application. It is known that the architecture of an ANN has a significant impact on a network's information processing capabilities. Given a learning task, an ANN with only a few connections and linear nodes may not be able to perform the task at all due to its limited capability, while an ANN with a large number of connections and nonlinear nodes may overfit noise in the training data and fail to generalize well [1]. Up to now, architecture design is still very much a human expert's job. It depends heavily on the expert's experience and a tedious trial-and-error process. Even though ANNs are easy to construct, finding a good ANN structure is a very time-consuming process [2]. As there are no fixed rules for determining the ANN structure or its parameter values, a large number of ANNs may have to be constructed with different structures and parameters before an acceptable model is found. Against this background, a logical next step is the exploration of more powerful techniques for efficiently searching the space of network architectures [3].

PSO
Particle Swarm Optimisation (PSO) is a population-based stochastic optimisation technique developed by James Kennedy and Russell Eberhart in 1995, inspired by the social behavior of bird flocking and fish schooling. PSO introduces a method for the optimisation of continuous nonlinear functions [4], [5]. This algorithm is simple in concept, computationally efficient and effective on a variety of problems.
PSO is initialized with a group of random particles (solutions) and then searches for optima by updating generations. In every iteration, each particle is updated by following two "best" values:
1. The personal best solution (fitness) it has achieved so far (measured using a fitness function). This value is called pbest.
2. The best value obtained so far by any particle in the population. This best value is a global best and is called gbest.
In addition, when a particle takes only part of the population as its topological neighbors, the best value within that neighborhood is a local best and is called lbest.
After finding the above values, the particle updates its velocity and position with the following equations (1.1) and (1.2) [4].

v[t+1] = v[t] + c1*rand()*(pbest[t] - position[t]) + c2*rand()*(gbest[t] - position[t])    (1.1)
position[t+1] = position[t] + v[t+1]    (1.2)

Here v[t+1] is the particle velocity and position[t] is the current particle (solution). pbest[t] and gbest[t] are defined as stated before. rand() is a random number in (0,1). c1 and c2 are learning factors (usually c1 = c2 = 2). The PSO algorithm [5] can be implemented by incorporating the above equations. The swarm size is a critical parameter: too few particles might cause the algorithm to become stuck in local minima, while too many particles will slow down the algorithm. The optimal number of particles per swarm will also depend on the function being optimised [6].
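The velocity and position updates of equations (1.1) and (1.2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pso_step` and its argument layout are assumptions made here, not part of the original implementation, and each dimension draws its own random numbers.

```python
import random


def pso_step(position, velocity, pbest, gbest, c1=2.0, c2=2.0, rng=random):
    """Apply equations (1.1) and (1.2) to one particle, dimension by dimension.

    All four sequences must have the same length (one entry per dimension).
    The name and signature are hypothetical, chosen for illustration only.
    """
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        # Equation (1.1): pull towards the personal best and the global best.
        v_next = v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        new_velocity.append(v_next)
        # Equation (1.2): move the particle by its new velocity.
        new_position.append(x + v_next)
    return new_position, new_velocity


if __name__ == "__main__":
    pos, vel = pso_step([2.0, 5.0], [0.0, 0.0], [3.0, 4.0], [6.0, 1.0])
    print(pos, vel)
```

When a particle already sits on both its personal best and the global best, both attraction terms vanish and the particle simply keeps drifting with its previous velocity.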

Advantages of the PSO approach
The considerable adaptability of PSO to variations and hybrids is seen as a strength over other robust evolutionary optimisation mechanisms, such as Genetic Algorithms (GA). Normally, a stochastic hill-climber risks getting stuck at local maxima, but the stochastic exploration and communication of the swarm overcomes this [7]. The interaction of the particles in the swarm creates a very good balance between straying off course and staying close to the optimal solution.
The PSO algorithm is easy to implement because it is expressed in very few lines of code, and requires only the specification of the problem and a few parameters in order to solve it [4]. Another advantage is that PSO takes real numbers as particles, eliminating the need for a special encoding scheme or special genetic operators. Compared with other evolutionary algorithms such as GA, the PSO algorithm possesses attractive properties such as memory and constructive cooperation between individuals. All particles in a PSO population carry memory (in the form of the personal best value each has reached so far), whereas in a GA, if an individual is not selected, the information contained by that individual is lost. Because there are no selection and crossover operations in PSO, each individual in an original population has a corresponding partner in a new population. PSO can therefore avoid, to some extent, the premature convergence and stagnation seen in GAs [9].
The cooperative approach followed by PSO is seen as its biggest advantage over the competitive approach taken by GAs: in cooperative situations others depend on you to succeed, whereas in competitive situations others hope to see you fail. PSO is thus a cooperative approach to optimisation rather than an evolutionary approach which kills off unsuccessful members of the search team. It is in the collective sharing of knowledge that solutions are found.

ANN weight training using PSO
Adjusting weights to train a feed-forward multilayer ANN has been one of the earliest applications of PSO. According to Kennedy and Eberhart, the developers of the PSO algorithm, a particle swarm optimizer could train NN weights as effectively as the usual error backpropagation method [4]. One of their first experiments involved training weights for a three-layer ANN solving the exclusive-or (XOR) problem. They have also used a particle swarm optimizer to train a neural network to classify the Fisher Iris Data Set [10]. Intriguing informal indications are that the trained weights found by particle swarms sometimes generalize from a training set to a test set better than solutions found by the gradient descent method.
Gudise and Venayagamoorthy [8] have shown that feed-forward neural network weights converge faster with PSO than with the back propagation algorithm. In order to compare the training capabilities of back propagation and the PSO algorithm, a non-linear quadratic equation, y = 2x^2 + 1, with data points (patterns) in the range (-1, 1), was presented to the feed-forward neural network. The number of computations required by each algorithm in these experiments shows that PSO requires fewer iterations than back propagation to achieve the same error goal. Thus, PSO is better for applications that require fast learning algorithms. An important observation is that when the training points are fewer, the ANN learns the nonlinear function with six times fewer computations with PSO than with back propagation. Moreover, the success of back propagation depends on the choice of a bias value, which is not the case with PSO. It is also stated that the concept of PSO can be incorporated into the back propagation algorithm to improve its global convergence rate. More recent work in this regard is in [18], [19].

Architecture evolution together with weight training of ANNs
Direct application of PSO to evolve the structure of an ANN has been done by Zhang, Shao and Li [9]. Both the architecture and the weights of ANNs are adaptively adjusted according to the quality of the neural network. Recent similar work is also in [16], [17].

ANN Weight Initialization
Apart from complete weight training, PSO has also been used to initialize the weights of ANNs. Van den Bergh [11] has shown that training performance can be improved significantly by using PSO to initialize the weights, rather than random initialization.
He has stated that since the weights in an ANN serve as a starting position in error space, from where the optimisation algorithms proceed to find a minimum in the error space, it is clear that the precise starting position can affect the speed and accuracy with which the algorithm will find the minimum. By means of two case studies, namely the Ionosphere Classification Problem [10] and the Henon Curve problem, it has been shown that using PSO to initialize weights reduces the total time needed to train Multi-Layer Perceptron networks. However, the paper also mentions that even though PSO can be used to train Multi-Layer Perceptron networks to completion, it will seldom be quicker than a mix of PSO and gradient-based optimisation techniques.

Other Adaptive Techniques
Eberhart, one of the creators of the PSO algorithm, and Xiaohui have evolved not only the network weights but also the slopes of the sigmoidal transfer functions of hidden and output processing elements using PSO [12]. The method is general, and can be applied to other transfer functions as well. Flexibility is gained by allowing the slopes of the transfer function to be positive or negative. A change in sign for the slope is equivalent to a change in signs of all input weights. Since the PSO process is continuous, neural network evolution is also continuous. No sudden discontinuities exist such as those that plague other evolutionary approaches.

DESIGN AND IMPLEMENTATION
Initially an association was made between the parameters of the PSO and the ANN, in order to construct an algorithm which would evolve the architecture of the ANN. Since this research involves two parameters to be optimized in an ANN, namely the number of hidden layers and the number of hidden nodes in each layer, these two parameters were mapped to appropriate variables of the PSO algorithm.

Association between PSO and ANN
The mapping defines a 1:1 relationship between the position variable of a particle in the PSO swarm and the number of hidden nodes in a layer of an ANN. The number of hidden nodes of each hidden layer is therefore evolved indirectly through the velocity parameter (v) of the PSO algorithm. The number of dimensions (the number of times the PSO equations should be iterated) was associated with the number of hidden layers in each network. Thus, when executing the loop with the PSO equations, the algorithm iterates through each hidden layer of a network, optimizing the number of hidden nodes in each layer. The global best value reflects the optimum number of hidden nodes for an optimum number of hidden layers.

Optimisation Approaches
The 'Global Best' Approach
In this method the position matrix values (number of hidden nodes of each hidden layer, in each network) were randomly initialized for a population of 30 particles (30 networks). This initialization was done subject to the constraints of the minimum and maximum number of hidden layers allowed in one network (minimum = 1, maximum = 5) and the number of particles in a population. Since the random generation of position variables for each network allows a value to be zero, a cleaning process was essential before proceeding with the evolution. After the initialization of the number of hidden nodes in each network, this cleaning process verifies that no network has zero hidden nodes (which means that there is no hidden layer) in the middle of the network. If a network has such an initial configuration, the cleaning process also removes the rest of its hidden layers (because it is infeasible to have a network which has no hidden nodes in an earlier hidden layer and has hidden nodes in later hidden layers). The cleaning process therefore results in a population whose networks have different numbers of hidden layers.
These networks are then trained and their performance is evaluated using the classification accuracy percentage of the ANN. The global best value of the population is defined according to the highest accuracy achieved by a network. The global best variable ('gbest', which is similar to an array) contains the number of hidden nodes in each layer of the ANN which has given the best performance so far. The classification accuracy percentage is then checked to evaluate whether the required performance has been reached by any network in the population. If so, the program is terminated. If not, the PSO equations are applied to the parameters of the ANN, and new values are obtained for the number of hidden nodes in each layer. This evolution of each network considers its personal best performance and the global best performance, where the latter is the best performance reached so far by any network in the whole population. This process can also give rise to the cancellation of hidden layers in the middle of a network, so the cleaning process is carried out again. The above process then keeps iterating until the required performance is reached by any network.
The most important aspect of this method of evolution is that the one instance which has obtained the best performance in the whole population, over all executed iterations, is kept as a global measurement which directly influences the evolution of all other networks in the population. This clearly demonstrates the cooperative approach followed by the PSO algorithm. Fig 2 illustrates the global best approach using a flow chart.
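The evaluate, update and clean cycle described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' Matlab implementation: `evaluate` stands for "train the network and return its classification accuracy percentage", positions are rounded to integers and clamped to the range 0 to 10, and a network whose first layer becomes empty is reset to a single node so that at least one hidden layer remains. All function names are hypothetical.

```python
import random

MAX_LAYERS, MAX_NODES = 5, 10  # limits quoted in the paper


def clean(position):
    """Zero every layer after the first empty one (the 'cleaning process')."""
    out, alive = [], True
    for n in position:
        alive = alive and n > 0
        out.append(n if alive else 0)
    if out[0] == 0:
        out[0] = 1  # assumption: keep at least one hidden layer
    return out


def global_best_search(evaluate, swarm_size=30, target=100.0, max_iters=20,
                       c1=2.0, c2=2.0, seed=0):
    """Evolve hidden-layer sizes with a single global best shared by the swarm."""
    rng = random.Random(seed)
    swarm = [clean([rng.randint(0, MAX_NODES) for _ in range(MAX_LAYERS)])
             for _ in range(swarm_size)]
    velocity = [[0.0] * MAX_LAYERS for _ in range(swarm_size)]
    pbest = [list(p) for p in swarm]
    pbest_score = [float("-inf")] * swarm_size
    gbest, gbest_score = list(swarm[0]), float("-inf")

    for _ in range(max_iters):
        # Evaluate every network and refresh the personal and global bests.
        for i, pos in enumerate(swarm):
            score = evaluate([n for n in pos if n > 0])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = list(pos), score
            if score > gbest_score:
                gbest, gbest_score = list(pos), score
        if gbest_score >= target:
            break  # required performance reached
        # Apply equations (1.1) and (1.2) per layer, then clean again.
        for i, pos in enumerate(swarm):
            for d in range(MAX_LAYERS):
                velocity[i][d] += (c1 * rng.random() * (pbest[i][d] - pos[d])
                                   + c2 * rng.random() * (gbest[d] - pos[d]))
                moved = int(round(pos[d] + velocity[i][d]))
                pos[d] = max(0, min(MAX_NODES, moved))  # assumption: clamp
            swarm[i] = clean(pos)
    return [n for n in gbest if n > 0], gbest_score


if __name__ == "__main__":
    # Stand-in fitness, NOT a trained ANN: prefers 7 hidden nodes in total.
    toy = lambda layers: 100.0 - abs(sum(layers) - 7)
    print(global_best_search(toy))
```

In a real run `evaluate` would build and train a network with the given hidden-layer sizes, which dominates the cost; the PSO bookkeeping itself is negligible by comparison.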
Since it was observed that the randomly initialized population in the above method tends to consist mostly of networks belonging to one class (e.g., networks with 5 hidden layers), it was decided to create a uniform population (i.e., a similar number of networks from each class) in the first stage of the algorithm. The rest of the algorithm was carried out in the same order.
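One way to build such a uniform initial population is to cycle through the classes while drawing the layer sizes at random. The sketch below is an assumption about how this could be done; the function name `uniform_population` and the choice of drawing each layer size from 1 to 10 are illustrative and not taken from the original implementation.

```python
import random


def uniform_population(swarm_size=30, max_layers=5, max_nodes=10, seed=0):
    """Create networks so that every class (depth) is equally represented."""
    rng = random.Random(seed)
    population = []
    for i in range(swarm_size):
        depth = i % max_layers + 1  # cycle through 1, 2, ..., max_layers
        # Every layer gets at least one node, so no cleaning is needed here.
        population.append([rng.randint(1, max_nodes) for _ in range(depth)])
    return population


if __name__ == "__main__":
    for network in uniform_population()[:5]:
        print(network)
```

With 30 particles and 5 classes this gives 6 networks per class.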

The 'Local Best' Approach
In this method, the main difference from the above method was that instead of a global best value for the whole population, local best values were taken into consideration within the PSO algorithm. A local best value was defined for each class (e.g., 5 local bests corresponding to the networks belonging to the 5 classes: 1 hidden layer networks, 2 hidden layer networks, etc.). Therefore the evolution of each network was done by considering its personal best performance and the local best performance values. This gives rise to the modification of equation 1.1 as follows.

v[t+1] = v[t] + c1*rand()*(pbest[t] - position[t]) + c2*rand()*(lbest[t] - position[t])    (1.3)
Similar to the earlier situation, pbest gives the best configuration ever to be reached by each specific network while the lbest gives the best configuration within a class (number of hidden nodes in each layer of the network which has given the best performance for a given class of networks).

For each class, a local best was defined by comparing the performances of the members of that class. Each network in a class then tries to reach the local best corresponding to its class. A network therefore never changes its number of hidden layers during the execution of the algorithm, but only changes the number of hidden nodes in its predefined hidden layers. For this reason, a cleaning process was not needed within this approach.
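The per-class bookkeeping and the update of equation (1.3) can be sketched as below. The names `local_bests` and `local_best_step` are hypothetical, and the rounding and clamping of every layer to between 1 and 10 nodes is an assumption made here so that no layer can vanish, in line with the statement that the number of hidden layers stays fixed.

```python
import random


def local_bests(population, scores):
    """Map each class (number of hidden layers) to its best-scoring network."""
    best = {}
    for layers, score in zip(population, scores):
        depth = len(layers)
        if depth not in best or score > best[depth][1]:
            best[depth] = (list(layers), score)
    return {depth: layers for depth, (layers, _) in best.items()}


def local_best_step(position, velocity, pbest, lbest, c1=2.0, c2=2.0,
                    max_nodes=10, rng=random):
    """Apply equation (1.3) and the position update to one network."""
    new_position, new_velocity = [], []
    for x, v, pb, lb in zip(position, velocity, pbest, lbest):
        v_next = v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (lb - x)
        new_velocity.append(v_next)
        # Assumption: keep 1..max_nodes nodes so the depth never changes.
        new_position.append(max(1, min(max_nodes, int(round(x + v_next)))))
    return new_position, new_velocity


if __name__ == "__main__":
    population = [[3], [5], [2, 4], [6, 1]]
    scores = [90.0, 95.0, 80.0, 70.0]  # made-up accuracies for illustration
    lbest = local_bests(population, scores)
    print(lbest)
    print(local_best_step([6, 1], [0.0, 0.0], [6, 1], lbest[2]))
```

Because each network is only compared with networks of the same depth, `lbest` always has the same number of dimensions as the particle it attracts.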

Implementation Procedure
The two approaches designed above were implemented in Matlab and each method was applied on the selected application case studies.

Fishers' Iris Data Set Classification
This is a multivariate data set introduced by Sir Ronald Aylmer Fisher (1936) as an example of discriminant analysis [10]. It consists of 50 samples from each of three species of Iris flowers. Initially 75 sets of inputs (half of the data set) from the Iris data set were fed into all networks in the population as training data. Then each network was simulated using the whole data set. Based on the classification, the performance measure of classification accuracy percentage was introduced into the program. The global best of the population and the personal best of each particle were identified using this performance measure.
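The paper does not spell out how the classification accuracy percentage is computed; a natural reading, assumed here, is the share of correctly classified samples multiplied by 100. The helper name `accuracy_percentage` is hypothetical.

```python
def accuracy_percentage(predicted, actual):
    """Percentage of samples whose predicted class equals the actual class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


if __name__ == "__main__":
    # Illustrative labels only: 150 samples with 4 misclassified.
    actual = [0] * 150
    predicted = [0] * 146 + [1] * 4
    print(round(accuracy_percentage(predicted, actual), 2))
```

Under this reading, 4 misclassified samples out of the 150 in the Iris data set correspond to an accuracy of about 97.33%.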

Ionosphere Data Classification
This deals with the classification of radar returns from the ionosphere [10]. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass through the ionosphere. There are 34 continuous input variables in each instance, and a total of 351 instances should be classified as either 'good' or 'bad' radar return patterns. Since this data set does not have an equal number of instances belonging to each of the two classes (there are 225 'good' and 126 'bad' radar return patterns), the first 200 instances were used as the training set (the 'good' and 'bad' instances alternate). This data selection method was followed because past research work which has used this data set in ANN classification experiments has used the same method [13].

RESULTS AND EVALUATION
Experimental results were obtained for each of the case studies with the parameters set as: swarm (population) size = 30; c1 = c2 = 2.0; maximum allowed number of hidden layers = 5; maximum allowed number of nodes in hidden layer = 10 number of hidden layers. In this exercise, the classification accuracy refers to the validation set only.

'Global Best' approach
The highest achievable classification accuracy using this approach was 97.33% (this means that at least 4 samples were misclassified during the classification process). An instance in Table 1 refers to one complete optimisation cycle, which concludes by giving the maximum accuracy. All instances in Table 1 have obtained a classification accuracy of 97.33%. This might be because the Iris data set is considered to be a simple classification example, as the three classes are (almost) linearly separable [14]. This experimental result can be compared with the results obtained by Van den Bergh and Engelbrecht [14], who used the Iris data classification case study in experiments that attempt to improve the performance of the basic PSO by partitioning the input vector into several sub-vectors. They heuristically chose an architecture with 1 hidden layer and 3 hidden nodes, and with this topology they achieved only 94% classification accuracy. The 'Global Best' approach has achieved an accuracy of 97.33% for an ANN architecture with 2 hidden layers.
In another research work by Eldracher [15], a maximum accuracy of only 93% was obtained for the Iris data set, using a heuristically chosen, very simple network architecture with no hidden layers and a sigmoid transfer function. He further suggested that the classification performance could be increased if a hidden layer were added to the existing network. The results of the 'Global Best' approach clearly indicate that an ANN with 2 hidden layers can obtain a higher classification accuracy.
These results indicate that there is a range for the optimal number of hidden nodes within a network. An accuracy of 97.33% has been achieved in networks whose total number of hidden nodes lies in the range 7-17 (irrespective of the total number of hidden layers). According to Tan [2], "a Multi-Layer Perceptron network that uses any of a wide variety of continuous nonlinear hidden-layer transfer functions requires just one hidden layer with 'an arbitrarily large number of hidden neurons' to achieve the 'universal approximation' property". Therefore the above mentioned range might be very helpful when deciding a value for this 'arbitrarily large number of hidden neurons'.
In order to check the validity of the above statement, a single hidden layered ANN was constructed, and the classification accuracy and total execution time were recorded while varying the total number of hidden nodes in the single hidden layer. When considering the classification accuracy level in Table 1, even though all 10 instances have obtained an accuracy level of 97.33%, the ANN consisting of two hidden layers with 3 and 4 hidden nodes respectively has the lowest number of weights to be trained within this classification problem. This ANN has a total weight density (connection density) of 36, i.e., 12 weights from the input layer to the first hidden layer, 12 weights from the first hidden layer to the second hidden layer, and 12 weights from the second hidden layer to the output layer. This is clearly depicted in Fig 5. If only the total number of weights is considered as the deciding factor in the success of an ANN's performance, then it can be deduced that a single hidden layered ANN which has a similar number of weights (connections) might obtain the same accuracy level. Therefore, the single hidden layered ANN with 5 hidden nodes (= 35 weights) should be able to obtain the same accuracy level as that of the ANN shown in Fig 5. But the accuracy levels shown in Fig 4 clearly prove that the above deduction is false, because the single hidden layered ANN with 5 hidden nodes has never achieved a classification accuracy of 97.33%. Therefore it is clear that apart from the weights, the number of hidden layers in an ANN also has a direct impact on the performance of the ANN.
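The weight counts quoted above follow from multiplying the sizes of adjacent layers. A short sketch of this arithmetic, assuming a fully connected network with 4 inputs and 3 outputs for the Iris problem and ignoring bias weights (which is what the 12 + 12 + 12 breakdown implies), is given below; the function name `connection_count` is hypothetical.

```python
def connection_count(n_inputs, hidden_layers, n_outputs):
    """Count the weights between adjacent layers of a fully connected network.

    Bias weights are excluded, matching the breakdown given in the text.
    """
    sizes = [n_inputs, *hidden_layers, n_outputs]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))


if __name__ == "__main__":
    print(connection_count(4, [3, 4], 3))  # 4*3 + 3*4 + 4*3 = 36
    print(connection_count(4, [5], 3))     # 4*5 + 5*3 = 35
```

The same function gives 112 weights for a single hidden layer with 16 nodes (4*16 + 16*3).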
Instead of an ANN with 5 hidden nodes in one hidden layer, the ANN with 16 hidden nodes in a single hidden layer has obtained a classification accuracy similar to that of the ANN shown in Fig 5. Even though the single hidden layered ANN with 16 hidden nodes has a higher weight density (112 weights), the larger number of weights that need to be trained does not significantly increase the time taken to obtain its output. From this observation, it can be deduced that instead of a two hidden layered ANN with 3 and 4 hidden nodes respectively, a one hidden layered ANN with 16 hidden nodes (which can be considered as the 'arbitrarily large number of hidden neurons' stated by Tan [2]) can obtain a similar classification accuracy.
The results obtained from the 'Local Best' approach (Table 2) are not consistent with the results obtained in the 'Global Best' approach. This could be due to the fact that the population is not subject to a change in the number of hidden layers throughout the execution lifetime. Even the result that 4 or 5 hidden layers also give an accuracy of 97.33% might be directly related to this fact (since the networks do not change their number of hidden layers but only change the number of nodes in a layer, each has the opportunity of trying out a large number of different combinations for the total number of hidden nodes, within a predetermined number of hidden layers).
In previous research which used PSO to initialize ANN weights [11], a maximum classification accuracy of 95.41% was achieved for the whole data set (training data + test data) in the ionosphere classification problem, by an ANN with 9 hidden units (hidden nodes) and weights initialized using the PSO concept (the number of hidden layers is not specifically mentioned). On the other hand, an ANN with 7 hidden nodes whose weights were randomly initialized achieved a classification accuracy of only 94.43% [11]. But according to the experimental results shown in Table 3, a maximum classification accuracy of 97.72% has been obtained by an ANN whose structure was evolved using the 'Global Best' approach, which implements the PSO algorithm to evolve the ANN structure. This clearly shows the effectiveness of the 'Global Best' approach. Table 4 gives the results obtained from the 'Global Best' approach for the test data set only of the Ionosphere data classification problem.

'Global Best' approach
According to Table 4, a maximum classification accuracy of 94.70% was obtained for the test data set only. According to the reference work [13], the ionosphere test data set classification carried out by a Multilayer Feed-Forward Network using back propagation obtained an average of 96% accuracy on the test instances. Even though it mentions that back propagation was tested with several different numbers of hidden units (between 0 and 15), the total number of hidden layers is not stated.

'Local Best' approach
The classification data given in Table 5 confirm the results given in Table 3 by showing that ANNs with two hidden layers or four hidden layers have a tendency to give a maximum accuracy level (94.87% is the highest accuracy achieved in the above approach). As in the 'Global Best' approach, ANNs with a single hidden layer or with 5 hidden layers have never succeeded in achieving a maximum accuracy level.

CONCLUSION
The results obtained from the 'Global Best' and the 'Local Best' optimisation approaches suggest that the 'Global Best' approach for adaptive optimisation of ANNs is more successful in obtaining higher accuracy levels. When considering the application case studies, the 'Global Best' approach has achieved a maximum classification accuracy of 97.33% for the Iris Data classification, and 97.72% accuracy on the full data set of the Ionosphere Data classification, while achieving a classification accuracy of 94.70% on the test data set of the same case study. When compared with previous research work which has been carried out on the same case studies, the above mentioned accuracy values prove to be better than nearly all of the past results. Therefore it can be concluded that the 'Global Best' approach has the potential to obtain a structurally optimized neural network.
In this research, evolution of only the number of hidden layers and hidden nodes has been considered with regard to the adaptive optimisation of an ANN. But it is well known that these are not the only parameters that can be optimized in a given ANN. Therefore in the future, this research work can include the adaptive optimisation of other ANN parameters such as the learning rate, learning momentum and activation functions, in order to realize the goal of achieving a completely optimized ANN.

Figure 1: Mapping between PSO and ANN

Figure 2: Flow chart for 'Global Best' approach

Fig 3 illustrates the local best approach.

Figure 3: Flow chart for 'Local Best' approach

Figure 4: Average accuracy of ANNs with varying number of hidden nodes

... layer, has given the maximum average accuracy of 95.47%. But when the number of hidden nodes was further increased, the classification accuracy level began to decrease rapidly.

Figure 5: Connection (weight) density of the two hidden layered ANN

Table 1: Results of Iris data set classification by Global Best approach

Table 2: Results of Iris data set classification by Local Best approach

Table 3: Results of Ionosphere data (full set) classification by Global Best approach

Table 4: Results of Ionosphere data (test set) classification by Global Best approach

Table 5: Results of Ionosphere data (full set) classification by Local Best approach