Of the three procedures named above, ` RBF_Weights` is the most
comprehensive one. It performs all necessary initialization tasks
(setting link weights and bias) for a fully connected three-layer
feedforward network (without shortcut connections) in one single
step. The choice of centers (i.e. the link weights between
input and hidden layer) is rather simple: the centers are selected
evenly from the loaded teaching patterns and assigned to the links
of the hidden neurons. The selection process assigns the first
teaching pattern to the first hidden unit and the last pattern to the
last hidden unit. The remaining hidden units receive centers which are
evenly spaced within the set of teaching patterns. If, for example, 13
teaching patterns are loaded and the hidden layer consists of 5
neurons, then the patterns with numbers 1, 4, 7, 10 and 13 are
selected as centers.
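The even selection described above can be sketched as follows (a minimal illustration, not the SNNS source; `select_centers` is a hypothetical helper):

```python
def select_centers(patterns, n_hidden):
    # Pick n_hidden patterns evenly: the first pattern goes to the
    # first hidden unit, the last pattern to the last hidden unit,
    # and the remaining units get evenly spaced patterns in between.
    n = len(patterns)
    if n_hidden == 1:
        return [patterns[0]]
    idx = [round(i * (n - 1) / (n_hidden - 1)) for i in range(n_hidden)]
    return [patterns[i] for i in idx]
```

With 13 patterns and 5 hidden units this selects patterns 1, 4, 7, 10 and 13, as in the example above.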

Before a selected teaching pattern is distributed among the
corresponding link weights, it can be modified slightly with a random
number. For this purpose, an initialization parameter
(*deviation*, parameter 5) is set, which determines
the maximum percentage of deviation allowed to occur randomly. To
calculate the deviation, an inverse tangent function is used to
approximate a normal distribution, so that small deviations are more
probable than large ones. Setting the parameter *deviation*
to 1.0 results in a maximum deviation of 100%. If the deviation is
set to 0, the centers are copied unchanged into the link weights.
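The manual does not spell out the exact formula. The following sketch uses inverse-CDF sampling through an arctangent (the CDF of a Cauchy-like bell curve), which matches the description that small deviations are more probable than large ones; the shaping constant `SHAPE` is an assumption, not taken from SNNS:

```python
import math
import random

SHAPE = 3.0  # assumed shaping constant; larger -> more mass near zero

def perturb(center, deviation, rng=random):
    # Modify each component of a center by at most +/- deviation * 100%.
    # u is uniform; tan(u * atan(SHAPE)) / SHAPE maps it to (-1, 1) with
    # a bell-shaped density (inverse of an arctangent CDF), so small
    # perturbations occur more often than large ones.
    out = []
    for c in center:
        u = rng.uniform(-1.0, 1.0)
        d = math.tan(u * math.atan(SHAPE)) / SHAPE
        out.append(c * (1.0 + deviation * d))
    return out
```

Setting `deviation` to 0 reproduces the unmodified centers, as described above.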

A small modification of the centers is recommended for the following reasons: First, the number of hidden units may exceed the number of teaching patterns. In this case it is necessary to break the symmetry that would otherwise result, since this symmetry would make the calculation of the Moore-Penrose inverse matrix impossible. Second, there may be a few anomalous patterns in the set of teaching patterns; these would cause bad initialization results if they were accidentally selected as centers. Adding a small amount of noise lowers the negative effect of such anomalous patterns. However, if an exact interpolation is to be performed, the centers must not be modified.

The next initialization step is to set the free parameter **p** of the
base function **h**, i.e. the bias of the hidden neurons. To do
this, the initialization parameter *bias (p)* is directly
copied into the bias of all hidden neurons. The proper setting of the bias
depends strongly on the base function **h** used and on the properties of
the teaching patterns. When the Gaussian function is used, it is
recommended to choose the value of the bias so that 5--10% of all
hidden neurons are activated during propagation of every single teaching
pattern. If the bias is chosen too small, almost all hidden neurons are
uniformly activated during propagation. If the bias is chosen too large,
only the hidden neuron whose center vector corresponds to the currently
applied teaching pattern is activated.
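The effect of the bias **p** on a Gaussian base function can be illustrated as follows (a sketch under the assumption h = exp(-p r^2); the function name is not from SNNS):

```python
import math

def gaussian_activation(pattern, center, bias):
    # Gaussian base function: h = exp(-p * r^2), where r is the
    # Euclidean distance between pattern and center and p is the
    # bias of the hidden neuron.
    r2 = sum((x - c) ** 2 for x, c in zip(pattern, center))
    return math.exp(-bias * r2)
```

A small bias gives a wide Gaussian, so nearly all hidden units respond to every pattern; a large bias gives a narrow Gaussian, so essentially only the unit whose center matches the pattern responds.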

Now the computationally expensive initialization of the links between the hidden and the output layer is performed. For this, the formula already presented above is applied:

    w = (H^T H + lambda I)^(-1) H^T y

The initialization parameter 3 (*smoothness*) represents the value
of lambda in this formula. The matrices have been extended to allow
the automatic computation of an additional constant value: the matrix H
receives an extra column of ones, and the weight vector **w** an extra
component **b**. If there is more than one neuron inside the output layer,
the following set of functions results, one for each output neuron:

    y_j(x) = sum_i w_ij h(||x - c_i||) + b_j
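Assuming the formula has the standard regularized least-squares form with the Moore-Penrose pseudoinverse as the unregularized special case, the computation can be sketched with NumPy (`init_output_weights` is a hypothetical helper, not the SNNS code):

```python
import numpy as np

def init_output_weights(H, Y, smoothness=0.0):
    # H: (n_patterns, n_hidden) hidden-unit activations per pattern.
    # Y: (n_patterns, n_outputs) teaching outputs.
    # A column of ones is appended so that the constant value b is
    # computed automatically together with the link weights.
    Ht = np.hstack([H, np.ones((H.shape[0], 1))])
    if smoothness == 0.0:
        # Moore-Penrose pseudoinverse solution
        W = np.linalg.pinv(Ht) @ Y
    else:
        A = Ht.T @ Ht + smoothness * np.eye(Ht.shape[1])
        W = np.linalg.solve(A, Ht.T @ Y)
    return W[:-1], W[-1]   # link weights, bias b
```

With `smoothness = 0` the teaching outputs are reproduced as exactly as the hidden-layer activations permit; a positive value trades accuracy for smoothness.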

The bias of the output neuron(s) is directly set to the calculated value
of **b**. Therefore, it is necessary to choose an
activation function for the output neurons that uses the bias of
the neurons. In the current version of SNNS, the functions
`Act_Logistic` and `Act_IdentityPlusBias` implement this feature.

The remaining two initialization parameters relate to the activation functions of the output units. The initialization procedure assumes a linear activation of the output units: the link weights are calculated so that the weighted sum of the hidden neuron activations equals the teaching output. However, if a sigmoid activation function is used, which is recommended for pattern recognition tasks, the activation function has to be taken into account during initialization. Ideally, the net input needed to produce a given teaching output would be computed with the inverse activation function, and this input value would replace the teaching output in the vector **y** during the calculation of the weights. Unfortunately, the inverse activation function is unknown in the general case.
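For the logistic function the inverse does exist; the following sketch shows what the ideal computation would look like in that special case (`logit` is an illustrative helper, not an SNNS function):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(y, eps=1e-12):
    # Inverse of the logistic activation: the net input that yields
    # output y. Clamping with eps guards against y = 0 or y = 1,
    # where the inverse diverges.
    y = min(max(y, eps), 1.0 - eps)
    return math.log(y / (1.0 - y))
```

The divergence at 0 and 1 is exactly why the teaching outputs **0** and **1** cannot be fed through the true inverse, which motivates the approximation described next.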

The first and second initialization parameters, *0_scale* and
*1_scale*, are a remedy for this dilemma. They define the two
control points of a piecewise linear function which approximates the
activation function: *0_scale* and *1_scale* give the net
inputs of the output units which are to produce the teaching outputs **0** and
**1**. If, for example, the linear activation function
`Act_IdentityPlusBias` is used, the values 0 and 1 have to be used.
When using the logistic activation function `Act_Logistic`, the
values -4 and 4 are recommended. If the bias is set to 0, these
values lead to final activations of about 0.018 (resp. 0.982). These
are comparatively good approximations of the desired teaching outputs
0 and 1. The implementation interpolates linearly between the set
values of *0_scale* and *1_scale*, so teaching
values which differ from **0** and **1** are also mapped to corresponding
input values.
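The piecewise linear mapping can be sketched as follows (illustrative only; the names are not from SNNS):

```python
import math

def target_net_input(t, scale0, scale1):
    # Linear interpolation between the two control points:
    # teaching output 0 maps to scale0, teaching output 1 to scale1;
    # other teaching values are interpolated (or extrapolated) linearly.
    return scale0 + t * (scale1 - scale0)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))
```

With the recommended values -4 and 4, a teaching output of 1 is mapped to a net input of 4, and logistic(4) is approximately 0.982, close to the desired output.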

**Figure:** Relation between teaching output, input value and logistic activation

The figure shows the activation of an output unit when the logistic
activation function is used. The scale has been chosen in
such a way that the teaching outputs **0** and **1** are mapped to the input
values **-2** and **2**.

Optimal values for *0_scale* and *1_scale* cannot
be given in general. With the logistic activation function, large
scaling values lead to good initialization results but interfere with
the subsequent training, since the logistic function is then used mainly in
its very flat parts. Small scaling values, on the other hand, lead
to worse initialization results but produce good preconditions for
additional training.

Niels.Mache@informatik.uni-stuttgart.de

Tue Nov 28 10:30:44 MET 1995