This is a learning algorithm for recurrent networks that are updated
in discrete time steps (non-fixpoint networks). These networks may
contain any number of feedback loops in their connectivity graph. The
only restriction in this implementation is that there may be no
connections between input units. The gradients of the weights in the
recurrent network are approximated using a feedforward network with a
fixed number of layers. Each layer **t** contains all activations
of the recurrent network at time step **t**. The highest layer
contains the most recent activations at time **t=0**. These activations
are calculated synchronously, using only the activations at **t=1** in
the layer below. The weight matrices between successive layers are
all identical. To calculate an exact gradient for an input pattern
sequence of length **T**, the feedforward network needs **T+1** layers if
an output pattern should be generated after the last pattern of the
input sequence. This transformation of a recurrent network into an
equivalent feedforward network was first described in
[MP69], p. 145 and the application of backpropagation
learning to these networks was introduced in [RHW86].
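The unrolling described above can be sketched in a few lines. This is an illustrative sketch, not code from the SNNS sources; the function and weight names (`unrolled_forward`, `W_in`, `W_rec`) and the tanh activation are assumptions chosen for the example.

```python
import numpy as np

def unrolled_forward(W_in, W_rec, inputs):
    # Layer t of the virtual feedforward net holds the full activation
    # vector of the recurrent net at time step t; the same weight
    # matrices are reused between all successive layers.
    h = np.zeros(W_rec.shape[0])           # defined start activations
    layers = [h]
    for x in inputs:                        # one extra layer per step
        h = np.tanh(W_in @ x + W_rec @ h)   # synchronous update
        layers.append(h)
    return layers                           # T+1 layers for T patterns
```

For an input sequence of length `T`, the returned list has `T + 1` entries, matching the number of layers the equivalent feedforward network needs.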

To avoid deep networks for long sequences, it is possible to use only
a fixed number of layers to store the activations back in time. This
method of *truncated backpropagation through time* is described in
[Zip90] and is used here. An improved feature in this
implementation is the combination with the quickprop algorithm by
[Fah88] for weight adaptation. The number of additional copies
of network activations is controlled by the parameter `backstep`.
Since the setting of `backstep` virtually generates a hierarchical
network with `backstep` **+ 1** layers, and error information is
diminished very rapidly during backpropagation in deep networks, the
number of additional activation copies is limited by `backstep`.
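The truncated gradient computation can be sketched as follows, assuming tanh units. This is only an illustration of the idea, not the SNNS implementation; all names are hypothetical, `stored_h` is the list of stored activation copies (newest last), and `delta_out` is the error signal at the newest step.

```python
import numpy as np

def tbptt_gradient(W_rec, stored_h, delta_out, backstep):
    # Backpropagate the output error through at most `backstep`
    # stored activation copies and sum the per-copy gradients.
    grad = np.zeros_like(W_rec)
    delta = delta_out
    for k in range(1, min(backstep, len(stored_h) - 1) + 1):
        h_new = stored_h[-k]        # activation at the newer step
        h_prev = stored_h[-k - 1]   # activation one step earlier
        delta = delta * (1.0 - h_new ** 2)   # tanh derivative
        grad += np.outer(delta, h_prev)      # contribution of this copy
        delta = W_rec.T @ delta              # propagate one layer down
    return grad
```

Everything further back than `backstep` copies is simply ignored, which is exactly the approximation that keeps the virtual network shallow.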

There are three versions of backpropagation through time available:

- **BPTT**: Backpropagation through time with online-update.

  The gradient for each weight is summed over `backstep` copies between
  successive layers, and the weights are adapted after each pattern
  using the formula for backpropagation with momentum term. The
  momentum term uses the weight change during the previous pattern.
  Using small learning rates `eta`, `BPTT` is especially useful to
  start adaptation with a large number of patterns, since the weights
  are updated much more frequently than in batch-update.

- **BBPTT**: Batch backpropagation through time.

  The gradient for each weight is calculated for each pattern as in
  `BPTT` and then averaged over the whole training set. The momentum
  term uses update information closer to the true gradient than in
  `BPTT`.

- **QPTT**: Quickprop through time.

  The gradient in quickprop through time is calculated as in `BBPTT`,
  but the weights are adapted using the substantially more efficient
  quickprop-update rule.
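For reference, the quickprop-update rule of [Fah88] for a single weight can be sketched as below. The parameter defaults (`eta`, the growth factor `mu`) and the simplified handling of the first step are assumptions for this example, not the exact SNNS behavior.

```python
def quickprop_step(grad, prev_grad, prev_dw, eta=0.1, mu=1.75):
    # Quickprop fits a parabola through the current and previous
    # error slope and jumps toward its minimum:
    #   dw = grad / (prev_grad - grad) * prev_dw
    # The step is capped so it never grows by more than a factor mu.
    if prev_dw == 0.0:
        return -eta * grad          # plain gradient step to get started
    dw = grad / (prev_grad - grad) * prev_dw
    if abs(dw) > mu * abs(prev_dw):
        dw = mu * abs(prev_dw) * (1.0 if dw > 0 else -1.0)
    return dw
```

Because the parabolic jump uses second-order information gathered for free from two successive gradients, quickprop typically converges in far fewer epochs than plain gradient descent.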

A recurrent network has to start processing a sequence of patterns
with defined activations. All activities in the network may be set to
zero by applying an input pattern containing only zero values. If such
all-zero patterns are part of normal input patterns, an extra input
unit has to be added for reset control. If this reset unit is set to
**1**, the network is in the free running mode. If the reset unit
*and* all normal input units are set to **0**, all activations in the
network are set to **0** and all stored activations are cleared as well.

The processing of an input pattern, given the current set of non-input activations, is performed as follows:

- The input pattern is copied to the input units to become a
subset of the existing unit activations of the whole net.
- If the input pattern contains only zero activations, all activations
and all stored activations are set to **0**.
- All activations are calculated synchronously, using
the activation function and the activation values of the previous step.
- During learning, an output pattern is always compared with
the output subset of the
*new* activations.

Therefore there is exactly *one* synchronous update step between
an input and an output pattern with the same pattern number.
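The per-pattern processing listed above can be condensed into a small sketch. It simplifies the reset test to "the whole input pattern is zero" and assumes tanh units; all names are illustrative, not taken from the SNNS sources.

```python
import numpy as np

def process_pattern(x, h, W_in, W_rec):
    # An all-zero input pattern acts as the reset pattern: every
    # activation (and, in the full algorithm, every stored copy)
    # is cleared before processing continues.
    if not np.any(x):
        return np.zeros_like(h)
    # Otherwise: exactly one synchronous update step per pattern.
    return np.tanh(W_in @ x + W_rec @ h)
```

During learning, the target pattern with the same pattern number would be compared against the vector this single update step returns.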

If an input pattern has to be processed with more than one network
update, there has to be a delay between corresponding input and output
patterns. If an output pattern is the **n**-th pattern after an
input pattern, the input pattern has been processed in **n+1**
update steps by the network. These **n+1** steps may correspond to **n**
hidden layers processing the pattern or a recurrent processing path
through the network with **n+1** steps. Because of this pipelined
processing of a pattern sequence, the number of hidden layers that may
develop during training in a fully recurrent network is influenced by
the delay between corresponding input and output patterns. If the
network has a defined hierarchical topology without shortcut
connections between **n** different hidden layers, an output pattern
should be the **n**-th pattern after its corresponding input pattern in
the pattern file.

An example illustrating this relation is given with the delayed XOR
network in the network file `xor-rec.net` and the pattern files
`xor-rec1.pat` and `xor-rec2.pat`. With the patterns
`xor-rec1.pat`, the task is to compute the XOR function of the
previous input pattern. In `xor-rec2.pat`, there is a delay of
2 patterns for the result of the XOR of the input pattern. Using a
fixed network topology with shortcut connections, the BPTT learning
algorithm develops solutions with a different number of processing
steps using the shortcut connections from the first hidden layer to
the output layer to solve the task in `xor-rec1.pat`. To map the
patterns in `xor-rec2.pat` the result is first calculated in the
second hidden layer and copied from there to the output layer during
the next update step.

The update function ` BPTT-Order` performs the synchronous update
of the network and detects reset patterns. If a network is tested
using the button in the control panel, the internal
activations and the output activations of the output units are first
overwritten with the values in the target pattern, depending on the
setting of the button. To provide correct activations on
feedback connections leading out of the output units in the following
network update, all output activations are copied to the units' initial
activation values `i_act` after each network update and are
copied back from `i_act` to `out` before each update. The
non-input activation values may therefore be influenced before a
network update by changing the initial activation values `i_act`.
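The copy scheme for output units can be sketched as follows. This is a hypothetical illustration of the mechanism, not SNNS code; `update_fn` stands for one synchronous network update, and `out` / `i_act` mirror the unit fields described above.

```python
import numpy as np

def bptt_order_step(update_fn, out, i_act, output_units):
    # Before the update: restore output activations from i_act, so
    # feedback connections leaving the output units see the values
    # of the previous step.
    out = out.copy()
    out[output_units] = i_act[output_units]      # i_act -> out
    new_out = update_fn(out)                     # one synchronous update
    # After the update: save the new output activations back to i_act.
    new_i_act = i_act.copy()
    new_i_act[output_units] = new_out[output_units]  # out -> i_act
    return new_out, new_i_act
```

Overwriting an entry of `i_act` between two calls is exactly the handle the text mentions for influencing non-input activations before an update.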

If the network has to be reset by stepping over a reset pattern with the button, keep in mind that after clicking it, the pattern number is increased first, the new input pattern is then copied into the input layer, and only then is the update function called. To reset the network, the current pattern must therefore be set to the pattern directly preceding the reset pattern.

Niels.Mache@informatik.uni-stuttgart.de

Tue Nov 28 10:30:44 MET 1995