Recurrent neural network
A major limitation of the feed-forward MLP architecture is that all examples must have the same width (the same number of inputs). This limitation can be overcome using various recurrent architectures.
In this example, we’ll build a single-layer recurrent neural net. We’ll reuse the simulated toy data that we trained the MLP on, which happen to all have the same width, but generally recurrent neural nets are used on sequences of varying lengths.
nngraph
We did reasonably well using the stock nn library to build a simple MLP. However, for more complicated architectures, the nngraph package offers an additional layer of abstraction that proves quite useful. If you’re not familiar with nngraph, you may want to see my brief introduction.
RNN implementation
Here’s what our toy RNN should look like unrolled:
The components that nn provides for building such graphs are a bit more basic, and thus more flexible. The overall topology is the same, but there are a few more pieces so that the tensors and tables are manipulated in the expected way. Here’s a visualization of the same RNN, but using nn notation:
In practice, we want to abstract the length of the sequence to be any arbitrary length (up to a pre-defined maximum), so we’ll build the RNN in a loop. Here we set len=3 so that the result matches the figure:
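The original listing isn’t reproduced here, but a sketch of such a loop might look like the following. The scalar-per-timestep input, the 8-unit hidden layer (matching the Linear(9,8) modules discussed below), and the Tanh nonlinearity are assumptions made for illustration.

require 'nngraph'

local len, n_hidden = 3, 8             -- sequence length and hidden size (assumed, to match the figure)
local x  = nn.Identity()()             -- whole input sequence: batch x len
local h0 = nn.Identity()()             -- initial hidden state: batch x n_hidden
local h = h0
for t = 1, len do
  local x_t = nn.Narrow(2, t, 1)(x)                        -- t-th input column: batch x 1
  local joined = nn.JoinTable(2, 2)({x_t, h})              -- batch x (1 + n_hidden)
  h = nn.Tanh()(nn.Linear(1 + n_hidden, n_hidden)(joined)) -- next hidden state
end
local y = nn.Linear(n_hidden, 1)(h)    -- map the final hidden state to the prediction
rnn = nn.gModule({x, h0}, {y})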
10 lines of code isn’t bad for an RNN, no? For our actual implementation, we’ll need to take care of a few more details. These include:
- Setting up the network to work with mini-batches.
- Sharing (tying) parameters among RNN units.
- Getting our parameters all into one contiguous, linear vector for easy training updates.
Mini-batches
nn.Module instances can often work with either mini-batches or single examples. This is frequently handled implicitly: the module inspects the dimensions of the input tensor, and if the tensor has one more dimension than a single example would, the extra leading dimension is assumed to iterate over the examples in the batch.
However, in some cases additional explicit arguments are needed. For example, below we’ll pass an extra argument to JoinTable(2,2). This is all documented or readily apparent from the nn source.
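As a quick illustration (sizes chosen to match the RNN above), JoinTable(2, 2) joins a table of batch x 1 and batch x 8 tensors along their second dimension, leaving the batch dimension intact:

require 'nn'

-- Join the table of input tensors along dimension 2; the second argument tells
-- JoinTable how many dimensions to expect per input, so it can also distinguish
-- batched from single inputs.
local join = nn.JoinTable(2, 2)

local x_t = torch.randn(20, 1)   -- mini-batch of 20 scalar inputs
local h   = torch.randn(20, 8)   -- mini-batch of 20 hidden states

print(join:forward({x_t, h}):size())   -- 20x9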
Parameter sharing
All nn.Module instances can return a table of parameter tensors using mod:parameters(). nngraph’s nn.gModule instances also have a mod:parameters() function, but instead it returns a table of all parameter tensors for all modules in the graph.
When we inspect the parameters, we’ll see that the first two tensors look like the weight matrix and bias vector for the Linear(9,8) module. We have three such maps, explaining the first six tensors. The final two are the weight matrix and bias vector for the Linear(8,1) module mapping the terminal hidden state to the output variable, y.
th> p, gp = rnn:parameters()
th> p
{
  1 : DoubleTensor - size: 8x9
  2 : DoubleTensor - size: 8
  3 : DoubleTensor - size: 8x9
  4 : DoubleTensor - size: 8
  5 : DoubleTensor - size: 8x9
  6 : DoubleTensor - size: 8
  7 : DoubleTensor - size: 1x8
  8 : DoubleTensor - size: 1
}
Two values are returned from parameters(); the second is a table of all the gradient tensors. These also need to be shared, so that as each unit accumulates gradients during back-propagation, they all accumulate into a single set of gradient tensors.
We can tie the parameters together as follows:
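One way to do this (a sketch; the indices assume the ordering shown above, with tensors 1–2 belonging to the first Linear(9,8) copy) is to make the later copies view the same storage as the first, for both the parameters and the gradients:

local p, gp = rnn:parameters()
-- tensors 3-6 are the second and third copies of the Linear(9,8) weight and bias
for _, i in ipairs({3, 5}) do
  p[i]:set(p[1]);    gp[i]:set(gp[1])    -- tie the weight matrices
  p[i+1]:set(p[2]);  gp[i+1]:set(gp[2])  -- tie the bias vectors
end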
Vectorizing parameters
In order to allow one training algorithm to work with pretty much any network architecture, we simply need to vectorize our parameters. Once all the parameters are unrolled into a single vector, the training code doesn’t need to know anything about the topology.
We can get parameters as follows:
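In Torch this is what getParameters() provides; a minimal call might look like this (the names par and gradPar are just illustrative):

-- flatten all (shared) parameters and their gradients into two vectors
par, gradPar = rnn:getParameters()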
We have one vector for the parameters and one vector for the gradient of the objective with respect to the parameters. This function allocates a single contiguous chunk of memory for all the parameter tensors and then points the original parameter tensors back at this memory (which is just a matter of configuring each tensor’s storage offset and strides). Because new memory is allocated, be sure not to run this function more than once, as any old references will no longer point to the actual tensors used.
Having vectorized parameters makes it easy to assign an initial guess:
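For example (the particular range here is an arbitrary choice):

par:uniform(-0.1, 0.1)   -- fill every parameter with a small random value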
And easy to perform an update:
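For instance, a plain gradient-descent step (rate is an assumed learning-rate variable):

gradPar:zero()            -- clear the accumulated gradients before each pass
-- ... forward/backward pass accumulates into gradPar ...
par:add(-rate, gradPar)   -- take a step down the gradient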
Full implementation
Now, with the above considerations in mind, here’s the crux of the RNN setup code, which is part of a larger script that handles data prep and training.
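The full script isn’t reproduced here; the sketch below just pulls the pieces discussed above together. The function name build_rnn, the params table of command-line options, the variable len, and the squared-error criterion are all illustrative assumptions.

require 'nngraph'

-- Build an unrolled RNN for sequences of length `len` with `n_hidden` hidden units.
local function build_rnn(len, n_hidden)
  local x  = nn.Identity()()            -- batch x len input sequences
  local h0 = nn.Identity()()            -- batch x n_hidden initial hidden state
  local h = h0
  for t = 1, len do
    local x_t = nn.Narrow(2, t, 1)(x)
    local joined = nn.JoinTable(2, 2)({x_t, h})
    h = nn.Tanh()(nn.Linear(1 + n_hidden, n_hidden)(joined))
  end
  local y = nn.Linear(n_hidden, 1)(h)   -- final hidden state -> prediction
  return nn.gModule({x, h0}, {y})
end

local rnn = build_rnn(len, params.hidden)

-- Tie the per-timestep Linear parameters and gradients together.
local p, gp = rnn:parameters()
for t = 2, len do
  local i = 2 * (t - 1) + 1
  p[i]:set(p[1]);      gp[i]:set(gp[1])     -- weight
  p[i+1]:set(p[2]);    gp[i+1]:set(gp[2])   -- bias
end

-- Flatten everything into two vectors for the training loop and initialize.
local par, gradPar = rnn:getParameters()
par:uniform(-0.1, 0.1)

local criterion = nn.MSECriterion()         -- e.g. a squared-error objective (an assumption)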
How does it fare? Well, when we train it as follows:
th model-1_layer.lua -hidden 26 -batch 20 -rate 0.02 -iter 30
We can see it does quite well. The model has 755 parameters, which is comparable to our MLP implementations.
Now that we have a simple RNN under our belt, let’s enhance it a bit by turning it into an LSTM.