Explain the architecture of Recurrent Neural Networks in detail.


Recurrent Neural Networks (RNNs) are a type of neural network that is capable of processing sequential data, such as time series or natural language. The basic architecture of an RNN is a simple feedforward neural network with a “memory” or “context” unit that allows the network to retain information from previous time steps.

The core building block of an RNN is the recurrent neuron (or cell), which has an internal state (or memory) that is updated at each time step. At time t, the cell combines the current input with the previous internal state and passes the result through a non-linear activation function (typically tanh). The result becomes the new internal state, which is carried forward to the next time step and is also used to compute the output at that step.
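As a concrete illustration, here is a minimal sketch of a single recurrent step in NumPy. The dimensions, initialization, and variable names (W_xh, W_hh, b_h) are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

# Minimal sketch of one recurrent cell step (sizes are assumptions).
input_size, hidden_size = 4, 8

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden (recurrent) weights
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: combine current input with previous state, apply tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)            # initial state
x_t = rng.normal(size=input_size)    # one input vector
h = rnn_step(x_t, h)                 # updated state, carried to the next step
```

The same weights (W_xh, W_hh, b_h) are reused at every time step, which is what makes the network "recurrent".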

RNNs can be unrolled in time, meaning that the recurrent connections between neurons are represented as a chain of identical copies of the network. This unrolled network can then be trained using traditional supervised learning techniques such as backpropagation through time (BPTT) or truncated backpropagation through time (TBPTT).
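To make the unrolling and truncation concrete, below is a rough PyTorch sketch of truncated BPTT: a long sequence is processed in fixed-size chunks, and the hidden state is detached between chunks so gradients only flow within a chunk. The sequence length, chunk size, model, and dummy data are all assumptions for illustration.

```python
import torch
import torch.nn as nn

seq_len, chunk, batch, in_dim, hid = 100, 20, 8, 4, 16

rnn = nn.RNN(in_dim, hid, batch_first=True)
head = nn.Linear(hid, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(batch, seq_len, in_dim)   # dummy input sequence
y = torch.randn(batch, seq_len, 1)        # dummy targets

h = torch.zeros(1, batch, hid)            # initial hidden state
for t in range(0, seq_len, chunk):
    xt, yt = x[:, t:t+chunk], y[:, t:t+chunk]
    out, h = rnn(xt, h)                   # unroll over this chunk of time steps
    loss = nn.functional.mse_loss(head(out), yt)
    opt.zero_grad()
    loss.backward()                       # backpropagate only through this chunk
    opt.step()
    h = h.detach()                        # cut the graph: the TBPTT truncation point
```

Full BPTT is the special case where the chunk covers the whole sequence and nothing is detached.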

[Figure: hidden layer representation of RNN, LSTM, and GRU]

There are many variations of RNNs, such as the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, which have been designed to overcome the vanishing gradient problem that can occur when training standard RNNs. These variants have additional gates or units that help to control the flow of information through the network and allow it to better retain information over longer time periods.
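In most deep learning frameworks these variants are drop-in replacements for the plain RNN cell. The sketch below (PyTorch, with assumed toy dimensions) runs all three on the same input; note that the LSTM additionally maintains a separate cell state.

```python
import torch
import torch.nn as nn

batch, seq_len, in_dim, hid = 8, 30, 4, 16
x = torch.randn(batch, seq_len, in_dim)   # dummy input sequence

vanilla = nn.RNN(in_dim, hid, batch_first=True)
lstm = nn.LSTM(in_dim, hid, batch_first=True)   # input/forget/output gates plus a cell state
gru = nn.GRU(in_dim, hid, batch_first=True)     # update/reset gates, no separate cell state

out_rnn, h_rnn = vanilla(x)
out_lstm, (h_lstm, c_lstm) = lstm(x)            # LSTM also returns the cell state c
out_gru, h_gru = gru(x)
print(out_rnn.shape, out_lstm.shape, out_gru.shape)  # all (batch, seq_len, hid)
```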

In summary, RNNs are neural networks with recurrent connections that allow them to retain information across time steps. They can be unrolled in time and trained with supervised learning techniques such as BPTT. Variants of RNNs like LSTM and GRU are designed to overcome the vanishing gradient problem.

What is the vanishing gradient problem, and what steps can we take to mitigate it?

The vanishing gradient problem is an issue that can occur when training deep neural networks, particularly recurrent neural networks (RNNs). It refers to the phenomenon where the gradients of the network weights, which are used to update the weights during training, become very small as they are backpropagated through many layers of the network. This can make it difficult for the network to learn, as the weight updates become very small, and training can take a very long time or even become impossible.

The problem occurs because, during backpropagation, the gradients are repeatedly multiplied by the layer weights and by the derivatives of the activation functions. If these factors are smaller than one (for example, because the weights are poorly initialized or too small), the gradient shrinks exponentially with the number of layers. This is particularly problematic in RNNs because the gradient must also be propagated backwards through many time steps, each of which multiplies in the same recurrent weight matrix, leading to an even greater reduction in gradient magnitude.
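A toy calculation makes the effect concrete: if each backward step through time multiplies the gradient by a factor below one (here an assumed recurrent weight of 0.9 and an assumed average tanh derivative of 0.5), the gradient shrinks exponentially with the number of steps.

```python
# Each backward step contributes roughly (recurrent weight * activation derivative).
w = 0.9          # assumed recurrent weight
tanh_grad = 0.5  # assumed average tanh derivative at the operating point
for T in (10, 50, 100):
    print(T, (w * tanh_grad) ** T)
# 10 -> ~3.4e-04, 50 -> ~5e-18, 100 -> ~2e-35: updates to early time steps effectively vanish
```

The same argument run with a factor above one gives the mirror-image exploding gradient problem.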

There are several ways to mitigate the vanishing gradient problem, such as:
1. Using gradient clipping to keep gradients within a reasonable range; this primarily guards against the related exploding gradient problem (a sketch follows this list).
2. Using architectures like Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU), which are designed to overcome this issue by controlling the flow of information through the network.
3. Other techniques include using the ReLU activation function, batch normalization, and residual connections.
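As an example of the first point, here is a sketch of gradient clipping inside a PyTorch training step; the model, data, and loss are placeholders, and `clip_grad_norm_` rescales the gradients so their global norm does not exceed the chosen threshold.

```python
import torch
import torch.nn as nn

model = nn.RNN(4, 16, batch_first=True)            # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 30, 4)                          # dummy input
out, _ = model(x)
loss = out.pow(2).mean()                           # dummy loss

opt.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0, preventing exploding updates.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```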

For more details on how the hidden layers of these networks work, Christopher Olah's blog post "Understanding LSTM Networks" is an excellent read.