Weight initialization is an important step in training neural networks: a good scheme can make a network converge faster and reach better predictions. Several methods exist and have been implemented by deep learning libraries, but the question remains which one to use, since we often don't have the time or compute budget to experiment with all of them.
1. Zero Initialization: If we go for zero-weight initialization, then during backpropagation we won't be able to learn anything, because every neuron in a layer computes the same output and therefore receives the same update. Initializing with zero weights, or with random values without any bounds, can also lead to overflow or underflow situations.
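To see why, here is a minimal NumPy illustration (the 4-input, 3-neuron layer is a made-up example): with zero weights, every neuron produces the same output, so every neuron gets the same gradient and the symmetry is never broken.

```python
import numpy as np

# Hypothetical layer: 4 inputs -> 3 hidden neurons, all weights zero
W = np.zeros((4, 3))
x = np.random.randn(1, 4)   # one random input example

h = x @ W                   # every neuron outputs exactly 0
print(h)                    # [[0. 0. 0.]] -> identical activations,
                            # so every neuron receives the same update
```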
One can also think of initializing the weights with small random values drawn from a uniform distribution, as follows:
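For example (a sketch in NumPy; the bound of 0.05 and the layer sizes are arbitrary choices for illustration):

```python
import numpy as np

def uniform_init(n_in, n_out, limit=0.05):
    """Sample weights uniformly from [-limit, limit]."""
    return np.random.uniform(low=-limit, high=limit, size=(n_in, n_out))

W = uniform_init(784, 256)   # e.g. a 784 -> 256 fully connected layer
```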
This works reasonably well for small networks, but it can lead to non-homogeneous distributions of activations across the layers of a deeper network.
2. Random Normal Initialization: Here we initialize the weights of the neurons to small random numbers in a symmetric fashion. The idea is that the neurons are all random and unique at the start, so they compute distinct updates and integrate themselves as diverse parts of the full network. We sample the weights from a Gaussian distribution with mean 0 and standard deviation 0.1, so the values stay well within (-1, 1).
With this formulation, every neuron's weight vector is initialized as a random vector sampled from a multi-dimensional Gaussian, so the neurons point in random directions in the input space.
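A sketch of this scheme in NumPy (the standard deviation of 0.1 follows the description above; the layer sizes are made-up examples):

```python
import numpy as np

def random_normal_init(n_in, n_out, std=0.1):
    """Sample weights from a Gaussian with mean 0 and standard deviation `std`."""
    return np.random.normal(loc=0.0, scale=std, size=(n_in, n_out))

W = random_normal_init(784, 256)
```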
One problem with the above approach is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. This can be addressed with Xavier initialization.
3. Xavier Initialization: Xavier initialization also helps signals reach deep into the network.
(i) If the weights in a network start too small, then the signal shrinks as it passes through each layer until it’s too tiny to be useful.
(ii) If the weights in a network start too large, then the signal grows as it passes through each layer until it’s too massive to be useful.
Xavier initialization keeps the weights at just the right scale, so the signal stays in a reasonable range of values through many layers.
The weights are drawn from a zero-mean Gaussian with variance

Var(W) = 2 / (Nin + Nout)

where Nin is the number of neurons feeding into the layer and Nout is the number of neurons the result is fed to.
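A minimal NumPy sketch of this scheme, assuming the Gaussian variant with variance 2 / (Nin + Nout); the layer sizes are illustrative:

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier/Glorot initialization: zero-mean Gaussian with
    variance 2 / (n_in + n_out)."""
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.normal(loc=0.0, scale=std, size=(n_in, n_out))

W = xavier_init(784, 256)
```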
4. Lasagne Initialization: This approach also deals with the variance issue, but it uses a uniform distribution instead of the Gaussian that Xavier uses. Here,
W ~ U[-sqrt(6 / (Nin + Nout)), +sqrt(6 / (Nin + Nout))]

where Nin is the number of neurons feeding into the layer and Nout is the number of neurons the result is fed to.
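A minimal NumPy sketch of this uniform scheme (this matches the GlorotUniform initializer that the Lasagne library uses by default; the layer sizes are illustrative):

```python
import numpy as np

def glorot_uniform_init(n_in, n_out):
    """Uniform variant: sample from [-limit, limit] with
    limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(low=-limit, high=limit, size=(n_in, n_out))

W = glorot_uniform_init(784, 256)
```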
This method is mainly used for sigmoid and hyperbolic tangent layers; for ReLU, Xavier initialization works best.