Medium Last updated on May 7, 2022, 1:17 a.m.

Activation functions, as we know, are the non-linearities introduced into a neural network architecture to enable complex pattern learning. Without non-linearities, neural networks are essentially linear models. Well-known activation functions widely used in neural networks include ReLU, tanh, and sigmoid. Let's take the example of the sigmoid activation function to understand why zero-centered non-linearities matter.

As shown in the figure above, the sigmoid function's outputs (y-axis) are all positive, so they are not zero-centered. Similarly, the ReLU activation function's outputs are all non-negative. We also know that neural networks are optimized with gradient descent algorithms, and that we generally normalize our data to make it zero-centered.
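To make the point about output ranges concrete, here is a minimal NumPy sketch. The helper names `sigmoid` and `relu`, the random seed, and the sample size are assumptions chosen for illustration; they do not come from the original article. It feeds zero-centered inputs through sigmoid, ReLU, and tanh and compares the range and mean of each output.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: maps any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU: clips negative inputs to zero.
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)            # illustrative seed
x = rng.uniform(-1.0, 1.0, size=10_000)   # inputs roughly in (-1, 1)
x = x - x.mean()                          # make the inputs zero-centered

sig_out = sigmoid(x)
relu_out = relu(x)
tanh_out = np.tanh(x)

for name, out in [("sigmoid", sig_out), ("relu", relu_out), ("tanh", tanh_out)]:
    print(f"{name:8s} min={out.min():+.3f} max={out.max():+.3f} mean={out.mean():+.3f}")
```

The sigmoid and ReLU outputs never go below zero, so their means sit well above zero, while tanh produces both signs and its mean stays close to zero for zero-centered inputs.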

Say our input X is cleaned and normalized, with values in the range (-1, 1). We forward pass this input X, but even though our input is zero-centered, the non-zero-centered activation layers will output all-positive data during the forward pass. This has implications for the dynamics of backpropagation, because if the data coming into a neuron is always positive (e.g. x > 0 elementwise in f = Wx + b), the gradients on the weights `w` during backpropagation become either all positive or all negative (depending on the gradient of the whole expression f), as shown below.
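The sign argument follows from the chain rule: for a single neuron f = w·x + b, the gradient of the loss with respect to weight `w_i` is (dL/df) · x_i, so when every x_i is positive, every weight gradient takes the sign of the scalar dL/df. The sketch below checks this numerically. The squared-error loss, the `weight_gradient` helper, and the chosen targets are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single neuron f = w.x + b with all-positive inputs,
# e.g. the outputs of a preceding sigmoid layer.
x = rng.uniform(0.1, 1.0, size=5)
w = rng.normal(size=5)
b = 0.0

def weight_gradient(x, w, b, target):
    # Assumed loss for illustration: L = (f - target)^2.
    f = w @ x + b
    dL_df = 2.0 * (f - target)   # scalar upstream gradient
    return dL_df * x             # chain rule: dL/dw_i = dL/df * x_i

f0 = w @ x + b
grad_neg = weight_gradient(x, w, b, target=f0 + 1.0)  # dL/df < 0
grad_pos = weight_gradient(x, w, b, target=f0 - 1.0)  # dL/df > 0

print(np.sign(grad_neg))  # every component is -1
print(np.sign(grad_pos))  # every component is +1
```

For a single example, the update can only move all weights of this neuron up together or down together; it cannot increase one weight while decreasing another.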

All-positive or all-negative gradients can introduce undesirable zig-zag dynamics in the gradient updates for the weights. In simpler terms, the path towards the optimum is not smooth. This is the same reason why we need zero-centered data.

**Is there any solution to resolve this situation?**

One way to mitigate this problem is to use mini-batch gradient descent. Once the gradients are added up across a batch of data, the final update for the weights can have variable signs, which somewhat reduces the issue. In short, non-zero-centered activation functions are an inconvenience, but the consequences are less severe than those of the saturated-activation problem.
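A small hand-picked example illustrates why summing over a batch helps. The inputs, targets, and squared-error loss here are assumptions chosen so the arithmetic is easy to follow; they are not from the original article. Each per-example gradient has a uniform sign, but the two examples have upstream gradients of opposite sign, so the summed gradient ends up with mixed signs.

```python
import numpy as np

# Two examples with all-positive inputs to a hypothetical linear neuron f = w.x + b.
X = np.array([[1.0, 0.2],
              [0.2, 1.0]])
targets = np.array([-0.5, 0.5])   # illustrative targets
w = np.zeros(2)
b = 0.0

f = X @ w + b                     # predictions: [0, 0]
dL_df = 2.0 * (f - targets)       # upstream gradients of (f - t)^2: [1, -1]

# Per-example weight gradients: row i is dL_df[i] * X[i].
per_sample_grads = dL_df[:, None] * X

# Mini-batch gradient: sum of the per-example gradients.
batch_grad = per_sample_grads.sum(axis=0)

print(np.sign(per_sample_grads))  # each row has a single sign
print(batch_grad)                 # approximately [0.8, -0.8]: mixed signs
```

The mixing only happens when examples in the batch disagree on the sign of the upstream gradient, which is why the article describes mini-batching as a partial remedy.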

Frequently Asked Questions by IBM