# Why don't we use the sigmoid activation function for all layers?

Easy Last updated on May 7, 2022, 1:11 a.m.

The sigmoid activation function is an exponential form of non-linearity whose outputs are always positive. As the formula below shows, the sigmoid's value lies in the range (0, 1).

The sigmoid activation function is widely used as the last layer of neural networks because its output range of (0, 1) can be interpreted as a probability. However, the sigmoid function suffers from saturation. Mathematically speaking:

$$\sigma(wx+b) = \frac{1}{1+e^{-(wx+b)}}$$

As we know, the sigmoid function has a convenient derivative:
$$\sigma'(x) = \sigma(x)\,(1-\sigma(x))$$
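This derivative identity is easy to verify numerically. The sketch below (a minimal NumPy example; the function names `sigmoid` and `sigmoid_grad` are our own) compares the identity against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Check the identity against a central finite difference.
x = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
print(np.allclose(sigmoid_grad(x), numeric))
```

The two estimates agree to within finite-difference error, confirming that the gradient can be computed from the forward activation alone.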

This will result in two cases:

• if $(wx+b) \to \infty$, the sigmoid function becomes
$$\sigma(wx+b) = \frac{1}{1+e^{-\infty}}$$
$$\sigma(wx+b) = 1$$

Therefore, using the derivative equation above we get:
$$\sigma'(wx+b) = 1 \cdot (1-1) = 0$$

• if $(wx+b) \to -\infty$, the sigmoid function becomes
$$\sigma(wx+b) = \frac{1}{1+e^{\infty}}$$
$$\sigma(wx+b) = 0$$

Similarly, using the derivative equation above we get:
$$\sigma'(wx+b) = 0 \cdot (1-0) = 0$$
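Both limit cases can be observed numerically. The sketch below (a minimal NumPy example; the function names are our own) shows how the output pins to 0 or 1 and the gradient collapses as the pre-activation moves away from zero:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation on pre-activation z = wx + b."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Gradient via sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient peaks at 0.25 (at z = 0) and vanishes as |z| grows.
for z in [-20.0, -5.0, 0.0, 5.0, 20.0]:
    print(f"z={z:+6.1f}  sigmoid={sigmoid(z):.6f}  grad={sigmoid_grad(z):.6f}")
```

At |z| = 20 the gradient is already on the order of 1e-9, so weight updates through such a saturated unit are effectively zero.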

In both cases above, the output saturates and the gradient approaches zero. Vanishingly small gradients lead to slow convergence toward the optimum when using the gradient descent algorithm. Worse, in a deep network the chain rule multiplies these per-layer derivatives together, and since the sigmoid's derivative is at most 0.25, gradients shrink exponentially with depth. This vanishing-gradient problem is the main reason the sigmoid is avoided in hidden layers and reserved mostly for the output layer.