# Explain Logistic Regression and its optimization.

Medium Last updated on May 7, 2022, 1:10 a.m.

Logistic Regression is a discriminative classifier. A discriminative classifier models only the conditional probability P(Y|X): we assume a probability distribution for the labels (Y) and learn a mapping from the input (X) to the parameters of that distribution. So, essentially, what Logistic Regression does is:

$$f_{LR}(x)=\arg\max_{c \in \mathcal{Y}}P(Y=c|x)$$

Since Logistic Regression is a classification model, the label distribution must be categorical. For binary classification we can assume a Bernoulli distribution, whereas for multi-class classification it is a Multinoulli (categorical) distribution. To keep the derivation simple, we assume a binary classification model, which means the probability of a label is θ if y = 1 and 1 − θ if y = 0. The likelihood of a single example is then:

\begin{equation}
P = \theta^y * (1-\theta)^{(1-y)}
\label{eq1}
\end{equation}
Here θ is given by the sigmoid function of the linear score wx + b,
$$\theta=\frac{1}{1+exp(-wx-b)}$$

Rearranging the sigmoid function, we get:

$$\theta * (1+exp(-wx-b)) = 1$$
$$exp(-wx-b) = \frac{1-\theta}{\theta}$$
Taking the log of both sides and negating, we find that the linear score equals the log-odds of the positive class:
$$wx+b = log(\frac{\theta}{1-\theta})$$
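
The relationship between the sigmoid and the log-odds can be checked numerically. The following is a minimal sketch in plain Python; the values of `w`, `b`, and `x` are illustrative assumptions, not taken from the article, and the helper names `sigmoid` and `logit` are chosen here for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z = w*x + b to a probability in (0, 1)."""
    # Branch on the sign so we never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(theta: float) -> float:
    """Inverse of the sigmoid: the log-odds log(theta / (1 - theta))."""
    return math.log(theta / (1.0 - theta))

# Illustrative (assumed) parameter and input values.
w, b, x = 0.8, -0.3, 2.0
z = w * x + b            # linear score
theta = sigmoid(z)       # probability of the positive class
recovered = logit(theta) # should match z up to floating-point rounding
```

Applying `logit` to the output of `sigmoid` recovers the linear score, which is why Logistic Regression is often described as a linear model of the log-odds.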

Which algorithm should we use to find the optimal parameters of Logistic Regression?

In theory, the optimal parameters of Logistic Regression are defined by Maximum Likelihood Estimation. In practice, they are found with an iterative numerical method such as Gradient Descent (or a quasi-Newton method such as L-BFGS), because Maximum Likelihood Estimation does not yield a closed-form solution. To see why, start from the likelihood of the n training examples (written as Loss below, although it is a quantity to maximize):

$$Loss = \prod_{i=1}^{n} \theta_{i}^{y_i}*(1-\theta_i)^{(1-y_i)}$$

Taking the log gives the log-likelihood, which is what Maximum Likelihood Estimation maximizes:
$$Loss = \sum_{i=1}^{n}log(1-\theta_i)+y_i * log(\frac{\theta_i}{1-\theta_i})$$
Substituting the definition of θ, and noting that log(1 − θ) = −log(1 + exp(wx + b)), we get:
$$Loss = \sum_{i=1}^{n} -log(1+exp(wx_i+b)) + y_i * (wx_i+b)$$
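
As a sanity check on this simplification, the log-likelihood can be evaluated two ways: directly as the sum of y·log(θ) + (1 − y)·log(1 − θ), and in the form Σ −log(1 + exp(wx + b)) + y·(wx + b). The sketch below uses a small made-up dataset and arbitrary parameters (all assumptions for illustration); the function names are hypothetical.

```python
import math

def log_likelihood_direct(w, b, xs, ys):
    """Sum over examples of y*log(theta) + (1 - y)*log(1 - theta)."""
    total = 0.0
    for x, y in zip(xs, ys):
        theta = 1.0 / (1.0 + math.exp(-(w * x + b)))
        total += y * math.log(theta) + (1 - y) * math.log(1.0 - theta)
    return total

def log_likelihood_simplified(w, b, xs, ys):
    """Sum over examples of -log(1 + exp(z)) + y*z, with z = w*x + b."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = w * x + b
        total += -math.log(1.0 + math.exp(z)) + y * z
    return total

# Made-up one-dimensional data and parameters, for illustration only.
xs = [-2.0, -1.0, 0.5, 1.5, 3.0]
ys = [0, 0, 1, 0, 1]
w, b = 0.7, -0.2

direct = log_likelihood_direct(w, b, xs, ys)
simplified = log_likelihood_simplified(w, b, xs, ys)
```

Both functions should return the same value up to rounding, and that value is negative because every term is the log of a probability.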

To find the optimal parameters, we differentiate this expression with respect to w and b. Differentiating with respect to w, for example, gives:

$$\frac{\partial Loss}{\partial w} = \sum_{i=1}^{n} -\frac{exp(wx_i+b) * x_i}{1+exp(wx_i+b)}+y_i * x_i$$

We cannot simply set this to 0 and solve for the optimal parameters, because the result is a transcendental equation with no closed-form solution. We can, however, solve it numerically using Gradient Descent, with Cross-Entropy (the negative log-likelihood) as the loss function.
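
A minimal sketch of that numerical approach follows, in plain Python for a single feature. It minimizes the mean negative log-likelihood, whose gradient with respect to w simplifies to the average of (θ_i − y_i)·x_i (and (θ_i − y_i) for b). The dataset, learning rate, and step count are assumptions picked for illustration, and the data is deliberately not linearly separable so that a finite optimum exists.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cross_entropy(w, b, xs, ys):
    """Mean negative log-likelihood: average of log(1 + exp(z)) - y*z."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = w * x + b
        # log(1 + exp(z)) written so that exp never overflows.
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total / len(xs)

def gradients(w, b, xs, ys):
    """Gradient of the mean negative log-likelihood w.r.t. w and b."""
    n = len(xs)
    gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    gb = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def fit(xs, ys, lr=0.5, steps=2000):
    """Plain batch gradient descent starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw, gb = gradients(w, b, xs, ys)
        w -= lr * gw
        b -= lr * gb
    return w, b

# Made-up data with overlapping classes (assumed for illustration).
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.5, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 0, 1, 1]

initial_loss = cross_entropy(0.0, 0.0, xs, ys)
w_hat, b_hat = fit(xs, ys)
final_loss = cross_entropy(w_hat, b_hat, xs, ys)
gw_final, gb_final = gradients(w_hat, b_hat, xs, ys)
```

After fitting, the gradient should be close to zero, which is the numerical counterpart of the equation that could not be solved in closed form. Library implementations typically use faster optimizers and add regularization, which this sketch omits.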

For details on how to solve Logistic Regression, please check out the Logistic Regression From Scratch Coding Exercise.