Explain Logistic Regression and its optimization.

Medium Last updated on May 7, 2022, 1:10 a.m.

Logistic Regression is a form of Discriminative Classifier. Discriminative Classifiers model only P(Y|X); i.e., we make an assumption about the probability distribution of the labels (Y) and try to map the input (X) to it directly. So, essentially, what Logistic Regression computes is:

$$ f_{LR}(x)=argmax_{c \in \mathcal{Y}}P(Y=c|x) $$

Since Logistic Regression is a Classification Model, we need to define a distribution over something categorical in nature. For Binary Classification we can assume a Bernoulli Distribution, whereas for Multi-Class Classification it would be a Multinoulli (categorical) Distribution. To keep the derivation simple, we will assume a Binary Classification Model: the probability of a label is Θ if y = 1, and 1 − Θ if y = 0. The likelihood of a single sample is then:

$$ P = \theta^y * (1-\theta)^{(1-y)} $$
Here Θ is the output of the sigmoid function applied to the linear score wx + b,
$$ \theta=\frac{1}{1+exp(-wx-b)} $$
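As a minimal sketch of the two formulas above (function names here are illustrative, not from any particular library):

```python
import math

def sigmoid(z):
    """theta = 1 / (1 + exp(-z)), where z = w*x + b."""
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_likelihood(theta, y):
    """P = theta^y * (1 - theta)^(1 - y) for y in {0, 1}."""
    return theta ** y * (1.0 - theta) ** (1 - y)

theta = sigmoid(0.0)                 # z = 0 gives theta = 0.5
p1 = bernoulli_likelihood(theta, 1)  # reduces to theta when y = 1
p0 = bernoulli_likelihood(theta, 0)  # reduces to 1 - theta when y = 0
```

For y = 1 the likelihood collapses to Θ, and for y = 0 to 1 − Θ, matching the two cases stated above.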

Rearranging the sigmoid function, we get:

$$ \theta * (1+exp(-wx-b)) = 1 $$
$$ exp(-wx-b) = \frac{1-\theta}{\theta} $$
Taking the log on both sides recovers the log-odds (logit):
$$ wx+b = log(\frac{\theta}{1-\theta})$$
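The last equation says the logit is the inverse of the sigmoid: the log-odds recover the linear score. A small round-trip check (illustrative names, assuming a toy value for wx + b):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(theta):
    """log(theta / (1 - theta)) -- the log-odds, i.e. wx + b."""
    return math.log(theta / (1.0 - theta))

z = 1.7                   # stands in for w*x + b
theta = sigmoid(z)
recovered = logit(theta)  # equals z up to floating-point error
```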

*(Figure: Logistic Function)*

Which algorithm should we use to find the optimal parameters of Logistic Regression?

In theory, the optimal parameters of Logistic Regression are defined by Maximum Likelihood Estimation. In practice, however, the model is fit iteratively with Gradient Descent (or quasi-Newton methods such as L-BFGS), because Maximum Likelihood Estimation does not yield a closed-form solution here. We can show this by writing down the likelihood to be maximized:

$$ Loss = \prod_{i=1}^{n} \theta_{i}^{y_i}*(1-\theta_i)^{(1-y_i)}$$

Taking the log (the log-likelihood) to simplify the optimization, we get:
$$ Loss = \sum_{i=1}^{n}log(1-\theta_i)+y_i * log(\frac{\theta_i}{1-\theta_i})$$
using the definition of Θ (note that log(1-θ_i) = -log(1+exp(wx_i+b)) and log(θ_i/(1-θ_i)) = wx_i+b) we get:
$$ Loss = \sum_{i=1}^{n} -log(1+exp(wx_i+b)) + y_i * (wx_i+b)$$
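This substitution can be sanity-checked numerically (toy values; names are illustrative): with z = wx + b, log(1 − θ) = −log(1 + exp(z)) and log(θ / (1 − θ)) = z, so the two forms of the per-sample term agree:

```python
import math

def summand_theta(theta, y):
    # Per-sample term written in terms of theta.
    return math.log(1.0 - theta) + y * math.log(theta / (1.0 - theta))

def summand_z(z, y):
    # Same term rewritten in terms of z = w*x + b.
    return -math.log(1.0 + math.exp(z)) + y * z

z, y = 0.9, 1
theta = 1.0 / (1.0 + math.exp(-z))  # sigmoid(z)
diff = abs(summand_theta(theta, y) - summand_z(z, y))
```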

Now, to find the optimal parameters, we differentiate with respect to w and b. Differentiating with respect to w, we observe:

$$ \frac{\partial Loss}{\partial w} = \sum_{i=1}^{n} -\frac{exp(wx_i+b) * x_i}{1+exp(wx_i+b)}+y_i * x_i $$

we can’t set it to 0 to obtain the optimal parameters, as this is a transcendental equation with no closed-form solution. We can, however, solve it approximately with numerical methods: noting that exp(wx_i+b)/(1+exp(wx_i+b)) is just θ_i, the gradient simplifies to the sum of (y_i − θ_i)x_i over the samples, which is exactly the gradient used by Gradient Descent with Cross-Entropy as the Loss Function (the negative of our log-likelihood).
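A minimal sketch of that numerical solution, assuming 1-D inputs, illustrative names, and plain batch gradient ascent on the log-likelihood (equivalently, gradient descent on the Cross-Entropy loss):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=1000):
    """Batch gradient ascent on the log-likelihood for 1-D inputs."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # Gradients from the derivation: sum of (y_i - theta_i) * x_i (and * 1 for b).
        gw = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
        gb = sum((y - sigmoid(w * x + b)) for x, y in zip(xs, ys))
        w += lr * gw
        b += lr * gb
    return w, b

# Toy, linearly separable data: negatives below 0, positives above.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

On this toy data the learned parameters push θ toward 1 for positive inputs and toward 0 for negative ones; a real implementation would vectorize the sums and stop on a convergence criterion rather than a fixed epoch count.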

For details on how to solve Logistic Regression, please check out the Logistic Regression From Scratch Coding Exercise.