Last updated on May 7, 2022
Logistic Regression is a Discriminative Classifier. Discriminative classifiers focus only on P(Y|X), i.e., we make an assumption about the probability distribution of the labels Y given the input X and learn a direct mapping from the input to it. So, essentially, what we try to do in Logistic Regression is:
$$ f_{LR}(x)=\operatorname{argmax}_{c \in \mathcal{Y}}P(Y=c \mid X=x) $$
Since Logistic Regression is a classification model, we need to define a distribution over something categorical. For binary classification we can assume a Bernoulli distribution, whereas for multi-class classification it will be a Multinoulli (categorical) distribution. To keep the derivation simple, we will build a binary classification model, which means the probability of a class is θ if y = 1 and 1 − θ if y = 0. The likelihood is then:
\begin{equation}
P(y) = \theta^{y} \, (1-\theta)^{(1-y)}
\label{eq1}
\end{equation}
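To make the likelihood concrete, here is a tiny sketch (not from the original post); the function name and the value 0.8 for θ are purely illustrative:

```python
def bernoulli_likelihood(y, theta):
    """P(y | theta) = theta**y * (1 - theta)**(1 - y), for y in {0, 1}."""
    return theta ** y * (1 - theta) ** (1 - y)

# If theta = P(y = 1) = 0.8, the likelihood picks out the right factor:
print(bernoulli_likelihood(1, 0.8))  # 0.8
print(bernoulli_likelihood(0, 0.8))  # ~0.2
```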
Here θ is the sigmoid function applied to a linear function of the input:
$$ \theta=\frac{1}{1+\exp(-wx-b)} $$
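As a small illustration (not from the original post), this sigmoid takes a couple of lines of NumPy; the values of `w`, `b`, and `x` below are made up:

```python
import numpy as np

def sigmoid(z):
    """theta = 1 / (1 + exp(-z)), where z = w·x + b."""
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.5, -1.2]), 0.3   # toy weights and bias
x = np.array([2.0, 1.0])            # toy input
print(sigmoid(w @ x + b))           # theta = P(y = 1 | x), about 0.525
```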
Rearranging the sigmoid we get:
$$ \theta \, (1+\exp(-wx-b)) = 1 $$
$$ \exp(-wx-b) = \frac{1-\theta}{\theta} $$
Taking the log on both sides and negating, we get:
$$ wx+b = \log\left(\frac{\theta}{1-\theta}\right)$$
So the linear score wx + b is exactly the log-odds (logit) of the positive class.
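A quick numerical check of this identity (a sketch with the same toy values for `w`, `b`, and `x` as above): applying log(θ/(1−θ)) to the sigmoid output should recover wx + b.

```python
import numpy as np

w, b = np.array([0.5, -1.2]), 0.3
x = np.array([2.0, 1.0])

z = w @ x + b                            # wx + b
theta = 1.0 / (1.0 + np.exp(-z))         # sigmoid
log_odds = np.log(theta / (1 - theta))   # log(theta / (1 - theta))

print(np.isclose(z, log_odds))           # True: the log-odds recover wx + b
```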
Which algorithm do we use to find the optimal parameters of Logistic Regression?
In theory, the optimal parameters of Logistic Regression are found by Maximum Likelihood Estimation. In practice, though, the likelihood has to be maximized numerically with Gradient Descent (or a quasi-Newton method such as L-BFGS), because Maximum Likelihood Estimation fails to provide a closed-form solution here. We can see why as follows. The likelihood of the whole dataset is:
$$ L(w,b) = \prod_{i=1}^{n} \theta_{i}^{y_i}\,(1-\theta_i)^{(1-y_i)}$$
Taking the log gives the log-likelihood that Maximum Likelihood Estimation maximizes:
$$ \ell(w,b) = \sum_{i=1}^{n}\log(1-\theta_i)+y_i \log\left(\frac{\theta_i}{1-\theta_i}\right)$$
Substituting the definition of θ (and the log-odds identity derived above), we get:
$$ \ell(w,b) = \sum_{i=1}^{n} -\log(1+\exp(wx_i+b)) + y_i\,(wx_i+b)$$
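As a quick sanity check on this simplification (a sketch with made-up data; `log_likelihood`, `X`, `y`, `w`, `b` are all hypothetical names and values), the simplified form should agree with the direct Bernoulli form:

```python
import numpy as np

def log_likelihood(w, b, X, y):
    """Sum over i of  -log(1 + exp(w·x_i + b)) + y_i * (w·x_i + b)."""
    z = X @ w + b
    return np.sum(-np.log1p(np.exp(z)) + y * z)

# Compare with sum_i  y_i*log(theta_i) + (1 - y_i)*log(1 - theta_i)
X = np.array([[2.0, 1.0], [0.5, -1.0], [1.5, 0.0]])
y = np.array([1.0, 0.0, 1.0])
w, b = np.array([0.5, -1.2]), 0.3

theta = 1.0 / (1.0 + np.exp(-(X @ w + b)))
direct = np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
print(np.isclose(log_likelihood(w, b, X, y), direct))  # True
```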
To find the optimal parameters we differentiate with respect to w and b and try to set the derivatives to zero. Differentiating with respect to w, what we observe is:
$$ \frac{\partial \ell}{\partial w} = \sum_{i=1}^{n} -\frac{\exp(wx_i+b)}{1+\exp(wx_i+b)}\,x_i + y_i\,x_i $$
We cannot set this to 0 and solve for the parameters directly: it is a transcendental equation, and there is no closed-form solution for it. We can, however, solve it approximately and numerically using Gradient Descent, with Cross-Entropy (the negative of the log-likelihood above) as the loss function.
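Below is a minimal sketch of that numerical solution, assuming plain full-batch gradient ascent on the log-likelihood (equivalently, gradient descent on the cross-entropy loss); the function name `fit_logistic_regression`, the learning rate, and the iteration count are illustrative choices, not from the original post. It uses the fact that the derivative above simplifies to the sum over i of (y_i − θ_i) x_i.

```python
import numpy as np

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Full-batch gradient ascent on the log-likelihood derived above.

    X: (n, d) feature matrix, y: (n,) labels in {0, 1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        theta = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1 | x_i)
        grad_w = X.T @ (y - theta)                  # d(log-likelihood) / dw
        grad_b = np.sum(y - theta)                  # d(log-likelihood) / db
        w += lr * grad_w / n                        # step uphill on the log-likelihood
        b += lr * grad_b / n
    return w, b
```

Flipping the signs and minimizing the cross-entropy loss instead would give exactly the same updates.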
For details on how to solve Logistic Regression, please check out the Logistic Regression From Scratch Coding Exercise.