# What is Linear Regression? What are the assumptions of a Linear Regression Model?

Medium Last updated on May 7, 2022, 1:25 a.m.

Linear regression is a parametric regression method that assumes the relationship between the observed output y and the observed input x is a linear function with parameters w (the weights) and b (the bias). The linear regression model can be written as:

$$f_{Lin}(x)= \sum_{d=1}^{D}w_{d}x_{d} +b$$

Here d indexes the D feature dimensions of the input x.
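As a quick illustration of the formula above, here is a minimal NumPy sketch of the prediction step. The function name `predict` and the parameter values are hypothetical, chosen only for the example.

```python
import numpy as np

def predict(x, w, b):
    """Linear model f(x) = sum_d w_d * x_d + b.

    Works for a single input of shape (D,) or a batch of shape (N, D).
    """
    return np.asarray(x) @ np.asarray(w) + b

# Hypothetical parameters for D = 3 features (illustrative values only).
w = np.array([0.5, -1.0, 2.0])
b = 0.25
x = np.array([2.0, 1.0, 0.5])

# 0.5*2.0 + (-1.0)*1.0 + 2.0*0.5 + 0.25
y_hat = predict(x, w, b)
```

The same function handles a whole data matrix at once, since the matrix-vector product applies the weighted sum to every row.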

Note: Because linear regression assumes a linear relationship, it is a high-bias, low-variance modeling approach.

## Capacity Control measures for Linear Regression:

Similar to classification, regression models also require capacity control to avoid overfitting and numerical stability problems in high dimensions. This can be accomplished by:

1. Basis Expansion: Linear regression models can be extended to capture non-linear relationships by using basis function expansions. Here, the input X in the equation Y = wX + b is replaced by a non-linear function of X (an exponential function, a polynomial function, etc.), while the model stays linear in the parameters w and b.

2. Regularization: regularizing the weight parameters during learning also helps in tuning model capacity. To learn more, read our blog on What is Regularization? When do we need it?
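The two measures above can be sketched together in a few lines of NumPy. This is only an illustration under stated assumptions: the basis is polynomial, the regularizer is an L2 (ridge) penalty on the weights (one common choice, not necessarily the one the linked post covers), and the helper names `polynomial_features` and `fit_ridge` are hypothetical.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a 1-D input array to the columns [x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(1, degree + 1)])

def fit_ridge(Phi, y, lam):
    """Minimize sum_i (y_i - Phi_i w - b)^2 + lam * ||w||^2.

    Assumption: the bias b is not penalized, so it is handled by
    centering the features and the targets before solving for w.
    """
    Phi_mean = Phi.mean(axis=0)
    y_mean = y.mean()
    Phi_c = Phi - Phi_mean
    y_c = y - y_mean
    D = Phi.shape[1]
    w = np.linalg.solve(Phi_c.T @ Phi_c + lam * np.eye(D), Phi_c.T @ y_c)
    b = y_mean - Phi_mean @ w
    return w, b

# Noise-free quadratic data (illustrative): y = 1 + 2x - 3x^2
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x ** 2

Phi = polynomial_features(x, degree=2)
w_plain, b_plain = fit_ridge(Phi, y, lam=0.0)  # no regularization
w_reg, b_reg = fit_ridge(Phi, y, lam=5.0)      # weights shrink toward zero
```

With `lam=0.0` the fit recovers the generating coefficients, since the model is still linear in w after the expansion. Increasing `lam` shrinks the weight vector, which is how the penalty limits capacity.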

### How can we learn the parameter values w and b?

#### Ordinary Least Squares Method

OLS selects the linear regression parameters that minimize the mean squared error (MSE) on the training data set. The optimization problem can be written as:
$$w^\ast , b^\ast = argmin_{w, b} \frac{1}{N} \sum_{i=1}^{N}(y_{i}-(x_{i}w+b))^2$$

To solve this problem in closed form, let X be a data matrix with one data case per row, and Y a column vector containing the corresponding outputs. The bias b can be absorbed into w by appending a constant feature of 1 to every row of X, so the optimization problem becomes:

$$w^\ast = argmin_{w} \frac{1}{N} (Y-Xw)^{T}(Y-Xw)$$

Taking the first-order derivative with respect to w and setting it to zero, we get:

$$0 = \frac{\partial }{\partial w} \frac{1}{N} (Y-Xw)^{T}(Y-Xw)$$
$$X^{T} (Y-Xw) = 0$$
$$X^{T}Xw = X^{T}Y$$
Therefore, provided $X^{T}X$ is invertible, the optimal w is:
$$w^\ast = (X^{T}X)^{-1} X^{T}Y$$
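
The closed-form solution can be checked numerically. The sketch below uses synthetic data with made-up parameters (`true_w`, `true_b`) and assumes the bias is handled by appending a column of ones to X. It solves the normal equations $X^{T}Xw = X^{T}Y$ with `np.linalg.solve` rather than forming the inverse explicitly, and compares the result against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): N cases, D features, small Gaussian noise.
N, D = 200, 3
X = rng.normal(size=(N, D))
true_w = np.array([1.5, -2.0, 0.7])
true_b = 0.3
Y = X @ true_w + true_b + 0.01 * rng.normal(size=N)

# Absorb the bias into the weight vector with a constant feature of 1.
X_aug = np.hstack([X, np.ones((N, 1))])

# Normal equations: (X^T X) theta = X^T Y
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y)
w_hat, b_hat = theta[:-1], theta[-1]

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
```

Both routes should agree on well-conditioned data like this. In practice a solver such as `np.linalg.lstsq` is usually the safer choice, because it does not rely on $X^{T}X$ being well conditioned.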

## Limitations of Linear Regression:

1. Number of features <= number of data points: Linear regression needs at least D data cases to learn a model with a D-dimensional feature vector. Otherwise, $X^T X$ is singular and its inverse in the optimal w equation is not defined.

2. Sensitive to collinear features: Collinear features can be defined mathematically as $feature_1 = a \cdot feature_2 + b$. With collinear features, $X^T X$ is singular or nearly singular, so computing its inverse becomes numerically unstable.

3. Computation is cubic in the data dimension D: inverting the D x D matrix $X^T X$ costs $O(D^3)$.

4. Very sensitive to noise and outliers: the MSE objective, which corresponds to assuming normally distributed residuals, squares each residual, so a few large errors can dominate the fit.
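
Some of the limitations above are easy to observe numerically. The following is a rough sketch on made-up data: it checks the rank of $X^T X$ when there are fewer data cases than features, compares condition numbers with and without a collinear feature, and shows how one outlier changes a fitted slope. All variable names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Limitation 1: fewer data cases (N = 3) than features (D = 5).
X_small = rng.normal(size=(3, 5))
gram_rank = np.linalg.matrix_rank(X_small.T @ X_small)  # below 5, so not invertible

# Limitation 2: feature1 = a * feature2 + b (with a constant column for the bias).
f2 = rng.normal(size=100)
f1 = 2.0 * f2 + 1.0
ones = np.ones(100)
X_col = np.column_stack([f1, f2, ones])
X_ind = np.column_stack([rng.normal(size=100), f2, ones])
cond_col = np.linalg.cond(X_col.T @ X_col)  # extremely large
cond_ind = np.linalg.cond(X_ind.T @ X_ind)  # modest

# Limitation 4: a single outlier pulls the least-squares line.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0  # corrupt one target value
slope_clean, _ = np.polyfit(x, y_clean, 1)
slope_outlier, _ = np.polyfit(x, y_outlier, 1)
```

Here `np.polyfit` with degree 1 is used only as a convenient least-squares line fit. Comparing `slope_clean` with `slope_outlier` shows how strongly one corrupted point moves the estimate under a squared-error objective.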