These are my notes that cover the mathematical foundations of Linear Regression, taken while attending CSCI-UA 9473 - Foundations of Machine Learning at NYU Paris. They make use of probability theory, statistics, calculus, optimization, and linear algebra to formalize the concept of linear regression.
Linear regression is one of the most widely used statistical techniques in data analysis and machine learning. It provides a simple and intuitive model of the relationship between a response variable and a set of explanatory variables.
Simple Linear Regression
Least Squares Criterion
We model the response as a linear function of a single explanatory variable:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

where $i = 1, \dots, n$ indexes the data points and $\epsilon_i$ is a noise term. We want to find $(\hat{\beta}_0, \hat{\beta}_1)$ such that the sum of squared errors

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$

is minimized.
Therefore, let's compute the partial derivatives of $S$:

$$\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right), \qquad \frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i \left(y_i - \beta_0 - \beta_1 x_i\right)$$
Setting them equal to zero, we get the critical values $(\hat{\beta}_0, \hat{\beta}_1)$ that minimize $S$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \sum_{i=1}^{n} x_i \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0$$
Replacing the value of $\hat{\beta}_0$ in the second equation, we get:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})\, x_i}$$
However, it is commonly written in terms of Pearson's sample correlation coefficient $r_{xy}$.
For that, a slight modification is required in the equation above:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\left(y_i - {\color{green}\bar{y}}\right)}{\sum_{i=1}^{n} (x_i - \bar{x})\left(x_i - {\color{green}\bar{x}}\right)}$$
Even though we have added the part in green, the value of $\hat{\beta}_1$ is unchanged, since we can show that the extra term is zero:

$$\sum_{i=1}^{n} (x_i - \bar{x})\,\bar{y} = \bar{y} \sum_{i=1}^{n} (x_i - \bar{x}) = \bar{y}\,(n\bar{x} - n\bar{x}) = 0$$

(and similarly for the $\bar{x}$ term in the denominator).
Therefore, from the modified equation, we get:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r_{xy}\,\frac{s_y}{s_x}$$
Here, $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$.
$r_{xy}$ is Pearson's sample correlation coefficient between $x$ and $y$, defined as:

$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
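The two expressions for the slope can be checked numerically. A minimal sketch with NumPy, on made-up data (the values here are purely illustrative):

```python
import numpy as np

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope and intercept from the least squares formulas
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# Equivalent form via Pearson's r and the sample standard deviations
r = np.corrcoef(x, y)[0, 1]
beta1_alt = r * y.std(ddof=1) / x.std(ddof=1)

print(beta0, beta1)
assert np.isclose(beta1, beta1_alt)  # the two expressions agree
```

Both routes give the same slope, as the derivation above shows.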
Multiple Linear Regression
In multiple linear regression, there are multiple independent variables in the equation rather than just one. The mathematics behind multiple linear regression involves using matrix operations to solve for the coefficients of the regression equation.
The regression equation is now represented as:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$$
where $x_1, \dots, x_d$ are the independent variables and $w_0, w_1, \dots, w_d$ are the coefficients of the equation.
We define $x_0 = 1$ and augment each data point as $x = (x_0, x_1, \dots, x_d)^T$, so that the equation can be written as:

$$y = \sum_{j=0}^{d} w_j x_j = w^T x$$
Least Squares Criterion
The least squares criterion becomes

$$L(w) = \lVert y - Xw \rVert^2 = (y - Xw)^T (y - Xw)$$

where:

- $y \in \mathbb{R}^n$ is the vector of target values.
- $\hat{y} = Xw \in \mathbb{R}^n$ is the vector of predictions.
- $X$, whose $i$-th row is the augmented data point $x_i^T$, is called the design matrix. It is of size $n \times (d+1)$.
- $w \in \mathbb{R}^{d+1}$ is the vector of weights.
We want to find the minimizer $\hat{w}$ of $L(w)$, so we solve:

$$\nabla_w L(w) = -2X^T (y - Xw) = 0 \implies X^T X\, \hat{w} = X^T y \implies \hat{w} = (X^T X)^{-1} X^T y$$
Note that we are implicitly assuming that $X^T X$ is an invertible matrix. If the number of linearly independent examples is less than the number of features, or if the features themselves are not linearly independent, then $X^T X$ is not invertible.
This least squares solution $\hat{w}$ is an estimator for $w$. It represents the vector normal to the hyperplane that minimizes the sum of squared errors between the target values and the predictions.
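The normal equations can be sketched in NumPy on synthetic data. Note that solving the linear system directly is numerically preferable to forming $(X^T X)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3
X_raw = rng.normal(size=(n, d))
X = np.hstack([np.ones((n, 1)), X_raw])      # design matrix with bias column
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)    # targets with Gaussian noise

# Normal equations: X^T X w_hat = X^T y
# np.linalg.solve is more stable than explicitly inverting X^T X
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(w_hat)  # close to w_true
```

With low noise and well-conditioned features, the recovered weights land close to the generating ones.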
However, this is not enough: we also need to quantify how confident we are in our predictions.
Assume that the target values are represented by random variables that are independent and identically distributed (i.i.d.). Then,

$$y = Xw^* + \epsilon$$

where $w^*$ is the vector of true weights that generates the data and $\epsilon$ is an independent Gaussian noise term.
We can write the distribution of $\epsilon$ as:

$$\epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

Here, $I$ is the $n \times n$ identity matrix.
We can show this componentwise as:

$$\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \quad i = 1, \dots, n$$
$\hat{w}$ is our estimator for $w^*$. We can write:

$$\hat{w} = (X^T X)^{-1} X^T y$$
Substituting the value of $y$ from the model equation $y = Xw^* + \epsilon$, we can write:

$$\hat{w} = (X^T X)^{-1} X^T (Xw^* + \epsilon) = w^* + (X^T X)^{-1} X^T \epsilon$$
Then, taking the expectation of both sides of the equation, we get:

$$\mathbb{E}[\hat{w}] = w^* + (X^T X)^{-1} X^T\, \mathbb{E}[\epsilon] = w^*$$
We can see that our least squares estimator is unbiased: its expectation equals the true $w^*$.
Any affine transformation of a Gaussian vector is also a Gaussian vector. If $A$ is a matrix (that represents a linear transformation) and $z \sim \mathcal{N}(\mu, \Sigma)$, then:

$$Az + b \sim \mathcal{N}\left(A\mu + b,\; A \Sigma A^T\right)$$

Then, if $A = (X^T X)^{-1} X^T$, applying this to $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ gives:

$$\hat{w} = w^* + A\epsilon \sim \mathcal{N}\left(w^*,\; \sigma^2 (X^T X)^{-1}\right)$$

since $A (\sigma^2 I) A^T = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}$.
To compute a confidence interval on the weights, i.e. on the $k$-th component of $\hat{w}$, we define $e_k$ as the $k$-th standard basis vector:

$$e_k = (0, \dots, 0, 1, 0, \dots, 0)^T$$
Then, we can write $\hat{w}_k = e_k^T \hat{w}$, which implies:

$$\hat{w}_k \sim \mathcal{N}\left(w^*_k,\; \sigma^2\, e_k^T (X^T X)^{-1} e_k\right) = \mathcal{N}\left(w^*_k,\; \sigma^2 \left[(X^T X)^{-1}\right]_{kk}\right)$$
Then, we can write:

$$\frac{\hat{w}_k - w^*_k}{\sigma \sqrt{\left[(X^T X)^{-1}\right]_{kk}}} \sim \mathcal{N}(0, 1)$$

so, with probability $1 - \alpha$:

$$w^*_k \in \left[\hat{w}_k - z_{1-\alpha/2}\,\sigma\sqrt{\left[(X^T X)^{-1}\right]_{kk}},\;\; \hat{w}_k + z_{1-\alpha/2}\,\sigma\sqrt{\left[(X^T X)^{-1}\right]_{kk}}\right]$$
For example, say that we want to compute the $95\%$ confidence interval for the $k$-th component of $w^*$; then $\alpha = 0.05$ and $z_{1-\alpha/2} \approx 1.96$:

$$w^*_k \in \left[\hat{w}_k - 1.96\,\sigma\sqrt{\left[(X^T X)^{-1}\right]_{kk}},\;\; \hat{w}_k + 1.96\,\sigma\sqrt{\left[(X^T X)^{-1}\right]_{kk}}\right]$$
Here, this interval is the $95\%$ confidence interval for the $k$-th component of $w^*$.
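This interval can be computed directly. A sketch with NumPy on synthetic data, assuming the noise level $\sigma$ is known (in practice it must be estimated):

```python
import numpy as np

rng = np.random.default_rng(1)

n, sigma = 200, 0.5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
w_star = np.array([1.0, -2.0, 3.0])
y = X @ w_star + sigma * rng.normal(size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
XtX_inv = np.linalg.inv(X.T @ X)

k = 1  # component of interest
# 95% interval: w_hat_k +/- 1.96 * sigma * sqrt([(X^T X)^{-1}]_kk)
half_width = 1.96 * sigma * np.sqrt(XtX_inv[k, k])
lo, hi = w_hat[k] - half_width, w_hat[k] + half_width
print(f"95% CI for w_{k}: [{lo:.3f}, {hi:.3f}]")
```

The interval shrinks as $n$ grows, since $\left[(X^T X)^{-1}\right]_{kk}$ scales roughly like $1/n$.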
In practice, we don't know the true $\sigma$ and we have to estimate it from the data. We can estimate it using the following formula:

$$\hat{\sigma}^2 = \frac{1}{n - d - 1} \sum_{i=1}^{n} \left(y_i - \hat{w}^T x_i\right)^2$$
Here, $d$ is the number of features and $n$ is the number of samples.
The term $n - d - 1$ rather than $n$ in the denominator is used to make $\hat{\sigma}^2$ an unbiased estimate of $\sigma^2$.
Now, the empirical variance of the residuals is nearly equal to $\sigma^2$. The term

$$\frac{\hat{w}_k - w^*_k}{\hat{\sigma} \sqrt{\left[(X^T X)^{-1}\right]_{kk}}}$$

is now asymptotically Gaussian with mean $0$ and variance $1$.
- If the value of $n$ is small, it follows a t-distribution (Student's distribution) with $n - d - 1$ degrees of freedom. The exact confidence intervals can even be derived using Cochran's theorem.
- Our math for confidence intervals relies on the assumption that our data is homoscedastic. In the case that it is not, we can compute the confidence intervals empirically using the bootstrap method, which is based on resampling and the law of large numbers.
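The bootstrap approach mentioned above can be sketched as follows: resample $(x, y)$ pairs with replacement, refit on each resample, and take empirical quantiles of the refitted weights. The data and constants here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 150
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
y = X @ np.array([1.0, 2.0]) + 0.3 * rng.normal(size=n)

B = 1000                                       # number of bootstrap resamples
boot_w = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, n, size=n)           # resample rows with replacement
    Xb, yb = X[idx], y[idx]
    boot_w[b] = np.linalg.solve(Xb.T @ Xb, Xb.T @ yb)

# Empirical 95% interval for the slope
lo, hi = np.percentile(boot_w[:, 1], [2.5, 97.5])
print(f"bootstrap 95% CI for the slope: [{lo:.3f}, {hi:.3f}]")
```

Unlike the Gaussian formula, this requires no homoscedasticity assumption; its justification rests on resampling and the law of large numbers.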
Currently, we are using the following loss function:

$$L(w) = \lVert y - Xw \rVert^2$$
We can also add a regularization term to the loss function, to prevent overfitting:

$$L_\lambda(w) = \lVert y - Xw \rVert^2 + \lambda \lVert w \rVert^2$$
This is called Ridge or L2 regularization. Here, $\lambda > 0$ is the regularization parameter. It is a hyperparameter and can be tuned using cross-validation.
Note: We can also use other regularization methods like lasso (L1) regularization, square-root lasso, etc.
Now, to find a value of $w$ that minimizes the loss function, we will solve:

$$\nabla_w L_\lambda(w) = -2X^T (y - Xw) + 2\lambda w = 0 \implies \hat{w}_\lambda = (X^T X + \lambda I)^{-1} X^T y$$
Note: Now, even if $X^T X$ is not invertible, $\hat{w}_\lambda$ is well defined because of the $\lambda I$ shift: $X^T X + \lambda I$ is positive definite, hence invertible, for any $\lambda > 0$.
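This point is easy to check numerically: with fewer samples than features, $X^T X$ is singular and plain least squares breaks down, yet the ridge system still solves. A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fewer samples than features: X^T X is singular here
n, d = 5, 10
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

assert np.linalg.matrix_rank(X.T @ X) < d      # singular without the shift

# Ridge solution: (X^T X + lambda I) is invertible for lambda > 0
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(w_ridge.shape)  # (10,)
```

The same call with `lam = 0` would raise a singular-matrix error on this data.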
Let's now look at the impact on the generalization error. On a new point $(x, y)$ with $y = x^T w^* + \epsilon_{\text{new}}$, the prediction error is:

$$y - x^T \hat{w}_\lambda$$
Substituting the value $\hat{w}_\lambda = (X^T X + \lambda I)^{-1} X^T (Xw^* + \epsilon)$ in the above equation, we get:

$$y - x^T \hat{w}_\lambda = \underbrace{x^T \left(I - (X^T X + \lambda I)^{-1} X^T X\right) w^*}_{\text{bias}} \;-\; \underbrace{x^T (X^T X + \lambda I)^{-1} X^T \epsilon}_{\text{variance}} \;+\; \underbrace{\epsilon_{\text{new}}}_{\text{noise}}$$
- $x^T \left(I - (X^T X + \lambda I)^{-1} X^T X\right) w^*$ is the non-zero bias term, which is deterministic. (It is $0$ if $\lambda = 0$.)
- $x^T (X^T X + \lambda I)^{-1} X^T \epsilon$ is the variance term.
- $\epsilon_{\text{new}}$ is the noise term, which is uncontrollable.
Here, if we take the expectation of $\hat{w}_\lambda$, we get a non-zero bias term. It's a pity, because our non-regularized model had an unbiased least squares estimator (its expectation was the ground truth). Therefore, by introducing a bias into our model, we are either underestimating or overestimating the ground truth.
The interest of introducing a bias in our model is that it enables us to control the variance term: with a small $\lambda$, we have a high variance and a low bias; with a large $\lambda$, we have a low variance and a high bias. We call this the bias-variance tradeoff.
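The tradeoff can be illustrated with a small Monte Carlo experiment: fix $X$ and $w^*$, redraw the noise many times, and compare the bias and variance of $\hat{w}_\lambda$ for a small versus a large $\lambda$. All values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

n, d, sigma = 50, 3, 1.0
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
w_star = np.array([1.0, 2.0, -1.0])

def ridge_stats(lam, trials=2000):
    """Monte Carlo estimate of the bias and variance of the ridge estimator."""
    A = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)   # maps y -> w_hat
    w_hats = np.array([A @ (X @ w_star + sigma * rng.normal(size=n))
                       for _ in range(trials)])
    bias = np.linalg.norm(w_hats.mean(axis=0) - w_star)   # ||E[w_hat] - w*||
    variance = w_hats.var(axis=0).sum()                   # total variance
    return bias, variance

b_small, v_small = ridge_stats(lam=0.01)
b_large, v_large = ridge_stats(lam=50.0)

print(b_small, v_small)  # small bias, larger variance
print(b_large, v_large)  # larger bias, smaller variance
```

Sweeping $\lambda$ over a grid and plotting the two curves reproduces the usual bias-variance picture; cross-validation picks the $\lambda$ that balances them.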