These are my notes that cover the mathematical foundations of Linear Regression, taken while attending CSCI-UA 9473 - Foundations of Machine Learning at NYU Paris. They make use of probability theory, statistics, calculus, optimization, and linear algebra to formalize the concept of linear regression.


Linear regression is one of the most widely used statistical techniques in data analysis and machine learning. It provides a simple and intuitive linear model for modeling the relationship between a response variable and a set of explanatory variables.

Simple Linear Regression

Least Squares Criterion

JN(w)=12Ni=1N(yiw1xiw0)2\begin{equation} J_{\tiny N}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)^2 \end{equation}

where w=(w0w1)\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \end{pmatrix}. We want to find w\mathbf{w} such that JN(w)J_{\tiny N}(\mathbf{w}) is minimized.

Therefore, lets compute JN(w)\nabla J_{\tiny N}(\mathbf{w}):

JN(w)=[JNw0(w)JNw1(w)]\begin{equation} \nabla J_{\tiny N}(\mathbf{w}) = \left[\begin{array}{c} \dfrac{\partial J_{\tiny N}}{\partial w_0}(\left.\mathbf{w}\right)\\ \\ \dfrac{\partial J_{\tiny N}}{\partial w_1}(\left.\mathbf{w}\right)\\ \end{array}\right] \end{equation}


JNw0(w)=1Ni=1N(yiw1xiw0)JNw1(w)=1Ni=1N(yiw1xiw0)xi\begin{align} \dfrac{\partial J_{\tiny N}}{\partial w_0}(\left.\mathbf{w}\right) &= -\frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)\\ \dfrac{\partial J_{\tiny N}}{\partial w_1}(\left.\mathbf{w}\right) &= -\frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)x_i \end{align}

Setting them equal to zero, we get the critical values (w0,w1)(w_0, w_1) that minimize JN(w)J_{\tiny N}(\mathbf{w}):

JNw0(w)=0\begin{equation*} \dfrac{\partial J_{\tiny N}}{\partial w_0}(\left.\mathbf{w}\right) = 0 \end{equation*}
    i=1N(yiw1xiw0)=0    w0=1Ni=1N(yiw1xi)    w0=(1Ni=1Nyi)w1(1Ni=1Nxi)\begin{align*} &\implies \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right) = 0\\ &\implies w_0 = \frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i \right)\\ &\implies w_0 = \left(\frac{1}{N} \sum_{i=1}^N y_i \right) - w_1 \left(\frac{1}{N} \sum_{i=1}^N x_i\right)\\ \end{align*}
w0=yw1x\begin{equation} \therefore w_0 = \overline{y} - w_1 \overline{x} \end{equation}


JNw1(w)=0\begin{equation*} \dfrac{\partial J_{\tiny N}}{\partial w_1}(\left.\mathbf{w}\right) = 0 \end{equation*}
    i=1N(yiw1xiw0)xi=0\begin{align*} &\implies \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)x_i = 0\\ \end{align*}

Replacing the value of w0w_0, we get:

    i=1N(yiw1xi(yw1x))xi=0    i=1N(yiy)xiw1i=1N(xix)xi=0    w1=i=1N(yiy)xii=1N(xix)xi\begin{align} &\implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right)x_i = 0\\ &\implies \sum_{i=1}^N \left( y_i - \overline{y} \right)x_i - w_1 \sum_{i=1}^N \left( x_i - \overline{x} \right)x_i = 0\\ &\implies w_1 = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right)x_i}{\displaystyle\sum_{i=1}^N \left( x_i - \overline{x} \right)x_i}\\ \end{align}

However, it is commonly written in terms of Pearson's sample correlation coefficient rxyr_{xy}.

For that, a slight modification is required in equation (6)(6):

    i=1N(yiw1xi(yw1x))(xix)+i=1N(yiw1xi(yw1x))x=0\begin{align} &\implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \textcolor{#01B636}{(}x_i \textcolor{#01B636}{- \overline{x}) + \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \overline{x}}= 0\\ \end{align}

Even though we have added the part in green, we can show that the second term is zero as:

i=1N(yiw1xi(yw1x))=i=1Nyiw1i=1NxiNy+Nw1x=0\begin{align*} \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) = \textcolor{#ca0047}{\sum_{i=1}^N y_i} \textcolor{#8047cd}{- w_1 \sum_{i=1}^N x_i} \textcolor{#ca0047}{- N \overline{y}} \textcolor{#8047cd}{+ N w_1 \overline{x}} = 0 \end{align*}

Therefore, from equation (9)(9), we get:

    i=1N(yiw1xi(yw1x))(xix)=0\begin{equation*} \implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \left(x_i - \overline{x}\right) = 0 \end{equation*}
w1=i=1N(yiy)(xix)i=1N(xix)2=rxyσyσx\begin{align} \therefore w_1 = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right) \left(x_i - \overline{x}\right)}{\displaystyle\sum_{i=1}^N \left( x_i - \overline{x} \right)^2} = r_{xy} \cdot \frac{\sigma_y}{\sigma_x} \end{align}

Here, σy\sigma_y and σx\sigma_x are standard deviations of xx and yy.

rxyr_{xy} is the Pearson's sample correlation coefficient between xx and yy defined as:

rxy=i=1N(yiy)(xix)i=1N(yiy)2i=1N(xix)2\begin{equation} r_{xy} = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right) \left(x_i - \overline{x}\right)}{\displaystyle\sqrt{\sum_{i=1}^N \left( y_i - \overline{y} \right)^2} \displaystyle\sqrt{\sum_{i=1}^N \left( x_i - \overline{x} \right)^2}} \end{equation}

Multiple Linear Regression

In multiple linear regression, there are multiple independent variables in the equation rather than just one. The mathematics behind multiple linear regression involves using matrix operations to solve for the coefficients of the regression equation.

The regression equation is now represented as:

y=w0+w1x1+w2x2++wdxd\begin{equation} y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_dx_d \end{equation}

where x1,x2,,xdx_1, x_2, \cdots, x_d are the independent variables and w0,w1,,wdw_0, w_1, \cdots, w_d are the coefficients of the equation.

We define w=(w0w1wd)\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} and augment each data point as x=(1x1xd)\mathbf{x} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{pmatrix}, so that the equation can be written as:

y=w,x=wTx\begin{equation} y = \langle \mathbf{w}, \mathbf{x} \rangle = \mathbf{w}^T \mathbf{x} \end{equation}

Least Squares Criterion

JN(w)=12NyΦw2\begin{equation} J_{\tiny N}(\mathbf{w}) = \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2 \end{equation}


y=(y1y2yN)\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{\tiny N} \end{pmatrix} is the vector of target values.

Φw=(x1,wx2,wxN,w)=(x1Twx2TwxNTw)=(x1Tx2TxNT)w\Phi \mathbf{w} = \begin{pmatrix} \langle \mathbf{x_1}, \mathbf{w} \rangle \\ \langle \mathbf{x_2}, \mathbf{w} \rangle \\ \vdots \\ \langle \mathbf{x_{\tiny N}}, \mathbf{w} \rangle \end{pmatrix} = \begin{pmatrix} \mathbf{x_1}^T \mathbf{w} \\ \mathbf{x_2}^T \mathbf{w} \\ \vdots \\ \mathbf{x_{\tiny N}}^T \mathbf{w} \end{pmatrix} = \begin{pmatrix} \mathbf{x_1}^T \\ \mathbf{x_2}^T \\ \vdots \\ \mathbf{x_{\tiny N}}^T \end{pmatrix}\mathbf{w} is the vector of predictions.

Φ=(x1Tx2TxNT)=(1x11x12x1d1x21x22x2d1xN1xN2xNd)\Phi = \begin{pmatrix} \mathbf{x_1}^T \\ \mathbf{x_2}^T \\ \vdots \\ \mathbf{x_{\tiny N}}^T \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{\tiny{N}1} & x_{\tiny{N}2} & \cdots & x_{\tiny{N}d} \end{pmatrix} is called the design matrix. It is of size N×(d+1)N \times (d+1).

w=(w0w1wd)\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix} is the vector of weights.

We want to find the minimizer for JN(w)J_{\tiny N}(\mathbf{w}), so we solve:

JN(w)=0\begin{equation*} \nabla J_{\tiny N}(\mathbf{w}) = 0 \end{equation*}
    w(yΦw2)=0    2ΦT(yΦw)=0    ΦTy=ΦTΦw\begin{align*} \implies &\nabla_{\mathbf{w}} \begin{pmatrix} \| \mathbf{y} - \Phi \mathbf{w}\|^2 \end{pmatrix} = 0\\ \implies &-2\Phi^T \left( \mathbf{y} - \Phi \mathbf{w} \right) = 0\\ \implies &\Phi^T \mathbf{y} = \Phi^T \Phi \mathbf{w} \end{align*}
w^=(ΦTΦ)1ΦTy\begin{equation} \therefore \mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{y} \end{equation}

Note that we are implicitly assuming that ΦTΦ\Phi^T \Phi is an invertible matrix. If either the number of linearly independent examples is less than the number of features, or if the features are not linearly independent, then ΦTΦ\Phi^T \Phi is not invertible.

This least squared solution is an estimator for w\mathbf{w}. It represents the vector normal to the hyperplane that minimizes the sum of squared errors between the target values and the predictions.

However, this is not enough, we need to be sure of how confident we are in our predictions.

Homoscedastic Model

Assume that target values yiy_i are represented by random variables Y1,Y2,,YNY_1, Y_2, \cdots, Y_{\tiny N} that are independent and identically distributed (i.i.d.). Then,

Yi=w,xi+ϵi\begin{equation} Y_i = \langle \mathbf{w^{*}}, \mathbf{x_i} \rangle + \epsilon_i \end{equation}

where w\mathbf{w^{*}} is the vector of true weights that generates the data and ϵiN(0,σ2)\mathbf{\epsilon}_i \sim \mathcal{N}(0, \sigma^2) is an independent gaussian noise term.

We can write Y\mathbf{Y} as:

Y=Φw+ϵ    (Y1Y2YN)=(x1,wx2,wxN,w)+(ϵ1ϵ2ϵN)\begin{align} &\mathbf{Y} = \Phi \mathbf{w^{*}} + \Large{\mathbf{\epsilon}}\\ \implies & \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{\tiny N} \end{pmatrix} = \begin{pmatrix} \langle \mathbf{x_1}, \mathbf{w^{*}} \rangle \\ \langle \mathbf{x_2}, \mathbf{w^{*}} \rangle \\ \vdots \\ \langle \mathbf{x_{\tiny N}}, \mathbf{w^{*}} \rangle \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{\tiny N} \end{pmatrix}\\ \end{align}

Here, ϵ\Large \mathbf{\epsilon} N(0,σ2IN)\sim \mathcal{N}(\mathbf{0}, \sigma^2 I_{\tiny N}) where INI_{\tiny N} is the N×NN \times N identity matrix.

We can show this as:

E[(ϵ1ϵ2ϵN)]=[E(ϵ1)E(ϵ2)E(ϵN)]=0\mathbb{E} \left[\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{\tiny N} \end{pmatrix} \right] = \left[\begin{array}{c} \mathbb{E} \left( \epsilon_1 \right) \\ \mathbb{E} \left( \epsilon_2 \right) \\ \vdots \\ \mathbb{E} \left(\epsilon_{\tiny N}\right) \end{array}\right] = \mathbf{0}

Cov(ϵ)=(Cov(ϵi,ϵj))1 i, j N=(σ20000σ200000σ2)=σ2IN\mathrm{Cov}(\mathbf{\Large{\epsilon}}) = \begin{pmatrix} \mathrm{Cov}(\epsilon_i, \epsilon_j) \end{pmatrix}_{1 \leq~i,~j~\leq N} = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I_{\tiny N}

 ij:Cov(ϵi,ϵj)=0\because \forall ~i \neq j \mathrm{: } \mathrm{Cov}(\epsilon_i, \epsilon_j) = 0 and  i=j :Cov(ϵi,ϵj)=Var(ϵi)=σ2\forall ~i = j \mathrm{~: Cov}(\epsilon_i, \epsilon_j) = \mathrm{Var}(\epsilon_i) = \sigma^2

w^\mathbf{\hat{w}} is our estimator for w\mathbf{w^{*}}. We can write:

w^=(ΦTΦ)1ΦTY\begin{equation} \mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{Y} \end{equation}

Substituting value of Y\mathbf{Y} from equation (17)(17), we can write:

w^=(ΦTΦ)1ΦT(Φw+ϵ)    w^=(ΦTΦ)1ΦTΦw+(ΦTΦ)1ΦTϵ    w^=w+(ΦTΦ)1ΦTϵ\begin{align} &\mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \left( \Phi \mathbf{w^{*}} + \Large{\mathbf{\epsilon}} \right)\\ \implies &\mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \Large{\mathbf{\epsilon}}\\ \implies &\mathbf{\hat{w}} = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \Large{\mathbf{\epsilon}} \end{align}

Then taking expectation of both sides of the equation (22)(22), we get:

    E[w^]=E[w]+E[(ΦTΦ)1ΦTϵ]    E[w^]=w+E[(ΦTΦ)1ΦTϵ]    E[w^]=w+(ΦTΦ)1ΦTE[ϵ]    E[w^]=w+(ΦTΦ)1ΦT0\begin{align*} \implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbb{E} \left[ \mathbf{w^{*}} \right] + \mathbb{E} \left[ \left( \Phi^T \Phi \right)^{-1} \Phi^T \Large{\mathbf{\epsilon}} \right]\\ \implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \mathbb{E} \left[ \left( \Phi^T \Phi \right)^{-1} \Phi^T \Large{\mathbf{\epsilon}} \right]\\ \implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbb{E} \left[ \Large{\mathbf{\epsilon}} \right]\\ \implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{0}\\ \end{align*}
E[w^]=w\begin{equation} \therefore \mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} \end{equation}

We can see that our least-squared estimator w^\mathbf{\hat{w}} is unbiased in probability with the true w\mathbf{w^{*}}.


Any affine transformation of a gaussian vector zN(0,Σ)\mathbf{z} \sim \mathcal{N} (\mathbf{0}, \Sigma) is also a gaussian vector:

Then, if AA is a matrix (that represents a linear transformation):

AzN(0,AΣAT)\begin{equation} A\mathbf{z} \sim \mathcal{N} (\mathbf{0}, A \Sigma A^T) \end{equation}

Then, if A=(ΦTΦ)1ΦTA = \left( \Phi^T \Phi \right)^{-1} \Phi^T:

w^N(w,Aσ2INAT)=N(w,σ2(ΦTΦ)1ΦT((ΦTΦ)1ΦT)T)=N(w,σ2(ΦTΦ)1ΦTΦ(ΦTΦ)T)=N(w,σ2(ΦTΦ)1ΦTΦ(ΦTΦ)1) [ (ΦTΦ)T=(ΦTΦ) ]\begin{align*} \mathbf{\hat{w}} &\sim \mathcal{N} ( \mathbf{w^{*}}, A \sigma^2 \mathbb{I}_N A^T )\\ &= \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T ( \left( \Phi^T \Phi \right)^{-1} \Phi^T)^T)\\ &= \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \left( \Phi^T \Phi \right)^{-T})\\ &= \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \left( \Phi^T \Phi \right)^{-1}) ~[~\because\left(\Phi^T \Phi \right)^{T} = \left( \Phi^T \Phi \right)~] \end{align*}
w^N(w,σ2(ΦTΦ)1)\begin{equation} \therefore \mathbf{\hat{w}} \sim \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} ) \end{equation}

Confidence Interval

To compute confidence interval on the weights i.e. on the ithi^{th} component of w^=(w0^w1^wd^)N(w,σ2(ΦTΦ)1)\mathbf{\hat{w}} = \begin{pmatrix} \hat{w_0} \\ \hat{w_1} \\ \vdots \\ \hat{w_d} \end{pmatrix} \sim \mathcal{N}( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} ) , we define:

S=(ΦTΦ)1=(s00s11s1ds10s11s1dsd0sd1sdd)S = \left( \Phi^T \Phi \right)^{-1} = \begin{pmatrix} s_{00} & s_{11} & \cdots & s_{1d} \\ s_{10} & s_{11} & \cdots & s_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ s_{d0} & s_{d1} & \cdots & s_{dd} \end{pmatrix}

Then, we can write, Var(w^)=σ2S\text{Var}(\hat{w}) = \sigma^2 S, which implies:

Var(wi^)=σ2siiVar(\hat{w_i}) = \sigma^2 s_{ii}


wi^N(wi,σ2sii)    wi^wiσsiiN(0,1)\begin{align*} \hat{w_i} &\sim \mathcal{N}( w^{*}_i, \sigma^2 s_{ii})\\ \implies \frac{\hat{w_i} - w^{*}_i}{\sigma \sqrt{s_{ii}}} &\sim \mathcal{N}(0, 1)\\ \end{align*}

Then, we can write:

P(zα/2wi^wiσsiizα/2)=1α    P(wi^zα/2σsiiwiwi^+zα/2σsii)=1α\begin{align*} \mathbb{P} \left( -z_{\alpha/2} \leq \frac{\hat{w_i} - w^{*}_i}{\sigma \sqrt{s_{ii}}} \leq z_{\alpha/2} \right) &= 1 - \alpha\\ \implies \mathbb{P} \left( \hat{w_i} - z_{\alpha/2} \sigma \sqrt{s_{ii}} \leq w^{*}_i \leq \hat{w_i} + z_{\alpha/2} \sigma \sqrt{s_{ii}} \right) &= 1 - \alpha\\ \end{align*}

For example, say that we want to compute the 95%95\% confidence interval for the ithi^{th} component of w^\mathbf{\hat{w}}, then α=0.05\alpha = 0.05 and zα/2=1.96z_{\alpha/2} = 1.96:

    P(wi^1.96σsiiwiwi^+1.96σsii)=0.95    P(wi[wi^±1.96σsii ])=0.95\begin{align*} \implies \mathbb{P} \left( \hat{w_i} - 1.96 \sigma \sqrt{s_{ii}} \leq w^{*}_i \leq \hat{w_i} + 1.96 \sigma \sqrt{s_{ii}} \right) &= 0.95\\ \implies \mathbb{P} \left( w^{*}_i \in \left[ \hat{w_i} \pm 1.96 \sigma \sqrt{s_{ii}} ~\right] \right) &= 0.95\\ \end{align*}

Here, [wi^±1.96σsii ]\left[ \hat{w_i} \pm 1.96 \sigma \sqrt{s_{ii}} ~\right] is the 95%95\% confidence interval for the ithi^{th} component of w^\mathbf{\hat{w}}.

Practical Problem

In practice, we don't know the true σ2\sigma^2 and we have to estimate it from the data. We can estimate it using the following formula:

σ^2=1Nd1i=1N(Yiw^,xi)2\begin{equation} \hat{\sigma}^2 = \frac{1}{N - d - 1} \sum_{i=1}^{N} \left( Y_i - \langle \mathbf{\hat{w}}, \mathbf{x_i} \rangle \right)^2 \end{equation}

Here, dd is the number of features and NN is the number of samples.

The term Nd1N - d - 1 rather than NN in the denominator is used to make σ^2\hat{\sigma}^2 an unbiased estimate of σ2\sigma^2.

Now, the empirical variance of the residual is nearly equal to σ2\sigma^2. The term (wi^wiσ^sii)\left( \frac{\hat{w_i} - w^{*}_i}{\hat{\sigma} \sqrt{s_{ii}}} \right) is now asymptotically gaussian with mean 00 and variance 11.


  • If the value of NN is small, it follows a t-distribution (Student distribution). The exact Confidence Intervals can even be derived using Cochran's theorem.
  • Our math for confidence intervals relies on the assumption that our data is homoscedastic. In the case that it is not, we can compute the confidence intervals empirically using the bootstrap method, which is based on resampling and the law of large numbers.

Bias-Variance Tradeoff


Currently, we are using the following loss function:

L(w)=12NyΦw2\begin{equation} \mathcal{L}(\mathbf{w}) = \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2 \end{equation}

We can also add a regularization term in the loss function, to prevent overfitting:

L(w)=12NyΦw2+λ2w2\begin{equation} \mathcal{L}(\mathbf{w}) = \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 \end{equation}

This is called Ridge or L2 Regularization. Here, λ\lambda is the regularization parameter. It is a hyperparameter and can be tuned using cross-validation.

Note: We can also use other regularization methods like lasso (L1) regularization, square-root lasso, etc.

Now, to find a value of w\mathbf{w} that minimizes the loss function, we will solve:

L(w)=0\begin{equation*} \nabla \mathcal{L}(\mathbf{w}) = 0 \end{equation*}
    w(12NyΦw2+λ2w2)=0    1NΦT(yΦw)+λw=0    ΦTΦw+Nλw=ΦTy    (ΦTΦ+NλId)w=ΦTy\begin{align*} &\implies \nabla_w \left( \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 \right) = 0\\ &\implies \frac{1}{N} \Phi^T \left( \mathbf{y} - \Phi \mathbf{w} \right) + \lambda \mathbf{w} = 0\\ &\implies \Phi^T \Phi \mathbf{w} + N \lambda \mathbf{w} = \Phi^T \mathbf{y}\\ &\implies \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right) \mathbf{w} = \Phi^T \mathbf{y}\\ \end{align*}
w^λ=(ΦTΦ+NλId)1ΦTy\begin{equation} \therefore \mathbf{\hat{w}_\lambda} = \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \mathbf{y} \end{equation}

Note: Now, even if d>Nd > N, (ΦTΦ+NλId)1\left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} is well defined because of the shift.

Prediction Error

Let's now look at the impact λ\lambda on the generalization error.

yiyi^=w,xi+ϵiwλ^,xi=ww^λ,xi+ϵi\begin{align} y_i - \hat{y_i} &= \langle \mathbf{w}^*, \mathbf{x_i} \rangle + \epsilon_i - \langle \mathbf{\hat{w_\lambda}}, \mathbf{x_i} \rangle\\ &= \langle \mathbf{w}^* - \mathbf{\hat{w}_\lambda}, \mathbf{x_i} \rangle + \epsilon_i\\ \end{align}


ww^λ=w(ΦTΦ+NλId)1ΦTy=w(ΦTΦ+NλId)1ΦT(Φw+ϵ)=w(ΦTΦ+NλId)1ΦTΦw(ΦTΦ+NλId)1ΦTϵ=w(ΦTΦ+NλId)1(ΦTΦ+NλId)w+(ΦTΦ+NλId)1Nλw(ΦTΦ+NλId)1ΦTϵ=wIdw+(ΦTΦ+NλId)1Nλw(ΦTΦ+NλId)1ΦTϵ=(ΦTΦ+NλId)1Nλw(ΦTΦ+NλId)1ΦTϵ\begin{align*} \mathbf{w}^* - \mathbf{\hat{w}_\lambda} &= \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \mathbf{y}\\ &= \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \left( \Phi \mathbf{w}^* + \epsilon \right)\\ &= \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \Phi \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \epsilon\\ &= \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \textcolor{#01B636}{(} \Phi^T \Phi \textcolor{#01B636}{+ N \lambda \mathbb{I_d} )} \mathbf{w}^* \textcolor{#E63E6D}{+ \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} N \lambda \mathbf{w}^*} - \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \epsilon\\ &= \textcolor{#01B636}{\mathbf{w}^*} \textcolor{#E63E6D}{- \mathbb{I_d} \mathbf{w}^*} + \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} N \lambda \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \epsilon\\ &= \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} N \lambda \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \epsilon \end{align*}

Substituting this value in the above equation, we get:

yiyi^=Nλ(ΦTΦ+NλId)1w,xi(ΦTΦ+NλId)1ΦTϵ,xi+ϵi\begin{align*} y_i - \hat{y_i} &= \textcolor{#C060A1}{\langle N \lambda \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \mathbf{w}^*, \mathbf{x_i} \rangle} - \textcolor{#03C988}{\langle \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \epsilon, \mathbf{x_i} \rangle} + \textcolor{8B9A46}{\epsilon_i}\\ \end{align*}


  • Nλ(ΦTΦ+NλId)1w,xi\textcolor{#C060A1}{\langle N \lambda \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \mathbf{w}^*, \mathbf{x_i} \rangle} is the non-zero bias term, which is deterministic. (It is 00 if λ=0\lambda = 0.)
  • (ΦTΦ+NλId)1ΦTϵ,xi\textcolor{#03C988}{\langle \left( \Phi^T \Phi + N \lambda \mathbb{I_d} \right)^{-1} \Phi^T \epsilon, \mathbf{x_i} \rangle} is the variance term.
  • ϵi\textcolor{8B9A46}{\epsilon_i} is the noise term which is uncontrollable.

Here, if we take the expectation of yiyi^y_i - \hat{y_i}, we get a non-zero term, which is the bias term. It's a pity because our non-regularized model had an unbiased least squared estimator (expectation was the ground truth). Therefore, by introducing a bias in our model, we are either underestimating or overestimating the ground truth.

The interest of introducing a bias in our model is that it enables us to control the variance term. Therefore, with a small λ\lambda, we have a high variance and a low bias. On the other hand, with a large λ\lambda, we have a low variance and a high bias. We call this tradeoff the bias-variance tradeoff.