These are my notes on the mathematical foundations of linear regression, taken while attending CSCI-UA 9473 - Foundations of Machine Learning at NYU Paris. They use probability theory, statistics, calculus, optimization, and linear algebra to formalize the concept of linear regression.
Introduction
Linear regression is one of the most widely used statistical techniques in data analysis and machine learning. It provides a simple and intuitive linear model of the relationship between a response variable and a set of explanatory variables.
Simple Linear Regression
Least Squares Criterion
\begin{equation}
J_{\tiny N}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)^2
\end{equation}
where $\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \end{pmatrix}$. We want to find $\mathbf{w}$ such that $J_{\tiny N}(\mathbf{w})$ is minimized.
Therefore, let's compute $\nabla J_{\tiny N}(\mathbf{w})$:
\begin{equation}
\nabla J_{\tiny N}(\mathbf{w}) = \left[\begin{array}{c}
\dfrac{\partial J_{\tiny N}}{\partial w_0}(\mathbf{w})\\ \\
\dfrac{\partial J_{\tiny N}}{\partial w_1}(\mathbf{w})\\
\end{array}\right]
\end{equation}
where:
\begin{align}
\dfrac{\partial J_{\tiny N}}{\partial w_0}(\mathbf{w}) &= -\frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)\\
\dfrac{\partial J_{\tiny N}}{\partial w_1}(\mathbf{w}) &= -\frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)x_i
\end{align}
Setting them equal to zero gives the critical point $(w_0, w_1)$; since $J_{\tiny N}$ is convex, this critical point is the minimizer of $J_{\tiny N}(\mathbf{w})$:
\begin{equation*}
\dfrac{\partial J_{\tiny N}}{\partial w_0}(\mathbf{w}) = 0
\end{equation*}
\begin{align*}
&\implies \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right) = 0\\
&\implies w_0 = \frac{1}{N} \sum_{i=1}^N \left( y_i - w_1x_i \right)\\
&\implies w_0 = \left(\frac{1}{N} \sum_{i=1}^N y_i \right) - w_1 \left(\frac{1}{N} \sum_{i=1}^N x_i\right)
\end{align*}
\begin{equation}
\therefore w_0 = \overline{y} - w_1 \overline{x}
\end{equation}
Similarly:
\begin{equation*}
\dfrac{\partial J_{\tiny N}}{\partial w_1}(\mathbf{w}) = 0
\end{equation*}
\begin{align*}
&\implies \sum_{i=1}^N \left( y_i - w_1x_i - w_0 \right)x_i = 0
\end{align*}
Substituting the value of $w_0$, we get:
\begin{align}
&\implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right)x_i = 0\\
&\implies \sum_{i=1}^N \left( y_i - \overline{y} \right)x_i - w_1 \sum_{i=1}^N \left( x_i - \overline{x} \right)x_i = 0\\
&\implies w_1 = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right)x_i}{\displaystyle\sum_{i=1}^N \left( x_i - \overline{x} \right)x_i}
\end{align}
However, $w_1$ is commonly written in terms of Pearson's sample correlation coefficient $r_{xy}$.
For that, a slight modification of equation $(6)$ is required:
\begin{align}
&\implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \textcolor{#01B636}{(}x_i \textcolor{#01B636}{- \overline{x}) + \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \overline{x}}= 0
\end{align}
Although we have added the part in green (splitting $x_i$ into $(x_i - \overline{x}) + \overline{x}$), the equation is unchanged, because the second term is zero:
\begin{align*}
\sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) = \textcolor{#ca0047}{\sum_{i=1}^N y_i} \textcolor{#8047cd}{- w_1 \sum_{i=1}^N x_i} \textcolor{#ca0047}{- N \overline{y}} \textcolor{#8047cd}{+ N w_1 \overline{x}} = 0
\end{align*}
Therefore, from equation $(9)$, we get:
\begin{equation*}
\implies \sum_{i=1}^N \left( y_i - w_1x_i - \left(\overline{y} - w_1 \overline{x}\right) \right) \left(x_i - \overline{x}\right) = 0
\end{equation*}
\begin{align}
\therefore w_1 = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right) \left(x_i - \overline{x}\right)}{\displaystyle\sum_{i=1}^N \left( x_i - \overline{x} \right)^2} = r_{xy} \cdot \frac{\sigma_y}{\sigma_x}
\end{align}
Here, $\sigma_y$ and $\sigma_x$ are the sample standard deviations of $y$ and $x$, respectively.
$r_{xy}$ is Pearson's sample correlation coefficient between $x$ and $y$, defined as:
\begin{equation}
r_{xy} = \frac{\displaystyle\sum_{i=1}^N \left( y_i - \overline{y} \right) \left(x_i - \overline{x}\right)}{\displaystyle\sqrt{\sum_{i=1}^N \left( y_i - \overline{y} \right)^2} \displaystyle\sqrt{\sum_{i=1}^N \left( x_i - \overline{x} \right)^2}}
\end{equation}
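As a sanity check of these closed-form expressions, here is a minimal NumPy sketch (the synthetic data and variable names are my own, not part of the course material) that computes $w_0$ and $w_1$ both directly and via $r_{xy}$:

```python
import numpy as np

# Synthetic 1-D data, assumed purely for illustration: y ≈ 1 + 2x plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=100)

x_bar, y_bar = x.mean(), y.mean()

# Slope: w1 = sum((y_i - y_bar)(x_i - x_bar)) / sum((x_i - x_bar)^2)
w1 = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: w0 = y_bar - w1 * x_bar
w0 = y_bar - w1 * x_bar

# Equivalent slope via Pearson's r and the standard deviations.
r_xy = np.corrcoef(x, y)[0, 1]
w1_via_r = r_xy * y.std() / x.std()

print(w0, w1, w1_via_r)  # w1 and w1_via_r agree up to floating-point error
```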
Multiple Linear Regression
In multiple linear regression, there are multiple independent variables in the equation rather than just one. The mathematics behind multiple linear regression involves using matrix operations to solve for the coefficients of the regression equation.
The regression equation is now represented as:
\begin{equation}
y = w_0 + w_1x_1 + w_2x_2 + \cdots + w_dx_d
\end{equation}
where $x_1, x_2, \cdots, x_d$ are the independent variables and $w_0, w_1, \cdots, w_d$ are the coefficients of the equation.
We define $\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix}$ and augment each data point as $\mathbf{x} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_d \end{pmatrix}$, so that the equation can be written as:
\begin{equation}
y = \langle \mathbf{w}, \mathbf{x} \rangle = \mathbf{w}^T \mathbf{x}
\end{equation}
Least Squares Criterion
\begin{equation}
J_{\tiny N}(\mathbf{w}) = \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2
\end{equation}
where:
$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{\tiny N} \end{pmatrix}$ is the vector of target values.
$\Phi \mathbf{w} = \begin{pmatrix} \langle \mathbf{x_1}, \mathbf{w} \rangle \\ \langle \mathbf{x_2}, \mathbf{w} \rangle \\ \vdots \\ \langle \mathbf{x_{\tiny N}}, \mathbf{w} \rangle \end{pmatrix} = \begin{pmatrix} \mathbf{x_1}^T \mathbf{w} \\ \mathbf{x_2}^T \mathbf{w} \\ \vdots \\ \mathbf{x_{\tiny N}}^T \mathbf{w} \end{pmatrix} = \begin{pmatrix} \mathbf{x_1}^T \\ \mathbf{x_2}^T \\ \vdots \\ \mathbf{x_{\tiny N}}^T \end{pmatrix}\mathbf{w}$ is the vector of predictions.
$\Phi = \begin{pmatrix} \mathbf{x_1}^T \\ \mathbf{x_2}^T \\ \vdots \\ \mathbf{x_{\tiny N}}^T \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1d} \\ 1 & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{\tiny{N}1} & x_{\tiny{N}2} & \cdots & x_{\tiny{N}d} \end{pmatrix}$ is called the design matrix. It is of size $N \times (d+1)$.
$\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix}$ is the vector of weights.
We want to find the minimizer of $J_{\tiny N}(\mathbf{w})$, so we solve:
\begin{equation*}
\nabla J_{\tiny N}(\mathbf{w}) = 0
\end{equation*}
\begin{align*}
\implies &\nabla_{\mathbf{w}} \left( \| \mathbf{y} - \Phi \mathbf{w}\|^2 \right) = 0\\
\implies &-2\Phi^T \left( \mathbf{y} - \Phi \mathbf{w} \right) = 0\\
\implies &\Phi^T \mathbf{y} = \Phi^T \Phi \mathbf{w}
\end{align*}
\begin{equation}
\therefore \mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{y}
\end{equation}
Note that we are implicitly assuming that $\Phi^T \Phi$ is invertible, which holds exactly when $\Phi$ has full column rank. If the number of examples is smaller than the number of features ($N < d+1$), or if the features (columns of $\Phi$) are linearly dependent, then $\Phi^T \Phi$ is not invertible.
This least-squares solution is an estimator for $\mathbf{w}$. It represents the vector normal to the hyperplane that minimizes the sum of squared errors between the target values and the predictions.
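A minimal NumPy sketch of this closed-form solution, under an assumed synthetic dataset (all names here are mine). In practice one solves the linear system rather than explicitly inverting $\Phi^T \Phi$:

```python
import numpy as np

# Synthetic multi-feature data, assumed for illustration only.
rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5, 3.0])          # [w0, w1, w2, w3]
Phi = np.hstack([np.ones((N, 1)), X])             # design matrix, shape N x (d+1)
y = Phi @ w_true + rng.normal(0, 0.3, size=N)

# Normal equations: solve (Phi^T Phi) w = Phi^T y instead of forming the inverse.
w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Numerically safer alternative that also handles rank-deficient Phi.
w_hat_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print(w_hat)        # close to w_true
print(w_hat_lstsq)  # same solution
```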
However, this is not enough: we also need to quantify how confident we are in this estimate.
Homoscedastic Model
Assume that the target values $y_i$ are realizations of independent random variables $Y_1, Y_2, \cdots, Y_{\tiny N}$. Then,
\begin{equation}
Y_i = \langle \mathbf{w^{*}}, \mathbf{x_i} \rangle + \epsilon_i
\end{equation}
where $\mathbf{w^{*}}$ is the vector of true weights that generates the data and the $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ are i.i.d. Gaussian noise terms.
We can write $\mathbf{Y}$ as:
\begin{align}
&\mathbf{Y} = \Phi \mathbf{w^{*}} + \boldsymbol{\epsilon}\\
\implies & \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_{\tiny N} \end{pmatrix} = \begin{pmatrix} \langle \mathbf{x_1}, \mathbf{w^{*}} \rangle \\ \langle \mathbf{x_2}, \mathbf{w^{*}} \rangle \\ \vdots \\ \langle \mathbf{x_{\tiny N}}, \mathbf{w^{*}} \rangle \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{\tiny N} \end{pmatrix}
\end{align}
Here, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I_{\tiny N})$, where $I_{\tiny N}$ is the $N \times N$ identity matrix.
We can show this as:
$$\mathbb{E} \left[\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{\tiny N} \end{pmatrix} \right] = \left[\begin{array}{c} \mathbb{E} \left( \epsilon_1 \right) \\ \mathbb{E} \left( \epsilon_2 \right) \\ \vdots \\ \mathbb{E} \left(\epsilon_{\tiny N}\right) \end{array}\right] = \mathbf{0}$$
$$\mathrm{Cov}(\boldsymbol{\epsilon}) = \left( \mathrm{Cov}(\epsilon_i, \epsilon_j) \right)_{1 \leq i, j \leq N} = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I_{\tiny N}$$
$\because ~\forall~ i \neq j:~ \mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ and $\forall~ i:~ \mathrm{Cov}(\epsilon_i, \epsilon_i) = \mathrm{Var}(\epsilon_i) = \sigma^2$.
$\mathbf{\hat{w}}$ is our estimator for $\mathbf{w^{*}}$. We can write:
\begin{equation}
\mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{Y}
\end{equation}
Substituting the value of $\mathbf{Y}$ from equation $(17)$, we can write:
\begin{align}
&\mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \left( \Phi \mathbf{w^{*}} + \boldsymbol{\epsilon} \right)\\
\implies &\mathbf{\hat{w}} = \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \boldsymbol{\epsilon}\\
\implies &\mathbf{\hat{w}} = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \boldsymbol{\epsilon}
\end{align}
Then, taking the expectation of both sides of equation $(22)$, we get:
\begin{align*}
\implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbb{E} \left[ \mathbf{w^{*}} \right] + \mathbb{E} \left[ \left( \Phi^T \Phi \right)^{-1} \Phi^T \boldsymbol{\epsilon} \right]\\
\implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \mathbb{E} \left[ \left( \Phi^T \Phi \right)^{-1} \Phi^T \boldsymbol{\epsilon} \right]\\
\implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbb{E} \left[ \boldsymbol{\epsilon} \right]\\
\implies &\mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}} + \left( \Phi^T \Phi \right)^{-1} \Phi^T \mathbf{0}
\end{align*}
\begin{equation}
\therefore \mathbb{E} \left[ \mathbf{\hat{w}} \right] = \mathbf{w^{*}}
\end{equation}
We can see that our least-squares estimator $\mathbf{\hat{w}}$ is an unbiased estimator of the true $\mathbf{w^{*}}$.
Lemma
Any affine transformation of a Gaussian vector is also a Gaussian vector. In particular, if $\mathbf{z} \sim \mathcal{N} (\mathbf{0}, \Sigma)$ and $A$ is a matrix (representing a linear transformation), then:
\begin{equation}
A\mathbf{z} \sim \mathcal{N} (\mathbf{0}, A \Sigma A^T)
\end{equation}
Then, taking $A = \left( \Phi^T \Phi \right)^{-1} \Phi^T$ in equation $(22)$, so that $\mathbf{\hat{w}} = \mathbf{w^{*}} + A\boldsymbol{\epsilon}$:
\begin{align*}
\mathbf{\hat{w}} &\sim \mathcal{N}\left( \mathbf{w^{*}}, A \,\sigma^2 I_{\tiny N} A^T \right)\\
&= \mathcal{N}\left( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T \left( \left( \Phi^T \Phi \right)^{-1} \Phi^T \right)^T \right)\\
&= \mathcal{N}\left( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \left( \Phi^T \Phi \right)^{-T} \right)\\
&= \mathcal{N}\left( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \Phi^T \Phi \left( \Phi^T \Phi \right)^{-1} \right) ~\left[~\because \left(\Phi^T \Phi \right)^{T} = \Phi^T \Phi~\right]
\end{align*}
\begin{equation}
\therefore \mathbf{\hat{w}} \sim \mathcal{N}\left( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \right)
\end{equation}
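A small Monte Carlo sketch (entirely synthetic, with variable names of my own choosing) that redraws the noise many times and checks that the empirical mean and covariance of $\mathbf{\hat{w}}$ match $\mathbf{w^{*}}$ and $\sigma^2 \left( \Phi^T \Phi \right)^{-1}$:

```python
import numpy as np

# Fixed design, assumed for illustration; only the noise is redrawn on each trial.
rng = np.random.default_rng(2)
N, d, sigma = 100, 2, 0.5
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_star = np.array([1.0, 2.0, -1.0])

w_hats = []
for _ in range(5000):
    eps = rng.normal(0, sigma, size=N)
    Y = Phi @ w_star + eps
    w_hats.append(np.linalg.solve(Phi.T @ Phi, Phi.T @ Y))
w_hats = np.array(w_hats)

print(w_hats.mean(axis=0))                  # ≈ w_star (unbiasedness)
print(np.cov(w_hats, rowvar=False))         # ≈ sigma^2 * (Phi^T Phi)^{-1}
print(sigma**2 * np.linalg.inv(Phi.T @ Phi))
```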
Confidence Interval
To compute a confidence interval on the weights, i.e. on the $i^{th}$ component of $\mathbf{\hat{w}} = \begin{pmatrix} \hat{w_0} \\ \hat{w_1} \\ \vdots \\ \hat{w_d} \end{pmatrix} \sim \mathcal{N}\left( \mathbf{w^{*}}, \sigma^2 \left( \Phi^T \Phi \right)^{-1} \right)$, we define:
$$S = \left( \Phi^T \Phi \right)^{-1} = \begin{pmatrix} s_{00} & s_{01} & \cdots & s_{0d} \\ s_{10} & s_{11} & \cdots & s_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ s_{d0} & s_{d1} & \cdots & s_{dd} \end{pmatrix}$$
Then we can write $\mathrm{Var}(\mathbf{\hat{w}}) = \sigma^2 S$, which implies:
$$\mathrm{Var}(\hat{w_i}) = \sigma^2 s_{ii}$$
Now,
\begin{align*}
\hat{w_i} &\sim \mathcal{N}( w^{*}_i, \sigma^2 s_{ii})\\
\implies \frac{\hat{w_i} - w^{*}_i}{\sigma \sqrt{s_{ii}}} &\sim \mathcal{N}(0, 1)
\end{align*}
Then, we can write:
\begin{align*}
\mathbb{P} \left( -z_{\alpha/2} \leq \frac{\hat{w_i} - w^{*}_i}{\sigma \sqrt{s_{ii}}} \leq z_{\alpha/2} \right) &= 1 - \alpha\\
\implies \mathbb{P} \left( \hat{w_i} - z_{\alpha/2} \sigma \sqrt{s_{ii}} \leq w^{*}_i \leq \hat{w_i} + z_{\alpha/2} \sigma \sqrt{s_{ii}} \right) &= 1 - \alpha
\end{align*}
For example, say that we want to compute the $95\%$ confidence interval for the $i^{th}$ component of $\mathbf{\hat{w}}$; then $\alpha = 0.05$ and $z_{\alpha/2} = 1.96$:
\begin{align*}
\implies \mathbb{P} \left( \hat{w_i} - 1.96 \sigma \sqrt{s_{ii}} \leq w^{*}_i \leq \hat{w_i} + 1.96 \sigma \sqrt{s_{ii}} \right) &= 0.95\\
\implies \mathbb{P} \left( w^{*}_i \in \left[ \hat{w_i} \pm 1.96 \sigma \sqrt{s_{ii}} ~\right] \right) &= 0.95
\end{align*}
Here, $\left[ \hat{w_i} \pm 1.96 \sigma \sqrt{s_{ii}} ~\right]$ is the $95\%$ confidence interval for $w^{*}_i$, the $i^{th}$ component of the true weight vector.
Practical Problem
In practice, we don't know the true $\sigma^2$ and we have to estimate it from the data. We can estimate it using the following formula:
\begin{equation}
\hat{\sigma}^2 = \frac{1}{N - d - 1} \sum_{i=1}^{N} \left( Y_i - \langle \mathbf{\hat{w}}, \mathbf{x_i} \rangle \right)^2
\end{equation}
Here, $d$ is the number of features and $N$ is the number of samples.
The denominator $N - d - 1$, rather than $N$, is used to make $\hat{\sigma}^2$ an unbiased estimator of $\sigma^2$.
Now the empirical variance of the residuals is approximately equal to $\sigma^2$, and the term $\left( \frac{\hat{w_i} - w^{*}_i}{\hat{\sigma} \sqrt{s_{ii}}} \right)$ is asymptotically Gaussian with mean $0$ and variance $1$.
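Putting these pieces together, here is a minimal sketch (again with synthetic data and names of my own choosing) that estimates $\hat{\sigma}^2$ from the residuals and builds approximate $95\%$ confidence intervals for each weight:

```python
import numpy as np

# Synthetic data, assumed for illustration; sigma is treated as unknown by the estimator.
rng = np.random.default_rng(3)
N, d, sigma = 150, 2, 0.7
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_star = np.array([0.5, 1.5, -2.0])
y = Phi @ w_star + rng.normal(0, sigma, size=N)

S = np.linalg.inv(Phi.T @ Phi)
w_hat = S @ Phi.T @ y

# Unbiased noise estimate: residual sum of squares divided by N - d - 1.
residuals = y - Phi @ w_hat
sigma2_hat = residuals @ residuals / (N - d - 1)

# Approximate 95% interval for each weight: w_hat_i ± 1.96 * sigma_hat * sqrt(s_ii)
half_width = 1.96 * np.sqrt(sigma2_hat * np.diag(S))
for i, (w, h) in enumerate(zip(w_hat, half_width)):
    print(f"w{i}: {w:.3f}, 95% CI [{w - h:.3f}, {w + h:.3f}]")
```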
Notes
If $N$ is small, this ratio instead follows a t-distribution (Student's distribution). Exact confidence intervals can even be derived using Cochran's theorem.
Our derivation of the confidence intervals relies on the assumption that the data is homoscedastic. When it is not, we can compute confidence intervals empirically using the bootstrap method, which is based on resampling and the law of large numbers.
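As a sketch of that idea, here is a plain pairs bootstrap with percentile intervals (one of several bootstrap variants), on synthetic heteroscedastic data of my own making:

```python
import numpy as np

# Heteroscedastic synthetic data (noise variance grows with x), assumed for illustration.
rng = np.random.default_rng(4)
N = 200
x = rng.uniform(0, 5, size=N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.2 + 0.3 * x, size=N)
Phi = np.column_stack([np.ones(N), x])

def fit(Phi, y):
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

# Pairs bootstrap: resample (x_i, y_i) pairs with replacement and refit each time.
boot = []
for _ in range(2000):
    idx = rng.integers(0, N, size=N)
    boot.append(fit(Phi[idx], y[idx]))
boot = np.array(boot)

# Percentile 95% confidence intervals for w0 and w1.
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print("w0 CI:", (lo[0], hi[0]))
print("w1 CI:", (lo[1], hi[1]))
```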
Bias-Variance Tradeoff
Regularization
Currently, we are using the following loss function:
\begin{equation}
\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2
\end{equation}
We can also add a regularization term to the loss function to prevent overfitting:
\begin{equation}
\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2
\end{equation}
This is called Ridge or L2 regularization. Here, $\lambda$ is the regularization parameter; it is a hyperparameter and can be tuned using cross-validation.
Note: We can also use other regularization methods like lasso (L1) regularization, square-root lasso, etc.
Now, to find the value of $\mathbf{w}$ that minimizes the loss function, we solve:
\begin{equation*}
\nabla \mathcal{L}(\mathbf{w}) = 0
\end{equation*}
\begin{align*}
&\implies \nabla_{\mathbf{w}} \left( \frac{1}{2N} \| \mathbf{y} - \Phi \mathbf{w}\|^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2 \right) = 0\\
&\implies -\frac{1}{N} \Phi^T \left( \mathbf{y} - \Phi \mathbf{w} \right) + \lambda \mathbf{w} = 0\\
&\implies \Phi^T \Phi \mathbf{w} + N \lambda \mathbf{w} = \Phi^T \mathbf{y}\\
&\implies \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right) \mathbf{w} = \Phi^T \mathbf{y}
\end{align*}
\begin{equation}
\therefore \mathbf{\hat{w}_\lambda} = \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \mathbf{y}
\end{equation}
Note: Now, even if $d > N$, $\left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1}$ is well defined because of the shift: for $\lambda > 0$, $\Phi^T \Phi + N \lambda \mathbb{I}_d$ is positive definite and hence invertible.
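A minimal sketch of the ridge solution, deliberately using more features than samples to illustrate the note above (the synthetic data and names are my own assumptions):

```python
import numpy as np

# Synthetic data with more features than samples (d + 1 > N), assumed for illustration.
rng = np.random.default_rng(5)
N, d = 30, 50
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_star = rng.normal(size=d + 1)
y = Phi @ w_star + rng.normal(0, 0.1, size=N)

lam = 0.1  # regularization strength; would normally be tuned by cross-validation

# Ridge closed form: w_lambda = (Phi^T Phi + N * lambda * I)^{-1} Phi^T y
A = Phi.T @ Phi + N * lam * np.eye(d + 1)
w_ridge = np.linalg.solve(A, Phi.T @ y)

# Without the shift the normal equations break down: Phi^T Phi is rank deficient here.
print(np.linalg.matrix_rank(Phi.T @ Phi), "<", d + 1)
print(w_ridge[:5])
```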
Prediction Error
Let's now look at the impact of $\lambda$ on the generalization error.
\begin{align}
y_i - \hat{y_i} &= \langle \mathbf{w}^*, \mathbf{x_i} \rangle + \epsilon_i - \langle \mathbf{\hat{w}_\lambda}, \mathbf{x_i} \rangle\\
&= \langle \mathbf{w}^* - \mathbf{\hat{w}_\lambda}, \mathbf{x_i} \rangle + \epsilon_i
\end{align}
Then,
\begin{align*}
\mathbf{w}^* - \mathbf{\hat{w}_\lambda} &= \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \mathbf{y}\\
&= \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \left( \Phi \mathbf{w}^* + \boldsymbol{\epsilon} \right)\\
&= \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \Phi \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \boldsymbol{\epsilon}\\
&= \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \textcolor{#01B636}{(} \Phi^T \Phi \textcolor{#01B636}{+ N \lambda \mathbb{I}_d )} \mathbf{w}^* \textcolor{#E63E6D}{+ \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} N \lambda \mathbf{w}^*} - \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \boldsymbol{\epsilon}\\
&= \textcolor{#01B636}{\mathbf{w}^*} \textcolor{#E63E6D}{- \mathbb{I}_d \mathbf{w}^*} + \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} N \lambda \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \boldsymbol{\epsilon}\\
&= \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} N \lambda \mathbf{w}^* - \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \boldsymbol{\epsilon}
\end{align*}
Substituting this value in the above equation, we get:
\begin{align*}
y_i - \hat{y_i} &= \textcolor{#C060A1}{\langle N \lambda \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \mathbf{w}^*, \mathbf{x_i} \rangle} - \textcolor{#03C988}{\langle \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \boldsymbol{\epsilon}, \mathbf{x_i} \rangle} + \textcolor{#8B9A46}{\epsilon_i}
\end{align*}
Here:
$\textcolor{#C060A1}{\langle N \lambda \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \mathbf{w}^*, \mathbf{x_i} \rangle}$ is the non-zero bias term, which is deterministic. (It is $0$ if $\lambda = 0$.)
$\textcolor{#03C988}{\langle \left( \Phi^T \Phi + N \lambda \mathbb{I}_d \right)^{-1} \Phi^T \boldsymbol{\epsilon}, \mathbf{x_i} \rangle}$ is the variance term.
$\textcolor{#8B9A46}{\epsilon_i}$ is the noise term, which is uncontrollable.
Here, if we take the expectation of $y_i - \hat{y_i}$, we get a non-zero term: the bias. This is unfortunate, because our non-regularized model had an unbiased least-squares estimator (its expectation was the ground truth). By introducing a bias into our model, we systematically underestimate or overestimate the ground truth.
The benefit of introducing a bias is that it lets us control the variance term: with a small $\lambda$ we have high variance and low bias, while with a large $\lambda$ we have low variance and high bias. We call this the bias-variance tradeoff.
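To see the tradeoff numerically, here is a short Monte Carlo sketch (my own synthetic setup, not from the course) that redraws the noise many times and reports the squared bias and total variance of $\mathbf{\hat{w}_\lambda}$ for a few values of $\lambda$: as $\lambda$ grows, the variance shrinks while the bias grows.

```python
import numpy as np

# Fixed synthetic design, assumed for illustration; only the noise is redrawn.
rng = np.random.default_rng(6)
N, d, sigma = 40, 5, 1.0
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_star = rng.normal(size=d + 1)

def ridge(Phi, y, lam):
    return np.linalg.solve(Phi.T @ Phi + N * lam * np.eye(Phi.shape[1]), Phi.T @ y)

for lam in [0.0, 0.01, 0.1, 1.0]:
    estimates = []
    for _ in range(2000):
        y = Phi @ w_star + rng.normal(0, sigma, size=N)
        estimates.append(ridge(Phi, y, lam))
    estimates = np.array(estimates)
    bias2 = np.sum((estimates.mean(axis=0) - w_star) ** 2)  # squared bias of w_hat_lambda
    var = np.sum(estimates.var(axis=0))                     # total variance of w_hat_lambda
    print(f"lambda={lam:<5} bias^2={bias2:.4f}  variance={var:.4f}")
```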