These are my notes that cover the mathematical foundations of a dimensionality reduction method called Principal Component Analysis (PCA), taken while attending CSCI-UA 9473 - Foundations of Machine Learning at NYU Paris. They make use of linear algebra and statistics to formalize the concept of PCA.

Multivariate Statistics & Notation

Let $X = \begin{pmatrix} X^1 \\ X^2 \\ \vdots \\ X^d \end{pmatrix} \in \R^d$ be a random vector. We will use the superscript notation to denote the $d$ components of $X$.

The expectation of $X$ is defined as:

$$\mathbb{E}[X] = \begin{pmatrix} \mathbb{E}[X^1] \\ \mathbb{E}[X^2] \\ \vdots \\ \mathbb{E}[X^d] \end{pmatrix} \in \R^d$$

Similarly, the covariance matrix of $X$, denoted by $\Sigma$, is a $d \times d$ matrix defined such that:

$$\Sigma_{ij} = \sigma_{ij} = \text{Cov}(X^i, X^j) = \mathbb{E}[X^iX^j] - \mathbb{E}[X^i]\mathbb{E}[X^j]$$

We can write the whole covariance matrix in the following vectorized form:

$$\begin{equation} \Sigma = \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal \end{equation}$$

Note: This is because $(\mathbb{E}[XX^\intercal])_{ij} = \mathbb{E}[(XX^\intercal)_{ij}] = \mathbb{E}[X^iX^j]$. Recall that:

$$XX^\intercal = \begin{pmatrix} X^1 \\ X^2 \\ \vdots \\ X^d \end{pmatrix} \begin{pmatrix} X^1 & X^2 & \dots & X^d \end{pmatrix} = \begin{pmatrix} X^1X^1 & X^1X^2 & \dots & X^1X^d \\ X^2X^1 & X^2X^2 & \dots & X^2X^d \\ \vdots & \vdots & \ddots & \vdots \\ X^dX^1 & X^dX^2 & \dots & X^dX^d \end{pmatrix}$$

The covariance matrix can also be written as:

$$\begin{equation} \Sigma = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\intercal] \end{equation}$$

Note: This is because, component-wise, $(X - \mathbb{E}[X])^i = X^i - \mathbb{E}[X^i] = X^i - \mathbb{E}[X]^i$.

Just to verify, we will expand the right-hand side of the equation:

$$\begin{align*} \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\intercal] &= \mathbb{E}[(X - \mathbb{E}[X])(X^\intercal - \mathbb{E}[X]^\intercal)] \\ &= \mathbb{E}[XX^\intercal - X\mathbb{E}[X]^\intercal - \mathbb{E}[X]X^\intercal + \mathbb{E}[X]\mathbb{E}[X]^\intercal] \\ &= \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal - \mathbb{E}[X]\mathbb{E}[X]^\intercal + \mathbb{E}[X]\mathbb{E}[X]^\intercal \\ &= \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal \end{align*}$$
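Since the two expressions for $\Sigma$ are algebraically identical, the same identity holds when the expectations are replaced by sample averages. As a quick sanity check, here is a minimal NumPy sketch (the distribution, sample size, and variable names are arbitrary choices of mine, not part of the derivation):

```python
import numpy as np

# Compare the two covariance formulas with expectations replaced by sample averages.
rng = np.random.default_rng(0)
n, d = 10_000, 3
X = rng.multivariate_normal(mean=[1.0, -2.0, 0.5],
                            cov=[[2.0, 0.3, 0.0],
                                 [0.3, 1.0, 0.4],
                                 [0.0, 0.4, 1.5]],
                            size=n)            # rows are realizations of X

mean = X.mean(axis=0)                          # estimate of E[X]
form1 = X.T @ X / n - np.outer(mean, mean)     # E[X X^T] - E[X] E[X]^T
centered = X - mean
form2 = centered.T @ centered / n              # E[(X - E[X])(X - E[X])^T]

assert np.allclose(form1, form2)               # equations (1) and (2) agree
```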

Empirical Estimation & Reviewing Linear Algebra

Let $\mathbb{X} = \begin{pmatrix} \dots & {X_1}^\intercal & \dots \\ \dots & {X_2}^\intercal & \dots \\ \dots & \vdots & \dots \\ \dots & {X_n}^\intercal & \dots \end{pmatrix} \in \R^{n \times d}$ be a matrix that contains $n$ realizations of $X$.

We will use the subscript notation to denote the $n$ observations of $X$. These $X_1, X_2, \dots, X_n$ are assumed to be independent and identically distributed (i.i.d.) random vectors with the same distribution as the random vector $X$.

Since we don't have access to the true distribution of $X$, we will use the empirical distribution of $\mathbb{X}$ to estimate the expectation and covariance matrix of $X$.

The empirical expectation of $X$ is denoted by $\bar{X}$ and is defined as:

$$\newcommand\identity{1\kern-0.25em\text{l}} \begin{align} \bar{X} &= \frac{1}{n} \sum_{i=1}^n X_i \\ &= \frac{1}{n} \mathbb{X}^\intercal \identity_n \end{align}$$

where $\newcommand\identity{1\kern-0.25em\text{l}} \identity_n = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \in \R^n$ is a vector of ones.

Note: This can be explained by:

$$\mathbb{X}^\intercal \newcommand\identity{1\kern-0.25em\text{l}} \identity_n = \begin{pmatrix} \text{\textbar} & \text{\textbar} & \text{\textbar} & \text{\textbar} \\ X^i_1 & X^i_2 & \dots & X^i_n \\ \text{\textbar} & \text{\textbar} & \text{\textbar} & \text{\textbar} \end{pmatrix} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} \text{\textbar} \\ \sum_{j=1}^n X^i_j \\ \text{\textbar} \end{pmatrix} = \begin{pmatrix} \text{\textbar} \\ n\bar{X}^i \\ \text{\textbar}\end{pmatrix} = n\bar{X}$$
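A minimal NumPy sketch of equation $(4)$ (variable names are mine), checking that the matrix-vector form of the sample mean matches the column-wise average:

```python
import numpy as np

# Sample mean as a matrix-vector product: x_bar = (1/n) X^T 1_n.
rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d))      # rows are the observations X_i

ones = np.ones(n)                # the all-ones vector
x_bar = X.T @ ones / n           # equation (4)
assert np.allclose(x_bar, X.mean(axis=0))
```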

The empirical covariance matrix of $X$ is denoted by $S$ and is defined as:

$$\begin{align} S &= \frac{1}{n} \sum_{i=1}^n X_i{X_i}^\intercal - \bar{X}\bar{X}^\intercal \\ &= \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\intercal \end{align}$$

Again, performing further vectorization of equation $(5)$, we get:

$$\begin{equation} S = \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \bar{X}\bar{X}^\intercal \end{equation}$$

Replacing the value of $\bar{X}$ from equation $(4)$, we get:

$$\newcommand\identity{1\kern-0.25em\text{l}} \begin{align} S &= \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \frac{1}{n^2} \mathbb{X}^\intercal \identity_n (\mathbb{X}^\intercal \identity_n)^\intercal \\ &= \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \frac{1}{n^2} \mathbb{X}^\intercal \identity_n \identity_n^\intercal \mathbb{X} \\ &= \frac{1}{n} \mathbb{X}^\intercal \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \mathbb{X} \\ &= \frac{1}{n} \mathbb{X}^\intercal H \mathbb{X} \end{align}$$

where $H$ is the matrix such that $H_{ii} = 1 - \frac{1}{n}$ and $H_{ij} = -\frac{1}{n}$ for $i \neq j$.
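A minimal NumPy sketch (names are mine) confirming that the centered form $\frac{1}{n}\mathbb{X}^\intercal H \mathbb{X}$ matches equation $(7)$:

```python
import numpy as np

# Empirical covariance via the centering matrix H = I_n - (1/n) 1_n 1_n^T.
rng = np.random.default_rng(2)
n, d = 300, 5
X = rng.normal(size=(n, d))

H = np.eye(n) - np.ones((n, n)) / n             # centering matrix
S_centered = X.T @ H @ X / n                    # equation (11)

x_bar = X.mean(axis=0)
S_direct = X.T @ X / n - np.outer(x_bar, x_bar) # equation (7)
assert np.allclose(S_centered, S_direct)
```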

Properties of $H$

Orthogonal Projector

The matrix $H$ can be shown to be an orthogonal projector. Since it is symmetric, $H^\intercal = H$, so it suffices to show that $H^2 = H$.

$$\newcommand\identity{1\kern-0.25em\text{l}} \begin{align*} H^2 &= \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \\ &= \mathbb{I}_n - \frac{2}{n} \identity_n \identity_n^\intercal + \frac{1}{n^2} \identity_n \left( \identity_n^\intercal \identity_n \right) \identity_n^\intercal \\ &= \mathbb{I}_n - \frac{2}{n} \identity_n \identity_n^\intercal + \frac{n}{n^2} \identity_n \identity_n^\intercal ~~~[~\because \identity_n^\intercal \identity_n = n~] \\ &= \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \\ &= H \end{align*}$$
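The same property can be checked numerically; a small sketch (the value of $n$ is arbitrary):

```python
import numpy as np

# H is symmetric and idempotent, i.e. an orthogonal projector.
n = 50
H = np.eye(n) - np.ones((n, n)) / n
assert np.allclose(H, H.T)      # symmetry: H^T = H
assert np.allclose(H @ H, H)    # idempotence: H^2 = H
```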

Projection Space

Let's take $v \in \R^n$. We have:

$$\newcommand\identity{1\kern-0.25em\text{l}} \begin{align*} Hv &= \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) v \\ &= v - \frac{1}{n} \identity_n \identity_n^\intercal v \\ &= v - \left(\frac{1}{n} v^\intercal \identity_n \right) \identity_n \\ &= v - \bar{v} \identity_n \end{align*}$$

where $\bar{v} = \frac{1}{n} \sum_{i=1}^n v_i$. Therefore, this projector removes the average of $v$ from each of its coordinates.

Note: $H$ projects onto the subspace of vectors having zero mean; equivalently, the projected vector is orthogonal to the all-ones vector:

$$\newcommand\identity{1\kern-0.25em\text{l}} Hv \perp \text{span} \left\{ \identity_n \right\}$$
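A short numerical illustration (vector and size chosen arbitrarily) of both facts, that $Hv$ subtracts the mean coordinate-wise and is orthogonal to the all-ones vector:

```python
import numpy as np

# Applying H removes the mean of v from every coordinate.
rng = np.random.default_rng(3)
n = 100
H = np.eye(n) - np.ones((n, n)) / n
v = rng.normal(size=n)

Hv = H @ v
assert np.allclose(Hv, v - v.mean())        # Hv = v - v_bar * 1_n
assert np.isclose(Hv @ np.ones(n), 0.0)     # Hv is orthogonal to 1_n
```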

Switching to Statistics

Let $u \in \R^d$; then we can show that $u^\intercal \Sigma u$ is the variance of $u^\intercal X$.

$$\begin{align*} u^\intercal \Sigma u &= u^\intercal \left( \mathbb{E}[XX^\intercal] - \mathbb{E}[X] \mathbb{E}[X]^\intercal \right) u \\ &= u^\intercal \mathbb{E}[XX^\intercal] u - u^\intercal \mathbb{E}[X] \mathbb{E}[X]^\intercal u \\ &= \mathbb{E}[u^\intercal XX^\intercal u] - \mathbb{E}[u^\intercal X]^2 \\ &= \mathbb{E}[(u^\intercal X)^2] - (\mathbb{E}[u^\intercal X])^2 \\ &= \text{Var}(u^\intercal X) \end{align*}$$

With a similar argument, we can show that $u^\intercal S u$ is the sample variance of the projections $u^\intercal X_1, \ldots, u^\intercal X_n \in \R$. This gives us the variance of $X$ along the direction of $u$.
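A minimal NumPy sketch (the data and the direction $u$ are arbitrary) verifying that $u^\intercal S u$ equals the sample variance of the projections:

```python
import numpy as np

# u^T S u equals the (biased, 1/n) sample variance of u^T X_1, ..., u^T X_n.
rng = np.random.default_rng(4)
n, d = 1_000, 3
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.2])   # anisotropic cloud

u = rng.normal(size=d)
u /= np.linalg.norm(u)                     # unit direction

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / n        # empirical covariance, equation (6)
proj = X @ u                               # u^T X_i for every observation
assert np.isclose(u @ S @ u, proj.var())   # np.var uses the 1/n convention
```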


Principal Component Analysis

PCA is an unsupervised linear transformation technique that allows us to reduce the dimensionality of a dataset while retaining as much information as possible. The core idea behind PCA is to use variance as a measure of spread in the data.

PCA identifies the directions of maximum variance in the data and projects the data onto a new orthogonal basis of the same or lower dimension, spanned by the principal components.

Let's write down the maximization problem more formally:

$$\begin{align} \max_{u \in \R^d} \quad & u^\intercal S u \\ \text{s.t.} \quad & u^\intercal u = 1 \end{align}$$

The constraint $u^\intercal u = 1$ is added to ensure that the solution is not affected by the magnitude of $u$.

Spectral Theorem

If $S$ is a symmetric matrix with real entries, then there exist an orthogonal matrix $P$ and a diagonal matrix $\Lambda$ such that:

$$S = P \Lambda P^\intercal$$

where the columns $v_1, \dots, v_d$ of $P$ are the eigenvectors of $S$, satisfying ${v_i}^\intercal v_j = 0$ for $i \neq j$ and ${v_i}^\intercal v_i = 1$, and the diagonal entries of $\Lambda$ are the corresponding eigenvalues.

Note that ${v_i}^\intercal S v_i = \lambda_i {v_i}^\intercal v_i = \lambda_i$ because $S v_i = \lambda_i v_i$ and the eigenvectors have unit norm. Thus, the variance of $X$ along the eigenvector $v_i$ is exactly the associated eigenvalue $\lambda_i$.
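A minimal sketch using `np.linalg.eigh` (applicable because $S$ is symmetric; data and sizes are arbitrary) that checks ${v_i}^\intercal S v_i = \lambda_i$ for every eigenpair:

```python
import numpy as np

# Spectral decomposition of the empirical covariance matrix.
rng = np.random.default_rng(5)
n, d = 1_000, 4
X = rng.normal(size=(n, d)) * np.array([5.0, 2.0, 1.0, 0.5])

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

eigvals, P = np.linalg.eigh(S)       # eigh returns eigenvalues in ascending order
for lam, v in zip(eigvals, P.T):     # columns of P are the eigenvectors v_i
    assert np.isclose(v @ S @ v, lam)
```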

Solution

Equation $(12)$ is a constrained optimization problem. We can solve it using Lagrange multipliers. Let's define the Lagrangian:

$$\begin{align*} \mathcal{L}(u, \lambda) &= u^\intercal S u - \lambda (u^\intercal u - 1) \\ &= u^\intercal S u - \lambda u^\intercal u + \lambda \end{align*}$$

Now, we can take the derivative of $\mathcal{L}$ with respect to $u$ and set it to zero:

$$\begin{align*} \frac{\partial \mathcal{L}}{\partial u} &= 2Su - 2\lambda u = 0 \\ \implies Su &= \lambda u \end{align*}$$

This is an eigenvalue problem: any stationary point $u$ is an eigenvector of $S$, and plugging $Su = \lambda u$ back into the objective gives $u^\intercal S u = \lambda u^\intercal u = \lambda$. The eigenvector $u$ that corresponds to the largest eigenvalue of $S$ is therefore the solution to the optimization problem. This eigenvector is called the first principal component of $X$.
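To make this concrete, here is a minimal NumPy sketch (data and names are mine) showing that the top eigenvector of $S$ attains a larger value of $u^\intercal S u$ than random unit directions:

```python
import numpy as np

# The eigenvector with the largest eigenvalue maximizes u^T S u over unit u.
rng = np.random.default_rng(6)
n, d = 2_000, 3
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.3])

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

eigvals, P = np.linalg.eigh(S)
u_star = P[:, -1]                    # eigenvector of the largest eigenvalue
best = u_star @ S @ u_star           # equals the largest eigenvalue

for _ in range(100):                 # random unit directions never do better
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    assert u @ S @ u <= best + 1e-9
```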

PCA Algorithm

Let us assume we are given input data $X_1, \dots, X_n$, a cloud of $n$ points in dimension $d$. We want to reduce the dimension to $k$ such that $k \leq d$.

The algorithm is as follows:

  1. Compute the empirical covariance matrix $S$.
  2. Compute the spectral decomposition of $S$:
$$S = PDP^\intercal \\ \text{~with~} D = \text{diag}(\lambda_1, \dots, \lambda_d) \text{~and~} \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$$
  3. Choose $k < d$ and set $P_k = [v_1, \dots , v_k] \in \R^{d \times k}$, where $v_1, \dots, v_k$ are the eigenvectors associated with the $k$ largest eigenvalues.

  4. Compute $Z_1, \dots , Z_n$ where:

$$Z_i = P_k^\intercal X_i \in \R^k$$
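Putting the four steps together, a minimal NumPy implementation sketch (function and variable names are mine, not from the course):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (n x d) onto the top-k principal components."""
    n = X.shape[0]
    x_bar = X.mean(axis=0)
    S = (X - x_bar).T @ (X - x_bar) / n   # step 1: empirical covariance
    eigvals, P = np.linalg.eigh(S)        # step 2: spectral decomposition
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
    P_k = P[:, order[:k]]                 # step 3: top-k eigenvectors (d x k)
    Z = X @ P_k                           # step 4: Z_i = P_k^T X_i (as in the notes, X is not re-centered here)
    return Z, P_k, eigvals[order]

# Example usage on a random 5-dimensional cloud reduced to k = 2 dimensions.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5)) * np.array([4.0, 3.0, 1.0, 0.5, 0.1])
Z, P_k, eigvals = pca(X, k=2)
print(Z.shape)   # (200, 2)
```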