These are my notes that cover the mathematical foundations of a dimensionality reduction method called Principal Component Analysis (PCA), taken while attending CSCI-UA 9473 - Foundations of Machine Learning at NYU Paris. They make use of linear algebra and statistics to formalize the concept of PCA.

## Multivariate Statistics & Notation

Let $X = \begin{pmatrix} X^1 \\ X^2 \\ \vdots \\ X^d \end{pmatrix} \in \R^d$ be a random vector. We will use the superscript notation to denote the $d$ components of $X$.

The expectation of $X$ is defined as:

$\mathbb{E}[X] = \begin{pmatrix} \mathbb{E}[X^1] \\ \mathbb{E}[X^2] \\ \vdots \\ \mathbb{E}[X^d] \end{pmatrix} \in \R^d$

Similarly, the covariance matrix of $X$, denoted by $\Sigma$, is a $d \times d$ matrix defined such that:

$\Sigma_{ij} = \sigma_{ij} = \text{Cov}(X^i, X^j) = \mathbb{E}[X^iX^j] - \mathbb{E}[X^i]\mathbb{E}[X^j]$

We can write the whole covariance matrix in the following vectorized form:

$\begin{equation} \Sigma = \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal \end{equation}$

Note: This is because $(\mathbb{E}[XX^\intercal])_{ij} = \mathbb{E}[(XX^\intercal)_{ij}] = \mathbb{E}[X^iX^j]$. Recall that:

$XX^\intercal = \begin{pmatrix} X^1 \\ X^2 \\ \vdots \\ X^d \end{pmatrix} \begin{pmatrix} X^1 & X^2 & \dots & X^d \end{pmatrix} = \begin{pmatrix} X^1X^1 & X^1X^2 & \dots & X^1X^d \\ X^2X^1 & X^2X^2 & \dots & X^2X^d \\ \vdots & \vdots & \ddots & \vdots \\ X^dX^1 & X^dX^2 & \dots & X^dX^d \end{pmatrix}$

The covariance matrix can also be written as:

$\begin{equation} \Sigma = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\intercal] \end{equation}$

Note: This is because $(X - \mathbb{E}[X])^{i} = X^i - \mathbb{E}[X^i] = X^i - \mathbb{E}[X]^i$, where the superscript again denotes the $i$-th component.

To verify this, we expand the right-hand side of the equation:

\begin{align*} \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\intercal] &= \mathbb{E}[(X - \mathbb{E}[X])(X^\intercal - \mathbb{E}[X]^\intercal)] \\ &= \mathbb{E}[XX^\intercal - X\mathbb{E}[X]^\intercal - \mathbb{E}[X]X^\intercal + \mathbb{E}[X]\mathbb{E}[X]^\intercal] \\ &= \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal - \mathbb{E}[X]\mathbb{E}[X]^\intercal + \mathbb{E}[X]\mathbb{E}[X]^\intercal \\ &= \mathbb{E}[XX^\intercal] - \mathbb{E}[X]\mathbb{E}[X]^\intercal \end{align*}
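The equivalence of the two expressions for $\Sigma$ can also be checked numerically. A minimal sketch assuming NumPy, with expectations replaced by sample means over a large sample (the mixing matrix and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# n realizations of a 3-dimensional random vector (rows are realizations)
X = rng.normal(size=(10000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                            [1.0, 1.0, 0.0],
                                            [0.0, 0.5, 1.0]])

# Form (1): E[X X^T] - E[X] E[X]^T, expectations replaced by sample means
mu = X.mean(axis=0)
form1 = (X.T @ X) / len(X) - np.outer(mu, mu)

# Form (2): E[(X - E[X])(X - E[X])^T]
Xc = X - mu
form2 = (Xc.T @ Xc) / len(X)

assert np.allclose(form1, form2)
```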

### Empirical Estimation & Reviewing Linear Algebra

Let $\mathbb{X} = \begin{pmatrix} \dots & {X_1}^\intercal & \dots \\ \dots & {X_2}^\intercal & \dots \\ \dots & \vdots & \dots \\ \dots & {X_n}^\intercal & \dots \end{pmatrix} \in \R^{n \times d}$ be a matrix that contains $n$ realizations of $X$.

We will use the subscript notation to denote the $n$ observations of $X$. These $X_1, X_2, \dots, X_n$ are assumed to be independent and identically distributed (i.i.d.) random vectors with the same distribution as the random variable $X$.

Since we don't have access to the true distribution of $X$, we will use the empirical distribution of $\mathbb{X}$ to estimate the expectation and covariance matrix of $X$.

The empirical expectation of $X$ is denoted by $\bar{X}$ and is defined as:

\newcommand\identity{1\kern-0.25em\text{l}} \begin{align} \bar{X} &= \frac{1}{n} \sum_{i=1}^n X_i \\ &= \frac{1}{n} \mathbb{X}^\intercal \identity_n \end{align}

where $\newcommand\identity{1\kern-0.25em\text{l}} \identity_n = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \in \R^n$ is a vector of ones.

Note: This can be explained by:

$\mathbb{X}^\intercal \newcommand\identity{1\kern-0.25em\text{l}} \identity_n = \begin{pmatrix} \text{\textbar} & \text{\textbar} & & \text{\textbar} \\ X_1 & X_2 & \dots & X_n \\ \text{\textbar} & \text{\textbar} & & \text{\textbar} \end{pmatrix} \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = \sum_{j=1}^n X_j = n\bar{X}$
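As a quick sanity check, the matrix-vector form of the empirical mean can be compared against the coordinate-wise sample mean. A sketch assuming NumPy (dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
X = rng.normal(size=(n, d))          # rows are the observations X_1, ..., X_n

ones = np.ones(n)                    # the all-ones vector 1_n
xbar_matrix = X.T @ ones / n         # (1/n) X^T 1_n
xbar_direct = X.mean(axis=0)         # coordinate-wise sample mean

assert np.allclose(xbar_matrix, xbar_direct)
```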

The empirical covariance matrix of $X$ is denoted by $S$ and is defined as:

\begin{align} S &= \frac{1}{n} \sum_{i=1}^n X_i{X_i}^\intercal - \bar{X}\bar{X}^\intercal \\ &= \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^\intercal \\ \end{align}

Again, performing further vectorization of equation $(5)$, we get:

$\begin{equation} S = \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \bar{X}\bar{X}^\intercal \end{equation}$

Replacing value of $\bar{X}$ from equation $(4)$, we get:

\newcommand\identity{1\kern-0.25em\text{l}} \begin{align} S &= \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \frac{1}{n^2} \mathbb{X}^\intercal \identity_n (\mathbb{X}^\intercal \identity_n)^\intercal \\ &= \frac{1}{n} \mathbb{X}^\intercal \mathbb{X} - \frac{1}{n^2} \mathbb{X}^\intercal \identity_n \identity_n^\intercal \mathbb{X} \\ &= \frac{1}{n} \mathbb{X}^\intercal \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \mathbb{X} \\ &= \frac{1}{n} \mathbb{X}^\intercal H \mathbb{X} \end{align}

where $H$ is a matrix such that $H_{ii} = 1 - \frac{1}{n}$ and $H_{ij} = -\frac{1}{n}$ for $i \neq j$.
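The identity $S = \frac{1}{n} \mathbb{X}^\intercal H \mathbb{X}$ can be verified numerically. A sketch assuming NumPy; note that `np.cov` with `bias=True` uses the same $\frac{1}{n}$ normalization as these notes:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 4
X = rng.normal(size=(n, d))

H = np.eye(n) - np.ones((n, n)) / n  # the centering matrix H

S_H = X.T @ H @ X / n                # S = (1/n) X^T H X
xbar = X.mean(axis=0)
S_direct = X.T @ X / n - np.outer(xbar, xbar)

assert np.allclose(S_H, S_direct)
assert np.allclose(S_H, np.cov(X, rowvar=False, bias=True))
```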

### Properties of $H$

#### Orthogonal Projector

Matrix $H$ can be shown to be an orthogonal projector. Since it is symmetric ($H^\intercal = H$), it suffices to show that $H^2 = H$.

\newcommand\identity{1\kern-0.25em\text{l}} \begin{align*} H^2 &= \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) \\ &= \mathbb{I}_n - \frac{2}{n} \identity_n \identity_n^\intercal + \frac{1}{n^2} \identity_n \left( \identity_n^\intercal \identity_n \right) \identity_n^\intercal \\ &= \mathbb{I}_n - \frac{2}{n} \identity_n \identity_n^\intercal + \frac{n}{n^2} \identity_n \identity_n^\intercal ~~~[~\because \identity_n^\intercal \identity_n = n~]\\ &= \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \\ &= H \end{align*}
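These projector properties are easy to confirm numerically. A small sketch assuming NumPy, which also checks that the eigenvalues of $H$ are $0$ and $1$, as expected for a projector:

```python
import numpy as np

n = 6
H = np.eye(n) - np.ones((n, n)) / n

assert np.allclose(H, H.T)       # symmetric: H^T = H
assert np.allclose(H @ H, H)     # idempotent: H^2 = H

# eigenvalues of a projector are 0 or 1; here 0 once (direction 1_n), 1 otherwise
eigvals = np.linalg.eigvalsh(H)  # returned in ascending order
assert np.allclose(eigvals, [0.0] + [1.0] * (n - 1))
```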

#### Projection Space

Let's take $v \in \R^n$. We have:

\newcommand\identity{1\kern-0.25em\text{l}} \begin{align*} Hv &= \left( \mathbb{I}_n - \frac{1}{n} \identity_n \identity_n^\intercal \right) v \\ &= v - \frac{1}{n} \identity_n \identity_n^\intercal v \\ &= v - \left(\frac{1}{n} v^\intercal \identity_n \right) \identity_n \\ &= v - \bar{v} \identity_n \end{align*}

where $\bar{v} = \frac{1}{n} \sum_{i=1}^n v_i$. Therefore, this projector removes the average of $v$ from each of its coordinates.

Note: It projects onto the subspace of vectors having zero mean. In other words, for any $v \in \R^n$:

\newcommand\identity{1\kern-0.25em\text{l}} \begin{align*} Hv \perp \text{span} \left\{ \identity_n \right\} \end{align*}
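The centering behaviour of $H$ can be checked directly. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 7
v = rng.normal(size=n)
H = np.eye(n) - np.ones((n, n)) / n

Hv = H @ v
assert np.allclose(Hv, v - v.mean())       # H subtracts the mean from each coordinate
assert np.isclose(Hv @ np.ones(n), 0.0)    # Hv is orthogonal to 1_n (zero mean)
```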

### Switching to Statistics

Let $u \in \R^d$, then we can show that $u^\intercal \Sigma u$ is the variance of $u^\intercal X$.

\begin{align*} u^\intercal \Sigma u &= u^\intercal \left( \mathbb{E}[XX^\intercal] - \mathbb{E}[X] \mathbb{E}[X]^\intercal \right) u \\ &= u^\intercal \mathbb{E}[XX^\intercal] u - u^\intercal \mathbb{E}[X] \mathbb{E}[X]^\intercal u \\ &= \mathbb{E}[u^\intercal X X^\intercal u] - \mathbb{E}[u^\intercal X]\,\mathbb{E}[X^\intercal u] \\ &= \mathbb{E}[(u^\intercal X)^2] - (\mathbb{E}[u^\intercal X])^2 \\ &= \text{Var}(u^\intercal X) \end{align*}

With a similar argument, we can show that $u^\intercal S u$ is the sample variance of the scalars $u^\intercal X_1, \ldots, u^\intercal X_n \in \R$. This gives us the variance of the data along the direction of $u$.
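This identity between the quadratic form $u^\intercal S u$ and the sample variance of the projections can be verified numerically. A sketch assuming NumPy; both `np.cov(..., bias=True)` and `ndarray.var()` use the $\frac{1}{n}$ convention, so the two sides match exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 3
X = rng.normal(size=(n, d))
u = rng.normal(size=d)
u /= np.linalg.norm(u)                      # unit direction u

S = np.cov(X, rowvar=False, bias=True)      # empirical covariance (1/n normalization)
proj = X @ u                                # the scalars u^T X_1, ..., u^T X_n
assert np.isclose(u @ S @ u, proj.var())    # variance of the data along u
```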

## Principal Component Analysis

PCA is an unsupervised linear transformation technique that allows us to reduce the dimensionality of a dataset while retaining as much information as possible. The core idea behind PCA is to use variance as a measure of spread in the data.

PCA identifies the directions of maximum variance in the data and projects the data onto a new orthogonal basis of equal or lower dimension, whose axes are the principal components.

Let's write down the maximization problem more formally:

\begin{align} \max_{u \in \R^d} \quad & u^\intercal S u \\ \text{s.t.} \quad & u^\intercal u = 1 \end{align}

The constraint $u^\intercal u = 1$ is added to ensure that the solution is not affected by the magnitude of $u$.

### Spectral Theorem

If $S$ is a symmetric matrix with real entries, then there exists an orthogonal matrix $P$ and a diagonal matrix $\Lambda$ such that:

$S = P \Lambda P^\intercal$

where the columns $v_1, \dots, v_d$ of $P$ are the eigenvectors of $S$, satisfying ${v_i}^\intercal v_j = 0$ for $i \neq j$ and ${v_i}^\intercal v_i = 1$, and the diagonal entries of $\Lambda$ are the corresponding eigenvalues.

The key fact is that $S v_i = \lambda_i v_i$, so ${v_i}^\intercal S v_i = \lambda_i {v_i}^\intercal v_i = \lambda_i$. Thus, the variance of the data along the eigenvector $v_i$ is exactly the associated eigenvalue.

### Solution

Equations $(12)$–$(13)$ define a constrained optimization problem. We can solve it using Lagrange multipliers. Let's define the Lagrangian:

\begin{align*} \mathcal{L}(u, \lambda) &= u^\intercal S u - \lambda (u^\intercal u - 1) \\ &= u^\intercal S u - \lambda u^\intercal u + \lambda \end{align*}

Now, we can take the derivative of $\mathcal{L}$ with respect to $u$ and set it to zero:

\begin{align*} \frac{\partial \mathcal{L}}{\partial u} &= 2Su - 2\lambda u \\ &= 0 \\ \implies Su &= \lambda u \end{align*}

This is an eigenvalue problem. The eigenvector $u$ that corresponds to the largest eigenvalue of $S$ is the solution to the optimization problem. This eigenvector is called the first principal component of $X$.
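This characterization can be checked empirically: no unit vector should achieve a larger Rayleigh quotient $u^\intercal S u$ than the eigenvector of the largest eigenvalue. A sketch assuming NumPy (the anisotropic scaling is an arbitrary choice to make one direction dominate):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 300, 4
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.5, 0.1])  # anisotropic cloud

S = np.cov(X, rowvar=False, bias=True)
eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues for symmetric S
u1 = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue

assert np.isclose(u1 @ S @ u1, eigvals[-1]) # its variance is the top eigenvalue

# no random unit vector beats the top eigenvector's variance
for _ in range(100):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    assert u @ S @ u <= eigvals[-1] + 1e-9
```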

## PCA Algorithm

Let us assume we are given input data $X_1, \dots, X_n$, a cloud of $n$ points in dimension $d$. We want to reduce the dimension to $k$ such that $k \leq d$.

The algorithm is as follows:

1. Compute the empirical covariance matrix $S$.
2. Compute the spectral decomposition of $S$:
$S = PDP^\intercal \text{~with~} D = \text{diag}(\lambda_1, \dots, \lambda_d) \text{~and~} \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$
3. Choose $k \leq d$ and set $P_k = [v_1, \dots, v_k] \in \R^{d \times k}$.
4. Compute the projected points $Z_1, \dots, Z_n$ where:

$Z_i = P_k^\intercal X_i \in \R^k$
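The algorithm above can be sketched in a few lines. A minimal implementation assuming NumPy; the helper name `pca` is my own, and (as in the steps above) the points are projected without centering, though many formulations subtract $\bar{X}$ first:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X (n x d) onto the top-k principal components."""
    n = len(X)
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / n          # step 1: empirical covariance S
    eigvals, eigvecs = np.linalg.eigh(S)       # step 2: spectral decomposition (ascending)
    order = np.argsort(eigvals)[::-1]          # sort so lambda_1 >= ... >= lambda_d
    P_k = eigvecs[:, order[:k]]                # step 3: d x k matrix of top eigenvectors
    Z = X @ P_k                                # step 4: Z_i = P_k^T X_i, stacked as rows
    return Z, P_k

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
Z, P_k = pca(X, 2)
assert Z.shape == (100, 2)
assert np.allclose(P_k.T @ P_k, np.eye(2))     # the chosen eigenvectors are orthonormal
```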