These are my notes that cover the mathematical foundations of a dimensionality reduction method called Principal Component Analysis (PCA), taken while attending CSCI-UA 9473 - Foundations of Machine Learning at NYU Paris. They make use of linear algebra and statistics to formalize the concept of PCA.
Multivariate Statistics & Notation
Let $X = \begin{pmatrix} X^1 \\ X^2 \\ \vdots \\ X^d \end{pmatrix} \in \mathbb{R}^d$ be a random vector. We will use the superscript notation to denote the $d$ components of $X$.
The expectation of X is defined as:
$$\mathbb{E}[X] = \begin{pmatrix} \mathbb{E}[X^1] \\ \mathbb{E}[X^2] \\ \vdots \\ \mathbb{E}[X^d] \end{pmatrix} \in \mathbb{R}^d$$
Similarly, the covariance matrix of X, denoted by Σ, is a d×d matrix defined such that:
$$\Sigma_{ij} = \sigma_{ij} = \operatorname{Cov}(X^i, X^j) = \mathbb{E}[X^i X^j] - \mathbb{E}[X^i]\,\mathbb{E}[X^j]$$
We can write the whole covariance matrix in the following vectorized form:
$$\Sigma = \mathbb{E}[X X^\top] - \mathbb{E}[X]\,\mathbb{E}[X]^\top$$
Note: This is because $(\mathbb{E}[X X^\top])_{ij} = \mathbb{E}[(X X^\top)_{ij}] = \mathbb{E}[X^i X^j]$. Recall that $(X X^\top)_{ij} = X^i X^j$.
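As a sanity check (my own sketch, not part of the original notes), we can verify this identity empirically on synthetic Gaussian samples with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n samples of a 3-dimensional random vector X (synthetic data, for illustration only).
n, d = 100_000, 3
A = rng.normal(size=(d, d))
X = rng.normal(size=(n, d)) @ A.T          # rows are realizations of X

# Empirical version of  Sigma = E[X X^T] - E[X] E[X]^T
E_XXt = (X.T @ X) / n                      # approximates E[X X^T]
E_X = X.mean(axis=0, keepdims=True)        # approximates E[X]
Sigma = E_XXt - E_X.T @ E_X

# Compare with NumPy's covariance estimator (bias=True uses the 1/n convention).
print(np.allclose(Sigma, np.cov(X.T, bias=True)))   # True
```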
Let $X = \begin{pmatrix} X_1^\top \\ X_2^\top \\ \vdots \\ X_n^\top \end{pmatrix} \in \mathbb{R}^{n \times d}$ be a matrix that contains $n$ realizations of $X$.
We will use the subscript notation to denote the $n$ observations of $X$. These $X_1, X_2, \ldots, X_n$ are assumed to be independent and identically distributed (i.i.d.) random vectors with the same distribution as the random vector $X$.
Since we don't have access to the true distribution of X, we will use the empirical distribution of X to estimate the expectation and covariance matrix of X.
The empirical expectation of $X$ is denoted by $\bar{X}$ and is defined as:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \in \mathbb{R}^d$$

Similarly, the sample covariance matrix of $X$ is denoted by $S$ and is defined as:

$$S = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^\top \in \mathbb{R}^{d \times d}$$

Note that $u^\top \bar{X}$ is the sample mean of $(u^\top X_1, \ldots, u^\top X_n) \in \mathbb{R}$. With a similar argument, we can show that $u^\top S u$ is the sample variance of $(u^\top X_1, \ldots, u^\top X_n) \in \mathbb{R}$. This gives us the variance of $X$ along the direction of $u$.
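To make this concrete, here is a small NumPy sketch (my own illustration, not from the course) computing $\bar{X}$, $S$, and checking that $u^\top S u$ matches the variance of the projected points:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # n realizations, one per row

# Empirical mean and sample covariance (1/n convention, matching the notes).
X_bar = X.mean(axis=0)
S = (X - X_bar).T @ (X - X_bar) / n

# Pick any unit direction u and project the points onto it.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
proj = X @ u                                             # (u^T X_1, ..., u^T X_n)

# u^T S u is the sample variance of the projections.
print(np.isclose(u @ S @ u, proj.var()))                 # True (proj.var() also uses 1/n)
```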
Principal Component Analysis
PCA is an unsupervised linear transformation technique that allows us to reduce the dimensionality of a dataset while retaining as much information as possible. The core idea behind PCA is to use variance as a measure of spread in the data.
PCA identifies the directions of maximum variance in the data and projects the data onto a new orthogonal basis of the same or lower dimension, whose axes are the principal components.
Let's write down the maximization problem more formally:
$$\max_{u \in \mathbb{R}^d} \; u^\top S u \quad \text{s.t.} \quad u^\top u = 1$$
The constraint u⊺u=1 is added to ensure that the solution is not affected by the magnitude of u.
Spectral Theorem
If S is a symmetric matrix with real components, then there exists an orthogonal matrix P and a diagonal matrix Λ such that:
$$S = P \Lambda P^\top$$
where the columns $v_1, \ldots, v_d$ of $P$ are the eigenvectors of $S$, satisfying $v_i^\top v_j = 0$ for $i \neq j$ and $v_i^\top v_i = 1$, and the diagonal entries of $\Lambda$ are the corresponding eigenvalues.
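As a quick sanity check (my own sketch), NumPy's np.linalg.eigh recovers exactly this factorization for a symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.normal(size=(d, d))
S = M @ M.T                       # a symmetric (here also PSD) matrix, like a covariance matrix

# eigh is designed for symmetric matrices; eigenvalues come back in ascending order.
eigvals, P = np.linalg.eigh(S)
Lambda = np.diag(eigvals)

print(np.allclose(S, P @ Lambda @ P.T))   # S = P Lambda P^T
print(np.allclose(P.T @ P, np.eye(d)))    # columns of P are orthonormal
```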
Note that $v_i^\top S v_i = \lambda_i v_i^\top v_i = \lambda_i$, since $S v_i = \lambda_i v_i$ and $v_i^\top v_i = 1$. Thus, the variance of $X$ along the eigenvector $v_i$ is exactly the associated eigenvalue $\lambda_i$.
Solution
The maximization problem above is a constrained optimization problem. We can solve it using Lagrange multipliers. Let's define the Lagrangian:
$$\mathcal{L}(u, \lambda) = u^\top S u - \lambda (u^\top u - 1) = u^\top S u - \lambda u^\top u + \lambda$$
Now, we can take the derivative of L with respect to u and set it to zero:
$$\frac{\partial \mathcal{L}}{\partial u} = 2 S u - 2 \lambda u = 0 \implies S u = \lambda u$$
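For reference, the matrix-calculus identities used in this step (standard results, noted here for completeness) are:

$$\frac{\partial}{\partial u}\left(u^\top S u\right) = (S + S^\top)\, u = 2 S u \quad (\text{since } S = S^\top), \qquad \frac{\partial}{\partial u}\left(u^\top u\right) = 2u$$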
This is an eigenvalue problem: any solution $u$ must be an eigenvector of $S$. Plugging $Su = \lambda u$ back into the objective gives $u^\top S u = \lambda u^\top u = \lambda$, so the maximum is attained by the eigenvector corresponding to the largest eigenvalue of $S$. This eigenvector is called the first principal component of $X$.
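A quick numerical check (my own sketch): among many random unit vectors, none achieves a larger value of $u^\top S u$ than the eigenvector associated with the largest eigenvalue of $S$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1_000, 5
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X_c = X - X.mean(axis=0)
S = X_c.T @ X_c / n

eigvals, P = np.linalg.eigh(S)
u_star = P[:, -1]                 # eigenvector of the largest eigenvalue (first principal component)

# No random unit vector should beat u_star on the objective u^T S u.
U = rng.normal(size=(1_000, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
print(u_star @ S @ u_star >= np.max(np.einsum('ij,jk,ik->i', U, S, U)))  # True
```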
PCA Algorithm
Let us assume we are given input data $X_1, \ldots, X_n$, a cloud of $n$ points in dimension $d$. We want to reduce the dimension to $k$ such that $k \leq d$.
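Below is a minimal NumPy sketch of the whole procedure (my own illustration, not code from the course; the `pca` function and variable names are mine): center the data, form the sample covariance matrix, take the top-$k$ eigenvectors, and project onto them.

```python
import numpy as np

def pca(X, k):
    """Project an (n, d) data matrix X onto its top-k principal components."""
    n, d = X.shape
    X_bar = X.mean(axis=0)
    X_centered = X - X_bar                      # center the cloud of points
    S = X_centered.T @ X_centered / n           # sample covariance matrix (d, d)

    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so reversing the columns puts the top-k principal components first.
    eigvals, eigvecs = np.linalg.eigh(S)
    components = eigvecs[:, ::-1][:, :k]        # (d, k), ordered by decreasing eigenvalue

    Z = X_centered @ components                 # (n, k) coordinates in the reduced space
    return Z, components, eigvals[::-1]

# Toy usage on synthetic data.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, components, eigvals = pca(X, k=2)
print(Z.shape)            # (200, 2)
print(eigvals[:2])        # variance captured by each of the two retained components
```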