PCA, EigenFace and All That

July 27, 2013 · ∞ -

PCA(Principal Component Analysis) is one of the most commonly used unsupervised learning algorithm to compress, extract features for data and even for dimensionality reduction purposes. It has quite a lof of applications as follows:

Applications

Useful for compression and classfication of data.
Aim is to reduce the number of variables, but at the same time, preserve most of information (variation), which may not necessarily hold true in general.
New variables called principal components(PC) are uncorrelated, are ordered by fraction of total information each retains.
PC’s are a series of linear least squares fits to a sample, each orthogonal to all previous.
Identify how different variables work together to create the dynamics of the system.
Reduce the dimensionality of the data.
Decrease the redundancy of the data.
Filter some of the noise in the data.
Compress the data.
Prepare the data for further analysis using other techniques.

What does it do in the first place?

Functions of PCA

Is to reduce dimensionality by extracting the smallest number components that account for most of the variation in the original data. By doing so, we’d get get rid of the redundancy and preserve the variance in a smaller number of coefficients.
PCA finds lines(2-d), planes(3-d) in a higher dimensional spaces that approximate the data in least squares( $l_2$ norm).

Why do we choose PCA over other transformations which turn the original variables into a representation which depend on orthogonal bases, say Fourier Transform?

Using a set of fixed set of components will give a good reconstruction of the same data(at least in least square sense, $l_2$ norm). However, Fourier transform does not guarantee such a premise.
If the data has a lot of correlation among its variables(redundancy), then PCA could exploit this redundancy by uncorrelating the variables whereas Fourier transform cannot exploit this redundancy(at least explicitly).
When PCA is used for dimensionality reduction, it is quite good at preserving the distance between the observations in the projection space.

What about the disadvantages?

The components are not independent but uncorrelated. It would be even better if we have a representation which are independent to each other. It is called unsurprisingly Independent Component Analysis.
PCA seeks for linear combinations of the original variables. The nonlinear combination may even yield better representation. PCA has an extension for doing this type of analysis, Nonlinear PCA.
Instead of $l_2$ norm, it may be advantageous to use $l_1$ norm. Especially, if the signal that we want to represent is sparse or has a sparse representation in some other space. PCA is extended for this specific problem as well, which is called Sparse PCA.

Assumptions that PCA does:

Assumption 1: In general, high correlation between variables could be a sign of high redundancy.
Assumption 2: The most important dynamics are the ones with the largest variance.

Eigenvalues and Eigenvectors

Eigenvalues measure the amount of variation(information) explaiend by each principal components and will be largest for the first PC and smaller for the subsequent PCs.
An eigenvalue greater than 1 indicates that principal component accounts for more variance than accounted by one of the original variables in standardized data. This could be used to threshold to determine the number of eigenvectors.
Eigenvectors provide the weights to compute the uncorrelated principal components, which are the linear combination of the original variables.

What does PCA do (in a nutshell)?

PCA transforms the observations into uncorrelated components, which are nothing but linear combination of observations.

Caveats

PCA is sensitive to scaling.

By modifying the variance of the variables(scaling), it is possible to attribute different importance to them. By doing so, the prior information or the belief on the importance of the attributes can be preserved even in the PCA stage.

Definition of Principal Components

Given a sample $n$ observations on a vector $p$ variables:
$x = (x_1, x_2, \ldots, x_p)$ define the principal component of the sample by a linear transformation which is given as following:
$z = a^T x = \displaystyle\sum_{i=1}^p a_i x_i$ where $a$ is $a = (a_1, a_2, \ldots, a_p)$ which is chosen to maximize the variance of $z$ and subject to $cov[z_k, z_l] = 0, k \gt l \geq 1$ and to $a^T a = 1$

How to derive the coefficients $a$ ?

The variance is:
$var[z] = \lt z^2\gt - \lt z \gt^2$ $= \displaystyle \sum_{i,j = 1}^p a_i a_j \lt x_i x_j \gt - \displaystyle \sum_{i,j=1}^p a_i a_j \lt x_i \gt \lt x_j \gt$ $= \displaystyle\sum_{i,j = 1}^p a_i a_j S_{ij}$ where $S_{ij} = \lt x_i x_j \gt - \lt x_i \gt \lt x_j \gt$ is the covariance matrix for $x$ .
To find $a$ which maximizes $var[z]$ subject to $a^T a = 1$ , let us introduce a lagrange multiplier $\lambda$ . Then, the maximization equation becomes $a^T S a - \lambda (a^T a - 1)$ If we take the derivative with respect to $a$ , then the equation becomes $S a - \lambda a = 0$ $(S - \lambda I) a = 0$ Therefore, $a$ is an eigenvector of $S$ which has the corresponding value of $\lambda$ .

In short, we are interested in representing the most of the variation in the data by transforming the original variables into principal components. These components are orthogonal and ordered by magnitude so that the first few of them could explain most of the variation in the original observation.

EigenFace

PCA has a very good application which is in the computer vision domain, called EigenFace, short version. Eigenface is a name for eigenvectors which are the components of the face itself. It has been used for face recognition where the most variations considered as important. It was quite successful as well for some 20 years ago although it was replaced then by other methods. In this implementation, I used a particular subset of Yale Face Database.
The images that I used are given below: Alt text