Machine Learning Newsletter

Geometric Take on PCA

If linear regression is one of the most commonly used method for regression and in general for supervised learning, Principal Component Analysis (PCA) would be the equivalent of linear regression in the unsupervised learning for mainly dimensionality reduction. It could be used a variety of problems, though. To name a few; data preprocessing(removing the correlations between variables), feature extraction and anomaly detection. That being said the learning algorithm is useful for some subset of problems even though it is not a silver bullet that solves all of your problems.

Popularity/methodology of the algorithm gets some harsh criticism from some top-notch machine learning researchers as well:

David MacKay responds to cricism on how he did not include PCA in his book:

"Principal Component Analysis" is a dimensionally invalid method that gives people a delusion that they are doing something useful with their data. If you change the units that one of the variables is measured in, it will change all the "principal components"! It's for that reason that I made no mention of PCA in my book. I am not a slavish conformist, regurgitating whatever other people think should be taught. I think before I teach.

I digress.

Let's shed some light on the problems it could be useful for.

Among many of its definitions, PCA could be defined as a orthogonal linear transformation where the coordinates are determined based on "principal components" which are ordered based on the variance they capture/explain in the data.

It could be used to learn some of the transformations that data underwent as well or normalize the data in a way so that the correlations among variables of interest would be removed.

More formally, PCA could be defined in the following way: Assume we have an input data matrix which has $N$ points $X$: $$ C = \frac{1}{N}\displaystyle\sum_{i=1}^N (x_n - \mu) (x_n - \mu)^T $$ and we want to find a $P$ matrix which will have principal components $$ C = P \Lambda P^T $$ where $\Lambda$ is a diagonal matrix with elements $ {\lambda_1, \lambda_2, \ldots, \lambda_K} $ and $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_K $

This problem formulation and finding $P$ and $\Lambda$ could be solved by eigenvalue decomposition. This problem formulation is same an eigenvalue decomposition, $det (C - \lambda_1 I)p_1 =0 $ where $p_1$ is the eigenvector and $\lambda_1$ is the eigenvalue. Singular Value Decomposition is generally preferred method because then(? what a sentence ?) the dimension number exceeds number of observations in the matrix $X$, the decomposition is easier than solving the covariance matrix. I will be using svd in numpy to compute the principal components in the toy signal as well.

Rotation could be defined as a matrix multiplication in Cartesian coordinates, the following matrix rotates data in counter-clockwise by $\theta$; $$ R = \begin{bmatrix} \cos \theta -\sin \theta \\ \sin \theta \cos \theta \\ \end{bmatrix} $$

For clockwise, $\theta$ only gets replaced by $-\theta$.

If we want to rotate x and y axises with $\theta$ degree, the matrix equality could be written: $$ \begin{bmatrix} x' \\ y' \\ \end{bmatrix} = \begin{bmatrix} \cos \theta -\sin \theta \\ \sin \theta \cos \theta \\ \end{bmatrix}\begin{bmatrix} x \\ y \\ \end{bmatrix} $$

Then, new $x'$ and $y'$ is immediate as in the following: $$ x' = x \cos \theta - y \sin \theta\ $$

$$ y' = x \sin \theta + y \cos \theta\ $$

If we want to learn the rotation in the signal by just looking at the data(assuming the original signal has principal components in the $x$ any $y$ axises), we could do that using PCA. As PCA will try to "normalize" the data into the axises along with principal components, we could look at the angle difference between the normalized axises to measure the rotation. If we are interested in measuring the strangeth of each principal component, PCA will give the explained variance along the process as well.

In [1]:
%matplotlib inline

import copy
import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mlp

# Matplotlib kind of rocks!'fivethirtyeight')

GLOBAL_FIGURE_STYLING = dict(figsize=(16, 12), facecolor='#eeeeee')

                                    arrowprops=dict(arrowstyle='<->', color='#A60628', lw=1))

ROTATED_AXIS_ANNOTATION_STYLING['arrowprops']['color'] = '#7A68A6'
ELLIPSE_STYLING = dict(ec='k', fc='#348ABD', alpha=0.2, zorder=1)
SCATTER_STYLING = dict(s=25, lw=0, c='#188487', zorder=2)

radius = 0.8
fig_dim = radius + 0.5

signal_length = 10000
n_dimension = 2
sigma1 = 0.30
sigma2 = 0.10
rotation = np.pi / 4. # Give 45 degree rotation 

sinus_rotation = np.sin(rotation)
cosinus_rotation = np.cos(rotation)
In [2]:
signal = np.random.normal(0, [sigma1, sigma2], size=(signal_length, n_dimension)).T
_, ax = plt.subplots(**GLOBAL_FIGURE_STYLING)
ax.scatter(signal[0], signal[1], **SCATTER_STYLING)
ax.set_title('Original Signal');

Original signal is normally distributed and if we get PCA of this Gaussian Signal without rotating it, we would get 0 and 90 degrees as our principal components as it actually lies on x and y axises.

In [3]:
rotation_matrix = np.array([[cosinus_rotation, -sinus_rotation],
                            [sinus_rotation, cosinus_rotation]])
signal =, signal)

_, ax = plt.subplots(**GLOBAL_FIGURE_STYLING)
ax.scatter(signal[0], signal[1], **SCATTER_STYLING)
ax.set_title('Rotated Signal');
In [4]:
_, ax = plt.subplots(**GLOBAL_FIGURE_STYLING)

ax.annotate(r'$x$', (-radius, 0), (radius, 0), **ORIGINAL_AXIS_ANNOTATION_STYLING)
ax.annotate(r'$y$', (0, -radius), (0, radius), **ORIGINAL_AXIS_ANNOTATION_STYLING)

ax.annotate(r'$x^\prime$', (-radius * cosinus_rotation, -radius * sinus_rotation), 
                           (radius * cosinus_rotation, radius * sinus_rotation),
ax.annotate(r'$y^\prime$', (radius * sinus_rotation, -radius * cosinus_rotation), 
                           (-radius * sinus_rotation, radius * cosinus_rotation),

ax.scatter(signal[0], signal[1], **SCATTER_STYLING)

ax.set_xlim(-fig_dim, fig_dim);
ax.set_ylim(-fig_dim, fig_dim);

$x^\prime$ and $y^\prime$ show where the first two principal components should lie. We want to learn these axises by applying and ideally retrieving the first two principal components from data.

In [26]:
data = copy.deepcopy(signal.T)
# Mean subtraction
mu = data.mean(axis=0)
data = data - mu

eigen_vectors, eigen_values, V = np.linalg.svd(data.T, full_matrices=True)
projected_data =, eigen_vectors)
sigma = projected_data.std(axis=0).mean()

first_eigenvalue_percentage = round(eigen_values[0] / eigen_values.sum() * 100, 2)
second_eigenvalue_percentage = round(eigen_values[1] / eigen_values.sum() * 100, 2)
In [25]:
fig, ax = plt.subplots(**GLOBAL_FIGURE_STYLING)
ax.scatter(np.reshape(data[:,0], (signal_length, 1)), 
           np.reshape(data[:,1], (signal_length, 1)), 

for axis in eigen_vectors:
    ax.annotate('', xytext=mu, xy=mu + sigma * axis, textcoords='data', xycoords='data', 
                arrowprops=dict(facecolor='#A60628', width=10.0))

ax.set_title('First Two EigenVectors');

Did we learn the rotation in the data after all?

In [31]:
180 * eigen_vectors / np.pi
array([[-40.452142  , -40.57623143],
       [-40.57623143,  40.452142  ]])

It is pretty good I think considering we have also fair amount of noise.

In [48]:
fig, ax = plt.subplots(**GLOBAL_FIGURE_STYLING)
ax.scatter(np.reshape(projected_data[:,0], (signal_length, 1)), 
           np.reshape(projected_data[:,1], (signal_length, 1)), 


ax.set_title('Projected Data');
In [52]:
residuals = np.abs(projected_data - signal.T).ravel()
In [53]:
fig, ax = plt.subplots(**GLOBAL_FIGURE_STYLING)
ax.hist(residuals, normed=1,histtype='stepfilled', bins=100)



This graph shows the residuls generally lie in the small-mid range. Since PCA could be considered as a minimization problem in the least square sense, total residual sum need to be minimum among all possible principal components. $l_2$ distance is good if you have noise-free or small noise. However, if you have a noisy signal or unscaled features as an input, $l_2$ fails quite badly. For those type of signals, if you still want to apply PCA, you may want to use Sparse PCA which uses $l_1$ to find the principal components in the signal.

In [54]:
print('The variance explained by first eigenvector: {}'.format(first_eigenvalue_percentage))
print('The variance explained by first eigenvector: {}'.format(second_eigenvalue_percentage))
The variance explained by first eigenvector: 75.25
The variance explained by first eigenvector: 24.75

Variances of our original Gaussian random variables are 0.3 and 0.1 respectively. The ratio of the eigenvalues reflect the variance of the eigenvectors capture in the main signal. The values of eigenvalues confirm the original variance values.

What it cannot solve?

  • If the data is not on a hyperplane and has a nonlinear structure in the observation space(like a manifold), PCA would not produce meaningful results.
  • In those scenarios, one could adopt a kernel based PCA approach to exploit the structure of data.
  • If the data has quite significant amount of noise which contributes to variance a lot, PCA result would explain "noise" rather than the original signal.

David Mackay's Comment

  • Variable Units(refer to David Mackay's comment on his book). If you measure the same unit in cm instead of inch, you'd get different principal components.
  • You need to adjust to scale of dimensions before applying PCA in order to make something useful.
  • Criticism of MacKay is not about PCA criticim per se, but how it is used among practitioners as well.


I also wrote a piece on PCA and EigenFace and released a library for eigenface written in Python. It has sample application for Yale Face database and a bunch of util tools apart from , play with it!

comments powered by Disqus