Machine Learning Newsletter

An Introduction to Unsupervised Learning via Scikit Learn

Unsupervised Learning

Unsupervised learning is perhaps the most widely applicable subfield of machine learning, as it does not require any labels in the dataset, and the world itself is an abundant source of unlabeled data. Human beings and their actions are recorded more and more every day (through photographs on Instagram, health data through wearables, internet activity through cookies, and so on). Even the parts of our lives that are not yet digital will be recorded in the near future thanks to the Internet of Things. With data this diverse and unlabeled, unsupervised learning will only become more important.

Not only can it be useful for dimensionality reduction of the feature set (as a preprocessing step), it can also serve as a feature extraction method. PCA (Principal Component Analysis) may be the most widely used unsupervised learning algorithm (PCA is to unsupervised learning what linear regression is to regression). It can be used for dimensionality reduction to compress the data, and because the compression tries to capture the directions of greatest variance, it often picks up interesting features along the way, so it can also be used as a feature extraction method.
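To make this concrete, here is a minimal sketch (not part of the notebook below; scikit-learn's digits dataset and the 95% variance target are purely illustrative choices) of PCA used both as a compression step and as a feature extractor:

from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()
X = digits.data                           # shape (1797, 64)

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)          # compressed representation of X
print(X.shape, '->', X_reduced.shape)

# X_reduced can now be fed to a downstream model as extracted features

The reduced vectors serve double duty: they are a smaller representation of the data, and they are the features a downstream model would consume.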

In this notebook, I will use PCA both to reduce the dimensionality of the dataset and to build our feature vectors. This specific method is called EigenFace, because PCA extracts the eigenvectors of the data, and those eigenvectors can be visualized as faces.

Dimensionality Reduction & Feature Extraction via PCA (EigenFace)

In [20]:
%matplotlib inline
import itertools
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import scipy


from sklearn import cluster
from sklearn import datasets
from sklearn import metrics
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

from sklearn import decomposition # PCA

import time

Modified Olivetti faces dataset

In [21]:
faces = datasets.fetch_olivetti_faces()
In [22]:
print(faces.DESCR)
Modified Olivetti faces dataset.

The original database was available from (now defunct)

    http://www.uk.research.att.com/facedatabase.html

The version retrieved here comes in MATLAB format from the personal
web page of Sam Roweis:

    http://www.cs.nyu.edu/~roweis/

There are ten different images of each of 40 distinct subjects. For some
subjects, the images were taken at different times, varying the lighting,
facial expressions (open / closed eyes, smiling / not smiling) and facial
details (glasses / no glasses). All the images were taken against a dark
homogeneous background with the subjects in an upright, frontal position (with
tolerance for some side movement).

The original dataset consisted of 92 x 112, while the Roweis version
consists of 64x64 images.

In [23]:
faces_images = faces['images']
faces_data = faces.data
In [24]:
faces_images.shape
Out[24]:
(400, 64, 64)

faces_images is nothing but faces_data reshaped into 64x64 images. The image form makes it easier for us to visualize the faces in a grid, but we will apply PCA to faces_data, because PCA expects a two-dimensional input of shape (n_observations, n_dimensions).
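As a quick sanity check (an illustrative sketch, not part of the original notebook), we can verify that each row of faces_data is just the corresponding image flattened into a vector:

import numpy as np

# Each 4096-dimensional row of faces_data is the corresponding 64x64 image, flattened
assert np.array_equal(faces_images[0], faces_data[0].reshape(64, 64))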

In [25]:
faces_data.shape
Out[25]:
(400, 4096)

We have 400 instances, and each one is a vector of dimension 4096 (64 * 64). We can reduce this dimensionality by applying PCA (Principal Component Analysis), and PCA will extract features in the process as well.

Let's see some faces.

In [26]:
fig = plt.figure(figsize=(16, 16))
for ii in range(64):
    plt.subplot(8, 8, ii + 1) # It starts with one
    plt.imshow(faces_images[ii], cmap=plt.cm.gray)
    plt.grid(False);
    plt.xticks([]);
    plt.yticks([]);

Let's first try 16 eigenfaces and see what kind of eigenfaces we get. Note that I am also passing whiten=True to PCA, which rescales the projected components to unit variance; this de-emphasizes the dominant low-frequency (nearly constant) areas of the face, which are not as informative as the areas that vary from face to face.
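As a small, hedged illustration of what whitening does (not part of the original notebook; it uses the newer sklearn.decomposition.PCA with a randomized solver rather than RandomizedPCA): without whitening the leading projected components have much larger variance than the rest, while with whiten=True every projected component is rescaled to roughly unit variance.

from sklearn.decomposition import PCA

pca_plain = PCA(n_components=16, svd_solver='randomized').fit(faces_data)
pca_white = PCA(n_components=16, svd_solver='randomized', whiten=True).fit(faces_data)

# Standard deviation of the first few projected components
print(pca_plain.transform(faces_data).std(axis=0)[:4])  # decreasing scales
print(pca_white.transform(faces_data).std(axis=0)[:4])  # all roughly 1.0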

In [27]:
n_eigenfaces = 16
# Creating PCA object
pca = decomposition.RandomizedPCA(n_components=n_eigenfaces, whiten=True)
# We are applying PCA to the data
pca.fit(faces_data)
Out[27]:
RandomizedPCA(copy=True, iterated_power=3, n_components=16, random_state=None,
       whiten=True)
In [28]:
pca.components_.shape
Out[28]:
(16, 4096)
In [29]:
plt.figure(figsize=(16, 16));
plt.suptitle('EigenFaces');
for ii in range(pca.components_.shape[0]):
    plt.subplot(4, 4, ii + 1) # It starts with one
    plt.imshow(pca.components_[ii].reshape(64, 64), cmap=plt.cm.gray)
    plt.grid(False);
    plt.xticks([]);
    plt.yticks([]);
In [30]:
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(16, 12));
    plt.title('Explained Variance Ratio over Component');
    plt.plot(pca.explained_variance_ratio_);
In [31]:
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(16, 12));
    plt.title('Cumulative Explained Variance over EigenFace');
    plt.plot(pca.explained_variance_ratio_.cumsum());
In [32]:
print('PCA captures {:.2f} percent of the variance in the dataset'.format(pca.explained_variance_ratio_.sum() * 100))
PCA captures 73.09 percent of the variance in the dataset

This is rather low; we generally want around 95% before we can say the dataset is represented quite accurately by a small number of dimensions. Let's increase the number of components.
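Rather than guessing, one way to pick the target dimensionality (an illustrative sketch, not part of the original notebook) is to compute how many components are needed to reach 95% cumulative explained variance:

import numpy as np
from sklearn.decomposition import PCA

# Fit once with all components, then count how many reach 95% of the variance
pca_full = PCA(svd_solver='full').fit(faces_data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_95 = int(np.argmax(cumulative >= 0.95)) + 1
print('Components needed for 95% of the variance:', n_95)

Newer versions of scikit-learn also accept a float target, e.g. PCA(n_components=0.95), and choose the number of components for you.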

In [33]:
n_eigenfaces = 121
# Creating PCA object
pca = decomposition.RandomizedPCA(n_components=n_eigenfaces, whiten=True)
# We are applying PCA to the data
pca.fit(faces_data)
Out[33]:
RandomizedPCA(copy=True, iterated_power=3, n_components=121,
       random_state=None, whiten=True)
In [34]:
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(16, 12));
    plt.title('Cumulative Explained Variance over EigenFace ');
    plt.plot(pca.explained_variance_ratio_.cumsum());
In [35]:
print('PCA captures {:.2f} percent of the variance in the dataset'.format(pca.explained_variance_ratio_.sum() * 100))
PCA captures 94.85 percent of the variance in the dataset

This is good enough. Let's look at what kind of eigenfaces we have this time.

In [36]:
plt.figure(figsize=(16, 16));
plt.suptitle('EigenFaces');
for ii in range(pca.components_.shape[0]):
    plt.subplot(11, 11, ii + 1) # It starts with one
    plt.imshow(pca.components_[ii].reshape(64, 64), cmap=plt.cm.gray)
    plt.grid(False);
    plt.xticks([]);
    plt.yticks([]);
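Finally, to use these eigenfaces as a feature extractor (a hedged sketch building on the fitted pca object above, not part of the original notebook), we can project each face onto the 121 components to obtain a compact feature vector, and reconstruct an approximation of the original image from it:

# Project every face into the 121-dimensional eigenface space (feature extraction)
faces_features = pca.transform(faces_data)
print(faces_features.shape)               # (400, 121)

# Reconstruct an approximation of the first face from its 121 features
reconstruction = pca.inverse_transform(faces_features[0:1]).reshape(64, 64)
plt.figure(figsize=(6, 3))
plt.subplot(1, 2, 1); plt.imshow(faces_images[0], cmap=plt.cm.gray); plt.title('Original')
plt.subplot(1, 2, 2); plt.imshow(reconstruction, cmap=plt.cm.gray); plt.title('121-component reconstruction')

These 121-dimensional vectors are the feature representation a classifier (for example, a face recognizer) would consume instead of the raw 4096-dimensional pixels.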