# An Introduction to Unsupervised Learning via Scikit Learn

## Unsupervised Learning

Unsupervised learning is the most widely applicable subfield of machine learning, since it does not require any labels in the dataset, and the world **itself** offers an abundance of unlabeled data. Human beings and their actions are recorded more and more every day (photographs on Instagram, health data through wearables, internet activity through cookies, and so on). Even the parts of our lives that are not yet digital will be recorded in the near future thanks to the Internet of Things. With datasets this diverse and unlabeled, unsupervised learning will only become more important.

It is useful not only for dimensionality reduction of the feature set (as a preprocessing step) but also as a feature extraction method. PCA (Principal Component Analysis) may be the most widely used unsupervised learning algorithm (PCA is to unsupervised learning what linear regression is to regression). It can be used for dimensionality reduction, and, since it tries to capture the variance while compressing (reducing) the data, the components it picks up can be *interesting* features in their own right, so it can serve as a feature extraction method as well.

In this notebook, I will use PCA both to reduce the dimensionality of the dataset and to build our feature vectors. This specific method is called EigenFace, because PCA extracts the eigenvectors of the face data, and those eigenvectors can themselves be visualized as faces.

## Dimensionality Reduction & Feature Extraction via PCA (EigenFace)

```
%matplotlib inline
import itertools
import numpy as np
import matplotlib as mlp
import matplotlib.pyplot as plt
import pandas as pd
import scipy
from sklearn import cluster
from sklearn import datasets
from sklearn import metrics
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition # PCA
import time
```

#### Modified Olivetti faces dataset

```
faces = datasets.fetch_olivetti_faces()
```

```
print(faces.DESCR)
```

```
faces_images = faces['images']
faces_data = faces.data
```

```
faces_images.shape
```

Images are nothing but reshaped `faces_data`. The image form makes it easier for us to visualize the faces in an image grid, but we will use `faces_data` to apply PCA. This is because PCA expects a two-dimensional array of shape `(n_observations, n_dimensions)`.

```
faces_data.shape
```

We have 400 instances and their vector dimension is 4096 (64 * 64). We could reduce this dimension by applying PCA (Principal Component Analysis), and PCA would extract features at the same time.
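The relationship between the image grid and the flat data can be checked directly: each 64 x 64 image flattens into one 4096-dimensional row vector. A minimal sketch with synthetic arrays standing in for the Olivetti faces:

```python
import numpy as np

# Five fake 64 x 64 "images" in place of the real faces
n_faces, side = 5, 64
images = np.random.RandomState(0).rand(n_faces, side, side)

# Flatten each image into one row, as PCA expects
data = images.reshape(n_faces, side * side)

print(data.shape)                                  # (5, 4096)
print(np.array_equal(data[0], images[0].ravel()))  # True
```

This is exactly the relationship between `faces.images` and `faces.data` in the dataset.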

Let's see some faces.

```
fig = plt.figure(figsize=(16, 16))
for ii in range(64):
    plt.subplot(8, 8, ii + 1)  # Subplot indices start at one
    plt.imshow(faces_images[ii], cmap=plt.cm.gray)
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
```

Let's first try 16 eigenfaces and see what kind of eigenfaces we get. Note that I am also passing `whiten=True` to PCA: whitening rescales each projected component to unit variance, so that the large-variance leading components do not dominate the rest in any downstream processing.
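The effect of whitening can be seen on synthetic data: after projection, every component has (approximately) unit variance. A small sketch, using `decomposition.PCA` and random data in place of `faces_data`:

```python
import numpy as np
from sklearn import decomposition

# Random data standing in for faces_data
X = np.random.RandomState(0).rand(200, 50)

# With whiten=True, projected components are rescaled to unit variance
whitened = decomposition.PCA(n_components=10, whiten=True).fit_transform(X)

# Each of the 10 component standard deviations is close to 1.0
print(whitened.std(axis=0, ddof=1))
```

Without `whiten=True`, the leading components would have much larger variances than the trailing ones.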

```
n_eigenfaces = 16
# Creating the PCA object; RandomizedPCA is deprecated, so we use PCA
# with the randomized SVD solver
pca = decomposition.PCA(n_components=n_eigenfaces, whiten=True, svd_solver='randomized')
# Fitting PCA to the data
pca.fit(faces_data)
```

```
pca.components_.shape
```

```
plt.figure(figsize=(16, 16))
plt.suptitle('EigenFaces')
for ii in range(pca.components_.shape[0]):
    plt.subplot(4, 4, ii + 1)  # Subplot indices start at one
    plt.imshow(pca.components_[ii].reshape(64, 64), cmap=plt.cm.gray)
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
```

```
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(16, 12))
    plt.title('Explained Variance Ratio over Component')
    plt.plot(pca.explained_variance_ratio_)
```

```
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(16, 12))
    plt.title('Cumulative Explained Variance over EigenFace')
    plt.plot(pca.explained_variance_ratio_.cumsum())
```

```
print('PCA captures {:.2f} percent of the variance in the dataset'.format(pca.explained_variance_ratio_.sum() * 100))
```

This is rather low; we generally want to capture at least 95% of the variance before we can say that the reduced representation reflects the dataset accurately. Let's increase the number of components.
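Rather than guessing the number of components, scikit-learn's `PCA` can pick it for a target variance fraction directly: passing a float in (0, 1) as `n_components` keeps the smallest number of components that reaches that fraction. A sketch on synthetic data standing in for `faces_data`:

```python
import numpy as np
from sklearn import decomposition

# Random data standing in for faces_data
X = np.random.RandomState(0).rand(400, 300)

# Keep just enough components to explain 95% of the variance
pca = decomposition.PCA(n_components=0.95)
pca.fit(X)

print(pca.n_components_)                            # components actually kept
print(pca.explained_variance_ratio_.sum() >= 0.95)  # True
```

Below, I instead pick 121 components by hand, which is convenient for plotting the eigenfaces on an 11 x 11 grid.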

```
n_eigenfaces = 121
# Creating the PCA object; RandomizedPCA is deprecated, so we use PCA
# with the randomized SVD solver
pca = decomposition.PCA(n_components=n_eigenfaces, whiten=True, svd_solver='randomized')
# Fitting PCA to the data
pca.fit(faces_data)
```

```
with plt.style.context('fivethirtyeight'):
    plt.figure(figsize=(16, 12))
    plt.title('Cumulative Explained Variance over EigenFace')
    plt.plot(pca.explained_variance_ratio_.cumsum())
```

```
print('PCA captures {:.2f} percent of the variance in the dataset'.format(pca.explained_variance_ratio_.sum() * 100))
```

This is good enough. Let's look what kind of eigenfaces we have this time.

```
plt.figure(figsize=(16, 16))
plt.suptitle('EigenFaces')
for ii in range(pca.components_.shape[0]):
    plt.subplot(11, 11, ii + 1)  # Subplot indices start at one
    plt.imshow(pca.components_[ii].reshape(64, 64), cmap=plt.cm.gray)
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
```
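Once fitted, the eigenfaces act as a basis: each face is approximated by a weighted sum of them, and `inverse_transform` maps the low-dimensional projections back to pixel space. The more components we keep, the closer the reconstruction. A sketch with random data standing in for `faces_data`:

```python
import numpy as np
from sklearn import decomposition

# Random data standing in for faces_data (400 instances, 4096 pixels)
X = np.random.RandomState(0).rand(400, 4096)

def reconstruction_error(n_components):
    # Project down to n_components and back, then measure the pixel MSE
    pca = decomposition.PCA(n_components=n_components).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_hat) ** 2)

# Keeping more eigenfaces gives a more faithful reconstruction
print(reconstruction_error(16) > reconstruction_error(121))  # True
```

On the real faces, reshaping a row of the reconstruction back to 64 x 64 shows a smoothed version of the original face.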