Robust Regression and Outlier Detection via Gaussian Processes

Previously, I wrote about outlier detection using FFT and median filtering and outlier detection via MCMC. This post is the third in the outlier detection series.

In the last post, I showed that after removing the outliers, one can do a linear regression on the remaining data, which is called robust linear regression. However, instead of first detecting the outliers and then fitting the regression model, we can do better: choose a model that is robust to outliers and flexible enough to capture the main signal, yet not so opinionated that it dictates what shape the data must take.

Nonparametric Regression¶

Nonparametric regression provides such a mechanism. It makes few assumptions about the data and does not enforce a fixed functional form. This gives great flexibility: if your signal happens to be something the nonparametric model's base distribution can fit, you are in luck. However, to fully exploit a nonparametric model, one needs more data than in the parametric case, since the data must also determine the structure. Therefore, nonparametric models work best when data is abundant.

Why nonparametric regression instead of a model-dependent approach (say, linear regression)?¶

• The basic linear regression model assumes that errors are independent. Generally speaking, this condition does not hold; in particular, it fails for time series, which have strong temporal dependencies.
• Flexible representation. Parametric approaches are restricted by the model and its parameters, whereas nonparametric approaches are more flexible. They are still constrained by so-called hyperparameters, but do not suffer as much as parametric methods.
• The model is constructed from what the data suggests.
• The relation between observations and raw data is rarely linear, even when it looks linear. Summarizing a richer relation into an intercept and a slope not only captures a very approximate relation but may also lead to wrong conclusions.
• Basically, we need better models that capture the richness of the data.

Gaussian Processes¶

One of the most commonly used nonparametric models is the Gaussian process. Gaussian processes can be used for both regression and classification. They bring flexible priors (Gaussian, as you could guess) and a nonparametric approach to the regression world. Not only that, they also provide the best linear unbiased estimate (so-called BLUE) for the prediction. This behavior differs significantly from spline-like regression, which only provides a smoothness guarantee based on the regularization term. So you can approach the problem in a Bayesian setting out of the box, and thanks to the nice properties of the Gaussian distribution, the posterior distribution can be obtained via analytical expressions.
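To get a feel for what a Gaussian process prior looks like before any data arrives, one can draw sample functions from it. Below is a minimal numpy sketch (all names are illustrative) using a squared-exponential kernel, the covariance function we will also use later:

```python
import numpy as np

def sq_exp_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential (RBF) covariance between two sets of points."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

xs = np.linspace(0, 10, 100)
K = sq_exp_kernel(xs, xs)

# A small jitter on the diagonal keeps the Cholesky factorization stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(xs)))

rng = np.random.default_rng(42)
samples = L @ rng.normal(size=(len(xs), 3))  # three function draws from the prior
print(samples.shape)  # (100, 3)
```

Each column of `samples` is one smooth random function; the length scale of the kernel controls how wiggly the draws are.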

Regression Problem¶

The good old linear regression problem can be formulated in the following way: we have $y$ and $X$, and we want to learn how they relate in a linear fashion, accounting for the noise: $$y = X^T w + \epsilon$$ where the noise is modeled as: $$\epsilon \sim \mathcal{N}(0, \sigma_n^2)$$
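As a quick sanity check on this formulation (a minimal sketch; all names and numbers are illustrative), ordinary least squares recovers $w$ from noisy data generated exactly as above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Rows of X: an intercept row of ones and one feature row.
X = np.vstack([np.ones(n), rng.uniform(0, 10, n)])
w_true = np.array([10.0, 2.0])

# y = X^T w + epsilon, with epsilon ~ N(0, 0.5^2)
y = X.T @ w_true + rng.normal(0, 0.5, size=n)

# Least-squares estimate of w
w_hat, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(w_hat)  # close to [10.0, 2.0]
```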

Gaussian Processes Problem Formulation¶

Likelihood¶

Given $X$ and $w$, the likelihood of $y$ is the following: $$P(y|X, w) = \prod_{i=1}^{n} p(y_i | x_i, w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma_n} \exp\left(-\frac{(y_i - x_i^T w)^2}{2\sigma_n^2}\right)$$

The Gaussian likelihood keeps us within the Gaussian family. If the prior is also chosen to be Gaussian, the posterior will be Gaussian as well. Not only that, the posterior computation boils down to linear algebra and matrix manipulations.

In a vectorized form, one could write the following expression:

$$P(y|X, w) = \frac{1}{(2\pi \sigma_n^2)^{n/2}} \exp\left(-\frac{1}{2\sigma_n^2}\|y - X^Tw\|^2\right)$$

This expression can be written in shorthand as: $$P(y|X, w) = \mathcal{N}(X^T w, \sigma_n^2 I)$$
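The equivalence between the per-observation product and the vectorized multivariate form can be verified numerically. Below is a sketch with made-up data (`scipy.stats` is assumed available; all names are illustrative):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
n, d = 20, 2
X = rng.normal(size=(d, n))       # columns are the inputs x_i
w = rng.normal(size=d)
sigma_n = 0.3
y = X.T @ w + sigma_n * rng.normal(size=n)

# Product of independent per-observation Gaussian densities ...
lik_product = np.prod(norm.pdf(y, loc=X.T @ w, scale=sigma_n))

# ... equals the single multivariate normal N(X^T w, sigma_n^2 I).
lik_vector = multivariate_normal.pdf(y, mean=X.T @ w, cov=sigma_n**2 * np.eye(n))

print(np.isclose(lik_product, lik_vector))  # True
```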

With the likelihood in hand, we can move on to the posterior computation.

If we want to estimate the posterior distribution, we can use the famous Bayes' formula, ignoring the denominator: $$p(\text{posterior}) \propto p(\text{likelihood})\, p(\text{prior})$$

$$p(w | X,y) \propto \exp\left(-\frac{1}{2\sigma_n^2}\|y - X^Tw\|^2\right) \exp\left(-\frac{1}{2} w^T \Sigma_p^{-1} w\right)$$
$$p(w|X, y) \sim \mathcal{N}\left(\sigma_n^{-2} (\sigma_n^{-2} X X^T + \Sigma_p^{-1})^{-1} X y,\ (\sigma_n^{-2} X X^T + \Sigma_p^{-1})^{-1}\right)$$

Writing $A = \sigma_n^{-2} X X^T + \Sigma_p^{-1}$ for brevity, this becomes: $$p(w| X, y) \sim \mathcal{N}\left(\sigma_n^{-2} A^{-1} X y,\ A^{-1}\right)$$

where $\Sigma_p$ is the Gaussian prior covariance matrix.

The mean of a Gaussian distribution also happens to be its mode, so it is the maximum a posteriori (MAP) estimate of $w$ as well. The prediction $\bar{y}$ for a new observation $\bar{x}$ is immediate: $$p(\bar{y}|\bar{x}, X, y) = \mathcal{N}\left(\sigma_n^{-2} \bar{x}^T (\sigma_n^{-2} X X^T + \Sigma_p^{-1})^{-1} X y,\ \bar{x}^T (\sigma_n^{-2} X X^T + \Sigma_p^{-1})^{-1} \bar{x}\right)$$

The predictive distribution is also Gaussian, but I guess at this point nobody is really surprised. The interesting thing about this distribution is that its mean is nothing but the prediction $\bar{x}^T w$ averaged over the posterior distribution of all possible weights.

Note that, because we are using Gaussian priors and likelihoods, the posterior distribution can be obtained analytically with matrix operations, even though we started from a generic problem formulation.
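That closed-form posterior can be checked numerically. Below is a minimal sketch (all names illustrative): with $\Sigma_p = I$, the posterior mean $\sigma_n^{-2} A^{-1} X y$ coincides with the ridge regression solution $(XX^T + \sigma_n^2 I)^{-1} X y$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: d features, n observations; columns of X are the inputs x_i.
d, n = 3, 50
X = rng.normal(size=(d, n))
w_true = rng.normal(size=d)
sigma_n = 0.1                       # observation noise standard deviation
y = X.T @ w_true + sigma_n * rng.normal(size=n)

Sigma_p = np.eye(d)                 # Gaussian prior covariance on w

# Posterior: p(w | X, y) = N(sigma_n^{-2} A^{-1} X y, A^{-1}),
# with A = sigma_n^{-2} X X^T + Sigma_p^{-1}.
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
w_bar = np.linalg.solve(A, X @ y) / sigma_n**2

# With Sigma_p = I this is ridge regression with lambda = sigma_n^2.
w_ridge = np.linalg.solve(X @ X.T + sigma_n**2 * np.eye(d), X @ y)
print(np.allclose(w_bar, w_ridge))  # True
```

Using `np.linalg.solve` instead of explicitly inverting $A$ is the usual numerically stable way to evaluate these expressions.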

With the problem formulated, let's look at how one can implement Gaussian processes in the next section.

In [1]:
import numpy as np
import random
from sklearn.gaussian_process import GaussianProcess
from matplotlib import pyplot as plt


In [2]:
a = 2.
x = np.arange(50, 100, .5)
x += np.random.normal(size=x.size)
y = a * x + 10.

y_with_outlier = np.copy(y)

# Inject an outlier at every len(x)//5-th position.
step = len(x) // 5
for ii in range(step, len(x), step):
    y_with_outlier[ii] = 100. * (random.random() - .5) + y[ii]


I added some random noise to the signal before introducing the outliers, since I want small-variance deviations in the signal in addition to the very distant outliers.

In [3]:
plt.figure(figsize=(12, 6));
plt.scatter(range(len(y)), y, color="#7A68A6");
plt.title('Original Signal');
plt.xlim([0, len(y)]);
plt.savefig('original-signal.png', format='png', dpi=500)

In [4]:
plt.figure(figsize=(16, 8));
plt.scatter(range(y_with_outlier.size), y_with_outlier, color="#7A68A6");
plt.title('Signal with Outliers');
plt.xlim([0, len(y_with_outlier)]);
plt.savefig('original-signal-with-outlier.png', format='png', dpi=500)


Gaussian Processes via Scikit Learn¶

Scikit-learn in general provides a nice API for unsupervised and supervised learning, and Gaussian processes are no exception. Fitting a Gaussian process and then predicting observations takes only two function calls, fit and predict, after we create the GaussianProcess object. The nugget parameter is the normalized variance and determines how flexible the GaussianProcess will be in terms of capturing outliers.

In [5]:
X = np.atleast_2d(x).T
y = np.atleast_2d(y_with_outlier).T
x_pred = np.atleast_2d(np.linspace(X.min(), X.max())).T
gp = GaussianProcess(theta0=1e-4, nugget=1e-10);
gp.fit(X, y);

y_pred, mean_squared_error = gp.predict(x_pred, eval_MSE=True)
sigma = np.sqrt(mean_squared_error)
confidence_interval = 2.326 * sigma  # z-score for a two-sided 98% interval

In [9]:
plt.figure(figsize=(16, 12))
plt.scatter(X, y, color="#7A68A6", alpha=.5, label='Observations')
plt.plot(x_pred, y_pred, 'p', label='Predictions')
plt.title('Observations and Predictions');
plt.xlim([np.min(x), 120]);
plt.legend(loc='upper right');


We can label all of the observations that fall outside of the confidence interval as outliers, as shown in the next plot.

In [14]:
plt.figure(figsize=(16, 12))
plt.scatter(X, y, color="#348ABD", alpha=.5, label='Outliers')
plt.fill(np.concatenate([x_pred, x_pred[::-1]]),
         np.concatenate([y_pred - confidence_interval, (y_pred + confidence_interval)[::-1]]),
         alpha=.9, fc='#A60628', ec='None', label='98% confidence interval')
plt.title('Outliers and Confidence Interval');
plt.xlim([np.min(x), 120]);
plt.legend(loc='upper right');
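The banded plot above is visual; to actually flag points, we can apply the same threshold programmatically. Below is a self-contained sketch (the helper name and the synthetic stand-ins for the GP's predictive mean and standard deviation are illustrative, not the notebook's variables):

```python
import numpy as np

def flag_outliers(y, mu, sigma, z=2.326):
    """Flag observations outside the z*sigma predictive band (z=2.326 ~ 98%)."""
    return np.abs(y - mu) > z * sigma

# Synthetic check: a clean linear signal with three injected outliers.
x = np.linspace(0, 10, 50)
mu = 2 * x + 10                       # stand-in for the GP predictive mean
sigma = np.full_like(x, 1.0)          # stand-in for the predictive std
y = mu + 0.3 * np.random.default_rng(3).normal(size=x.size)
y[[10, 25, 40]] += 15                 # inject outliers

print(flag_outliers(y, mu, sigma).sum())  # 3
```

The same boolean mask can then index the observations for plotting or removal.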


Note that Gaussian processes, and nonparametric methods in general, are also very well suited to fitting noisy data; in theory we could learn any arbitrary function, even a noisy one, using Gaussian processes. Similar to the example that Scikit-Learn has for Gaussian process regression (they use $x \sin(x)$), in the following example we want to learn a function of the form: $$f(x) = 2\, x \, \sin(x)\, \cos(x)$$ and then make predictions on the data. Here, our main aim is not to detect outliers, but simply to do regression via Gaussian processes.

In [63]:
f = lambda k: 2 * k * np.sin(k) * np.cos(k)

X = np.linspace(0.1, 9.9, 20)
X = np.atleast_2d(X).T

y = f(X).ravel()
dy = 0.5 + 1.0 * np.random.random(y.shape)
noise = np.random.normal(0, dy)
y += noise

x = np.atleast_2d(np.linspace(0, 10, 1000)).T
gp = GaussianProcess(corr='squared_exponential', theta0=1e-1,
                     thetaL=1e-3, thetaU=1,
                     nugget=(dy / y) ** 2,
                     random_start=100)
gp.fit(X, y)
y_pred, MSE = gp.predict(x, eval_MSE=True)
sigma = np.sqrt(MSE)

fig = plt.figure(figsize=(12, 12))
plt.plot(x, f(x), color='#7A68A6', linestyle=':', label='$f(x) = 2\,x\,\sin(x)\,\cos(x)$')
plt.plot(x, y_pred, color='#A60628', linestyle='-', label='Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                         (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.5, fc='#A60628', ec='None', label='95% confidence interval')
plt.xlabel('$x$')
plt.ylabel('$f(x)$')
plt.legend(loc='upper right');


Note that as we get more and more data, we become more confident in our predictions. This is generally true for nonparametric models, as they are more flexible and data-dependent compared to parametric models. The probabilistic nature of the model gives us confidence intervals, and these can be used to classify observations (outlier vs. inlier). Consider this scenario: you want to detect outliers in real time, but instead of approaching the problem as a classification problem, you could take the old regression model that you trained and convert it into a classifier without extra work. Moreover, as you get more and more data, you can both train your "classifier" and do prediction at the same time. Sounds quite awesome.

What is next?¶

Up to now, I have mostly used toy examples and very simple regression models. However, for models to be useful, they need to solve real-world problems. I will compare the approaches that I have introduced so far on real-world examples.