Outlier Detection via Markov Chain Monte Carlo

Previously, I wrote outlier detection using FFT and Median Filtering and this post will be second in that series where I will look at the outlier detection in time-series using Markov Chaing Monte Carlo(MCMC). If you are familiar with Python and want to use MCMC, you should definitely check out PyMC, which I generally use for Bayesian Modeling and MCMC in Python. Although my main focus is the outliers, since I will be using a linear signal as an example, this method could be also used as a robust linear regression to ignore the outliers and focus on the main signal.

I will give a talk next week in PyData(on Sunday) focusing on these approaches for outlier detection in time series signals, use go2-PYDATA code to get 15% off if you are not yet registered.

After shameless promotion, let's get started.

In [1]:
import numpy as np
import scipy
import pymc
import random
from matplotlib import pyplot as plt
Couldn't import dot_parser, loading of dot files will not be possible.

In [29]:
a = 2
x = np.arange(10,50,.5)
# Add some noise
x += np.random.normal(size=x.size)
y = a * x + 10

#y = a * np.sin(x) + 5
y_with_outlier = np.copy(y)

# Add some outliers
for ii in np.arange(len(x)/5, len(x), len(x)/5.):
    y_with_outlier[ii]= 60*(random.random()-.5) + y[ii]
    
y_with_outlier = np.asarray(y_with_outlier)
# assuming I do not know, just fill it with ones
spread = np.asarray([1 for _ in range(y_with_outlier.size)])

I put some random noise to the signal before introducing the outliers as I want to preserve the small variance but not very distant ones.

In [30]:
plt.figure(figsize=(12, 6));
plt.scatter(range(len(y)), y, color="#7A68A6");
plt.title('Original Signal');
plt.xlim([0, len(y)]);
In [31]:
plt.figure(figsize=(12, 6));
plt.scatter(range(y_with_outlier.size), y_with_outlier, color="#7A68A6");
plt.title('Signal with Outliers');
plt.xlim([0, len(y_with_outlier)]);

Outliers are much more distant than the noise so visually it is easy to detect in general. In this application, I choose a basic linear signal, but as long as the signal can be modeled(however all of the models are wrong), then I could capture outliers(do not forget some models are useful).

Priors

All of the priors are chosen as uniform as I do not assume any prior knowledge where the outliers may occur in the signal. If you have prior knowledge of what might be outliers look like in the distribution, this would help both the inference as well as the outlier detection. If you do not have the priors, the data will lead the way eventually.

If the component that measures the signal has a failure rate, that probability could be easily incorporated to our model. Especially, if you have small data(nowadays, everyone has big, but just for the sake argument), those priors would become handy in order to learn the posterior distribution of the variables that you are interested.

\[p_{outlier} = U(0, 1)\] \[p_{mean-outlier} = U(-100, 100)\] \[p_{spread-outlier} = U(-100, 100)\]

In [32]:
outlier_points = pymc.Uniform('outlier_points', 0, 1.0, value=0.1)
mean_outliers = pymc.Uniform('mean_outliers', -100, 100, value=0)
spread_outliers = pymc.Uniform('spread_outliers', -100, 100, value=0)

Models

I define basic linear regression as model and sample slope and intercept assuming the signal that shows linearity results in higher posterior probabilities than the ones that do not obey(read outliers). The inlier points are chosen from Bernoulli distribution: \[p_{inlier} = B(n, 1-p_{outlier})\] where \(n\) is the signal length. Note that I also define another posterior computation for outlier based on gaussian distribution in this section. This will be used to model the outliers in the next cell. The idea behind this approach, the distant observations(some call outliers), will have lower posterior, and by doing so I could identify them.

In [33]:
@pymc.stochastic
def slope_and_intercept(value=[1., 10.]):
    slope, intercept = value
    prob_slope = np.log(1. / (1. + slope ** 2))
    return prob_slope

@pymc.deterministic
def model_(x=x, slope_and_intercept=slope_and_intercept):
    slope, intercept = slope_and_intercept
    fit = slope * x + intercept 
    return fit

inlier = pymc.Bernoulli('inlier', p=1 - outlier_points, value=np.zeros(x.size))

def log_posterior_likelihood_of_outlier(y_with_outlier, mu, spread, inlier, mean_outliers, spread_outliers):
    inlier_posterior = np.sum(inlier * (np.log(2 * np.pi * spread ** 2) + (y_with_outlier - mu) ** 2 / (spread ** 2)))
    outlier_posterior = np.sum((1 - inlier) * (np.log(2 * np.pi * ((spread ** 2) + (spread_outliers ** 2))) + (y_with_outlier - mean_outliers) ** 2 / ((spread ** 2) + (spread_outliers ** 2))))
    return -0.5 * (inlier_posterior + outlier_posterior)

If you want to define your own distribution(or wrap some distribution for your own purposes), PyMC supports both stochastic_from_dist and stochastic_from_data to get your custom distributions. Then, you wrap all of the models and variables into one model to infer the unknowns. I am interested in the outliers so I define our outlier model based on Gaussian model where I would use inlier and outlier mean and spread to capture the differences between inlier and outlier distribution. The basic assumption is that the spread of the outlier as you could see from the graph is larger than inliers. This stems from the definition of the outlier as it could be defined as the the most distant observations from the distribution. I will use the posterior likelihood of the outlier in the custom distribution.(mv just says the variable is in the array form, what it stands for is beyond my faculty, though.

In [50]:
outlier_distribution = pymc.stochastic_from_dist('outlier_distribution',
                                              logp=log_posterior_likelihood_of_outlier,
                                              dtype=np.float,
                                              mv=True)


outlier_dist = outlier_distribution('outlier_dist', 
                          mu=model_, 
                          spread=spread,
                          mean_outliers=mean_outliers, 
                          spread_outliers=spread_outliers,
                          inlier=inlier,
                          observed=True, 
                          value=y_with_outlier)

model = dict(outlier_dist=outlier_dist, 
             slope_and_intercept=slope_and_intercept, 
             model_=model_,
             inlier=inlier, 
             outlier_points=outlier_points, 
             mean_outliers=mean_outliers, 
             spread_outliers=spread_outliers
            )

Inference

After getting the posterior probabilities, I will get the outliers based on the probability. The points that have smaller probabilities than some threshold become our outliers. If I want to be more restrictive on outliers, I could put a smaller threshold. Then, I would get only very distant outliers, whereas by increasing the probability I would capture the signal which has low variance. Threshold in here exists but this threshold is actually a probability measure of the posterior distribution, so it is very useful. Since it is not arbitrary, it generalizes better than the thresholds that are put in the frequentist approaches.

In [51]:
prob_threshold = 0.40

mcmc = pymc.MCMC(model)
mcmc.sample(100000, 20000)

slope_and_intercept_trace = mcmc.trace('slope_and_intercept')[:]
model_trace = mcmc.trace('model_')[:]
outlier_points_trace = mcmc.trace('outlier_points')[:]
spread_outliers_trace = mcmc.trace('spread_outliers')[:]

probability_of_points = mcmc.trace('inlier')[:].astype(float).mean(0)
outlier_x = x[probability_of_points < prob_threshold]
outlier_y = y_with_outlier[probability_of_points < prob_threshold]

 [                  0%                  ] 220 of 100000 complete in 0.5 sec
 [                  0%                  ] 688 of 100000 complete in 1.0 sec
 [                  1%                  ] 1164 of 100000 complete in 1.5 sec
 [                  1%                  ] 1631 of 100000 complete in 2.0 sec
 [                  1%                  ] 1987 of 100000 complete in 2.5 sec
 [                  2%                  ] 2406 of 100000 complete in 3.0 sec
 [-                 2%                  ] 2786 of 100000 complete in 3.5 sec
 [-                 3%                  ] 3197 of 100000 complete in 4.0 sec
 [-                 3%                  ] 3531 of 100000 complete in 4.5 sec
 [-                 3%                  ] 3871 of 100000 complete in 5.0 sec
 [-                 4%                  ] 4216 of 100000 complete in 5.5 sec
 [-                 4%                  ] 4631 of 100000 complete in 6.0 sec
 [-                 5%                  ] 5065 of 100000 complete in 6.5 sec
 [--                5%                  ] 5521 of 100000 complete in 7.0 sec
 [--                5%                  ] 5920 of 100000 complete in 7.5 sec
 [--                6%                  ] 6342 of 100000 complete in 8.0 sec
 [--                6%                  ] 6762 of 100000 complete in 8.5 sec
 [--                7%                  ] 7112 of 100000 complete in 9.0 sec
 [--                7%                  ] 7558 of 100000 complete in 9.5 sec
 [--                7%                  ] 7873 of 100000 complete in 10.0 sec
 [---               8%                  ] 8301 of 100000 complete in 10.5 sec
 [---               8%                  ] 8815 of 100000 complete in 11.0 sec
 [---               9%                  ] 9292 of 100000 complete in 11.5 sec
 [---               9%                  ] 9705 of 100000 complete in 12.0 sec
 [---              10%                  ] 10235 of 100000 complete in 12.5 sec
 [----             10%                  ] 10724 of 100000 complete in 13.0 sec
 [----             11%                  ] 11146 of 100000 complete in 13.5 sec
 [----             11%                  ] 11606 of 100000 complete in 14.0 sec
 [----             11%                  ] 11956 of 100000 complete in 14.5 sec
 [----             12%                  ] 12298 of 100000 complete in 15.0 sec
 [----             12%                  ] 12725 of 100000 complete in 15.5 sec
 [----             13%                  ] 13096 of 100000 complete in 16.0 sec
 [-----            13%                  ] 13505 of 100000 complete in 16.5 sec
 [-----            13%                  ] 13880 of 100000 complete in 17.0 sec
 [-----            14%                  ] 14267 of 100000 complete in 17.5 sec
 [-----            14%                  ] 14559 of 100000 complete in 18.0 sec
 [-----            14%                  ] 14934 of 100000 complete in 18.5 sec
 [-----            15%                  ] 15207 of 100000 complete in 19.0 sec
 [-----            15%                  ] 15511 of 100000 complete in 19.5 sec
 [------           15%                  ] 15791 of 100000 complete in 20.0 sec
 [------           16%                  ] 16125 of 100000 complete in 20.5 sec
 [------           16%                  ] 16478 of 100000 complete in 21.0 sec
 [------           16%                  ] 16774 of 100000 complete in 21.5 sec
 [------           16%                  ] 16984 of 100000 complete in 22.0 sec
 [------           17%                  ] 17250 of 100000 complete in 22.5 sec
 [------           17%                  ] 17605 of 100000 complete in 23.0 sec
 [------           18%                  ] 18019 of 100000 complete in 23.5 sec
 [------           18%                  ] 18379 of 100000 complete in 24.0 sec
 [-------          18%                  ] 18723 of 100000 complete in 24.5 sec
 [-------          19%                  ] 19083 of 100000 complete in 25.0 sec
 [-------          19%                  ] 19440 of 100000 complete in 25.5 sec
 [-------          19%                  ] 19661 of 100000 complete in 26.0 sec
 [-------          19%                  ] 19964 of 100000 complete in 26.5 sec
 [-------          20%                  ] 20229 of 100000 complete in 27.0 sec
 [-------          20%                  ] 20496 of 100000 complete in 27.5 sec
 [-------          20%                  ] 20727 of 100000 complete in 28.1 sec
 [-------          20%                  ] 20999 of 100000 complete in 28.6 sec
 [--------         21%                  ] 21457 of 100000 complete in 29.1 sec
 [--------         21%                  ] 21928 of 100000 complete in 29.6 sec
 [--------         22%                  ] 22243 of 100000 complete in 30.1 sec
 [--------         22%                  ] 22553 of 100000 complete in 30.6 sec
 [--------         22%                  ] 22875 of 100000 complete in 31.1 sec
 [--------         23%                  ] 23211 of 100000 complete in 31.6 sec
 [--------         23%                  ] 23539 of 100000 complete in 32.1 sec
 [---------        23%                  ] 23900 of 100000 complete in 32.6 sec
 [---------        24%                  ] 24317 of 100000 complete in 33.1 sec
 [---------        24%                  ] 24640 of 100000 complete in 33.6 sec
 [---------        24%                  ] 24979 of 100000 complete in 34.1 sec
 [---------        25%                  ] 25279 of 100000 complete in 34.6 sec
 [---------        25%                  ] 25600 of 100000 complete in 35.1 sec
 [---------        25%                  ] 25921 of 100000 complete in 35.6 sec
 [---------        26%                  ] 26272 of 100000 complete in 36.1 sec
 [----------       26%                  ] 26607 of 100000 complete in 36.6 sec
 [----------       26%                  ] 26998 of 100000 complete in 37.1 sec
 [----------       27%                  ] 27355 of 100000 complete in 37.6 sec
 [----------       27%                  ] 27712 of 100000 complete in 38.1 sec
 [----------       28%                  ] 28037 of 100000 complete in 38.6 sec
 [----------       28%                  ] 28361 of 100000 complete in 39.1 sec
 [----------       28%                  ] 28663 of 100000 complete in 39.6 sec
 [-----------      28%                  ] 28989 of 100000 complete in 40.1 sec
 [-----------      29%                  ] 29171 of 100000 complete in 40.6 sec
 [-----------      29%                  ] 29410 of 100000 complete in 41.1 sec
 [-----------      29%                  ] 29725 of 100000 complete in 41.6 sec
 [-----------      30%                  ] 30036 of 100000 complete in 42.1 sec
 [-----------      30%                  ] 30360 of 100000 complete in 42.6 sec
 [-----------      30%                  ] 30683 of 100000 complete in 43.1 sec
 [-----------      31%                  ] 31023 of 100000 complete in 43.6 sec
 [-----------      31%                  ] 31355 of 100000 complete in 44.1 sec
 [------------     31%                  ] 31726 of 100000 complete in 44.6 sec
 [------------     32%                  ] 32080 of 100000 complete in 45.1 sec
 [------------     32%                  ] 32419 of 100000 complete in 45.6 sec
 [------------     32%                  ] 32749 of 100000 complete in 46.1 sec
 [------------     33%                  ] 33054 of 100000 complete in 46.6 sec
 [------------     33%                  ] 33316 of 100000 complete in 47.1 sec
 [------------     33%                  ] 33659 of 100000 complete in 47.6 sec
 [------------     34%                  ] 34004 of 100000 complete in 48.1 sec
 [-------------    34%                  ] 34373 of 100000 complete in 48.6 sec
 [-------------    34%                  ] 34683 of 100000 complete in 49.1 sec
 [-------------    35%                  ] 35004 of 100000 complete in 49.6 sec
 [-------------    35%                  ] 35322 of 100000 complete in 50.1 sec
 [-------------    35%                  ] 35660 of 100000 complete in 50.6 sec
 [-------------    35%                  ] 35988 of 100000 complete in 51.1 sec
 [-------------    36%                  ] 36288 of 100000 complete in 51.6 sec
 [-------------    36%                  ] 36649 of 100000 complete in 52.1 sec
 [--------------   36%                  ] 36968 of 100000 complete in 52.6 sec
 [--------------   37%                  ] 37270 of 100000 complete in 53.1 sec
 [--------------   37%                  ] 37616 of 100000 complete in 53.6 sec
 [--------------   37%                  ] 37943 of 100000 complete in 54.1 sec
 [--------------   38%                  ] 38279 of 100000 complete in 54.6 sec
 [--------------   38%                  ] 38646 of 100000 complete in 55.1 sec
 [--------------   39%                  ] 39005 of 100000 complete in 55.6 sec
 [--------------   39%                  ] 39382 of 100000 complete in 56.1 sec
 [---------------  39%                  ] 39744 of 100000 complete in 56.6 sec
 [---------------  40%                  ] 40073 of 100000 complete in 57.1 sec
 [---------------  40%                  ] 40440 of 100000 complete in 57.6 sec
 [---------------  40%                  ] 40832 of 100000 complete in 58.1 sec
 [---------------  41%                  ] 41175 of 100000 complete in 58.6 sec
 [---------------  41%                  ] 41517 of 100000 complete in 59.1 sec
 [---------------  41%                  ] 41873 of 100000 complete in 59.6 sec
 [---------------- 42%                  ] 42215 of 100000 complete in 60.1 sec
 [---------------- 42%                  ] 42556 of 100000 complete in 60.6 sec
 [---------------- 42%                  ] 42911 of 100000 complete in 61.1 sec
 [---------------- 43%                  ] 43305 of 100000 complete in 61.6 sec
 [---------------- 43%                  ] 43650 of 100000 complete in 62.1 sec
 [---------------- 43%                  ] 43973 of 100000 complete in 62.6 sec
 [---------------- 44%                  ] 44316 of 100000 complete in 63.1 sec
 [---------------- 44%                  ] 44658 of 100000 complete in 63.6 sec
 [-----------------44%                  ] 44995 of 100000 complete in 64.1 sec
 [-----------------45%                  ] 45356 of 100000 complete in 64.6 sec
 [-----------------45%                  ] 45686 of 100000 complete in 65.1 sec
 [-----------------46%                  ] 46074 of 100000 complete in 65.6 sec
 [-----------------46%                  ] 46417 of 100000 complete in 66.1 sec
 [-----------------46%                  ] 46752 of 100000 complete in 66.6 sec
 [-----------------47%                  ] 47090 of 100000 complete in 67.1 sec
 [-----------------47%                  ] 47432 of 100000 complete in 67.6 sec
 [-----------------47%                  ] 47768 of 100000 complete in 68.1 sec
 [-----------------48%                  ] 48129 of 100000 complete in 68.6 sec
 [-----------------48%                  ] 48487 of 100000 complete in 69.1 sec
 [-----------------48%                  ] 48819 of 100000 complete in 69.6 sec
 [-----------------49%                  ] 49215 of 100000 complete in 70.1 sec
 [-----------------49%                  ] 49532 of 100000 complete in 70.6 sec
 [-----------------49%                  ] 49865 of 100000 complete in 71.1 sec
 [-----------------50%                  ] 50191 of 100000 complete in 71.6 sec
 [-----------------50%                  ] 50503 of 100000 complete in 72.1 sec
 [-----------------50%                  ] 50811 of 100000 complete in 72.6 sec
 [-----------------51%                  ] 51148 of 100000 complete in 73.1 sec
 [-----------------51%                  ] 51454 of 100000 complete in 73.6 sec
 [-----------------51%                  ] 51792 of 100000 complete in 74.1 sec
 [-----------------52%                  ] 52138 of 100000 complete in 74.6 sec
 [-----------------52%                  ] 52471 of 100000 complete in 75.1 sec
 [-----------------52%                  ] 52871 of 100000 complete in 75.6 sec
 [-----------------53%                  ] 53221 of 100000 complete in 76.1 sec
 [-----------------53%                  ] 53563 of 100000 complete in 76.6 sec
 [-----------------53%                  ] 53891 of 100000 complete in 77.1 sec
 [-----------------54%                  ] 54324 of 100000 complete in 77.6 sec
 [-----------------54%                  ] 54675 of 100000 complete in 78.1 sec
 [-----------------55%                  ] 55004 of 100000 complete in 78.6 sec
 [-----------------55%-                 ] 55327 of 100000 complete in 79.1 sec
 [-----------------55%-                 ] 55663 of 100000 complete in 79.6 sec
 [-----------------55%-                 ] 55964 of 100000 complete in 80.1 sec
 [-----------------56%-                 ] 56239 of 100000 complete in 80.6 sec
 [-----------------56%-                 ] 56537 of 100000 complete in 81.1 sec
 [-----------------56%-                 ] 56830 of 100000 complete in 81.6 sec
 [-----------------57%-                 ] 57104 of 100000 complete in 82.2 sec
 [-----------------57%-                 ] 57410 of 100000 complete in 82.7 sec
 [-----------------57%-                 ] 57727 of 100000 complete in 83.2 sec
 [-----------------58%--                ] 58062 of 100000 complete in 83.7 sec
 [-----------------58%--                ] 58387 of 100000 complete in 84.2 sec
 [-----------------58%--                ] 58694 of 100000 complete in 84.7 sec
 [-----------------59%--                ] 59008 of 100000 complete in 85.2 sec
 [-----------------59%--                ] 59324 of 100000 complete in 85.7 sec
 [-----------------59%--                ] 59643 of 100000 complete in 86.2 sec
 [-----------------59%--                ] 59957 of 100000 complete in 86.7 sec
 [-----------------60%--                ] 60276 of 100000 complete in 87.2 sec
 [-----------------60%---               ] 60637 of 100000 complete in 87.7 sec
 [-----------------60%---               ] 60982 of 100000 complete in 88.2 sec
 [-----------------61%---               ] 61313 of 100000 complete in 88.7 sec
 [-----------------61%---               ] 61660 of 100000 complete in 89.2 sec
 [-----------------61%---               ] 61962 of 100000 complete in 89.7 sec
 [-----------------62%---               ] 62289 of 100000 complete in 90.2 sec
 [-----------------62%---               ] 62609 of 100000 complete in 90.7 sec
 [-----------------62%---               ] 62917 of 100000 complete in 91.2 sec
 [-----------------63%----              ] 63227 of 100000 complete in 91.7 sec
 [-----------------63%----              ] 63536 of 100000 complete in 92.2 sec
 [-----------------63%----              ] 63840 of 100000 complete in 92.7 sec
 [-----------------64%----              ] 64150 of 100000 complete in 93.2 sec
 [-----------------64%----              ] 64501 of 100000 complete in 93.7 sec
 [-----------------64%----              ] 64822 of 100000 complete in 94.2 sec
 [-----------------65%----              ] 65126 of 100000 complete in 94.7 sec
 [-----------------65%----              ] 65423 of 100000 complete in 95.2 sec
 [-----------------65%----              ] 65741 of 100000 complete in 95.7 sec
 [-----------------66%-----             ] 66062 of 100000 complete in 96.2 sec
 [-----------------66%-----             ] 66378 of 100000 complete in 96.7 sec
 [-----------------66%-----             ] 66679 of 100000 complete in 97.2 sec
 [-----------------67%-----             ] 67021 of 100000 complete in 97.7 sec
 [-----------------67%-----             ] 67309 of 100000 complete in 98.2 sec
 [-----------------67%-----             ] 67628 of 100000 complete in 98.7 sec
 [-----------------67%-----             ] 67911 of 100000 complete in 99.2 sec
 [-----------------68%-----             ] 68242 of 100000 complete in 99.7 sec
 [-----------------68%------            ] 68570 of 100000 complete in 100.2 sec
 [-----------------68%------            ] 68864 of 100000 complete in 100.7 sec
 [-----------------69%------            ] 69170 of 100000 complete in 101.2 sec
 [-----------------69%------            ] 69503 of 100000 complete in 101.7 sec
 [-----------------69%------            ] 69783 of 100000 complete in 102.2 sec
 [-----------------70%------            ] 70101 of 100000 complete in 102.7 sec
 [-----------------70%------            ] 70401 of 100000 complete in 103.2 sec
 [-----------------70%------            ] 70711 of 100000 complete in 103.7 sec
 [-----------------71%------            ] 71027 of 100000 complete in 104.2 sec
 [-----------------71%-------           ] 71330 of 100000 complete in 104.7 sec
 [-----------------71%-------           ] 71629 of 100000 complete in 105.2 sec
 [-----------------72%-------           ] 72115 of 100000 complete in 105.7 sec
 [-----------------72%-------           ] 72604 of 100000 complete in 106.2 sec
 [-----------------72%-------           ] 72951 of 100000 complete in 106.7 sec
 [-----------------73%-------           ] 73265 of 100000 complete in 107.2 sec
 [-----------------73%-------           ] 73608 of 100000 complete in 107.7 sec
 [-----------------73%--------          ] 73918 of 100000 complete in 108.2 sec
 [-----------------74%--------          ] 74248 of 100000 complete in 108.7 sec
 [-----------------74%--------          ] 74612 of 100000 complete in 109.2 sec
 [-----------------74%--------          ] 74954 of 100000 complete in 109.7 sec
 [-----------------75%--------          ] 75304 of 100000 complete in 110.2 sec
 [-----------------75%--------          ] 75691 of 100000 complete in 110.7 sec
 [-----------------76%--------          ] 76085 of 100000 complete in 111.2 sec
 [-----------------76%---------         ] 76495 of 100000 complete in 111.7 sec
 [-----------------76%---------         ] 76842 of 100000 complete in 112.2 sec
 [-----------------77%---------         ] 77196 of 100000 complete in 112.7 sec
 [-----------------77%---------         ] 77578 of 100000 complete in 113.2 sec
 [-----------------77%---------         ] 77918 of 100000 complete in 113.7 sec
 [-----------------78%---------         ] 78285 of 100000 complete in 114.2 sec
 [-----------------78%---------         ] 78641 of 100000 complete in 114.7 sec
 [-----------------78%----------        ] 78995 of 100000 complete in 115.2 sec
 [-----------------79%----------        ] 79371 of 100000 complete in 115.7 sec
 [-----------------79%----------        ] 79729 of 100000 complete in 116.2 sec
 [-----------------80%----------        ] 80050 of 100000 complete in 116.7 sec
 [-----------------80%----------        ] 80394 of 100000 complete in 117.2 sec
 [-----------------80%----------        ] 80786 of 100000 complete in 117.7 sec
 [-----------------81%----------        ] 81125 of 100000 complete in 118.2 sec
 [-----------------81%----------        ] 81451 of 100000 complete in 118.7 sec
 [-----------------81%-----------       ] 81764 of 100000 complete in 119.2 sec
 [-----------------82%-----------       ] 82132 of 100000 complete in 119.7 sec
 [-----------------82%-----------       ] 82496 of 100000 complete in 120.2 sec
 [-----------------82%-----------       ] 82865 of 100000 complete in 120.7 sec
 [-----------------83%-----------       ] 83193 of 100000 complete in 121.2 sec
 [-----------------83%-----------       ] 83572 of 100000 complete in 121.7 sec
 [-----------------83%-----------       ] 83901 of 100000 complete in 122.2 sec
 [-----------------84%------------      ] 84223 of 100000 complete in 122.7 sec
 [-----------------84%------------      ] 84597 of 100000 complete in 123.2 sec
 [-----------------84%------------      ] 84893 of 100000 complete in 123.7 sec
 [-----------------85%------------      ] 85231 of 100000 complete in 124.2 sec
 [-----------------85%------------      ] 85552 of 100000 complete in 124.7 sec
 [-----------------85%------------      ] 85899 of 100000 complete in 125.2 sec
 [-----------------86%------------      ] 86250 of 100000 complete in 125.7 sec
 [-----------------86%------------      ] 86584 of 100000 complete in 126.2 sec
 [-----------------86%-------------     ] 86926 of 100000 complete in 126.7 sec
 [-----------------87%-------------     ] 87269 of 100000 complete in 127.2 sec
 [-----------------87%-------------     ] 87631 of 100000 complete in 127.7 sec
 [-----------------87%-------------     ] 87965 of 100000 complete in 128.2 sec
 [-----------------88%-------------     ] 88274 of 100000 complete in 128.7 sec
 [-----------------88%-------------     ] 88597 of 100000 complete in 129.2 sec
 [-----------------88%-------------     ] 88944 of 100000 complete in 129.7 sec
 [-----------------89%-------------     ] 89319 of 100000 complete in 130.2 sec
 [-----------------89%--------------    ] 89690 of 100000 complete in 130.7 sec
 [-----------------90%--------------    ] 90066 of 100000 complete in 131.2 sec
 [-----------------90%--------------    ] 90401 of 100000 complete in 131.7 sec
 [-----------------90%--------------    ] 90769 of 100000 complete in 132.2 sec
 [-----------------91%--------------    ] 91241 of 100000 complete in 132.7 sec
 [-----------------91%--------------    ] 91673 of 100000 complete in 133.2 sec
 [-----------------92%--------------    ] 92076 of 100000 complete in 133.7 sec
 [-----------------92%---------------   ] 92529 of 100000 complete in 134.2 sec
 [-----------------93%---------------   ] 93014 of 100000 complete in 134.7 sec
 [-----------------93%---------------   ] 93372 of 100000 complete in 135.2 sec
 [-----------------93%---------------   ] 93722 of 100000 complete in 135.7 sec
 [-----------------94%---------------   ] 94069 of 100000 complete in 136.2 sec
 [-----------------94%---------------   ] 94432 of 100000 complete in 136.7 sec
 [-----------------94%----------------  ] 94748 of 100000 complete in 137.2 sec
 [-----------------95%----------------  ] 95108 of 100000 complete in 137.7 sec
 [-----------------95%----------------  ] 95532 of 100000 complete in 138.2 sec
 [-----------------95%----------------  ] 95975 of 100000 complete in 138.7 sec
 [-----------------96%----------------  ] 96320 of 100000 complete in 139.2 sec
 [-----------------96%----------------  ] 96653 of 100000 complete in 139.7 sec
 [-----------------97%----------------  ] 97053 of 100000 complete in 140.2 sec
 [-----------------97%----------------- ] 97547 of 100000 complete in 140.7 sec
 [-----------------98%----------------- ] 98043 of 100000 complete in 141.2 sec
 [-----------------98%----------------- ] 98542 of 100000 complete in 141.7 sec
 [-----------------98%----------------- ] 98995 of 100000 complete in 142.2 sec
 [-----------------99%----------------- ] 99329 of 100000 complete in 142.7 sec
 [-----------------99%----------------- ] 99646 of 100000 complete in 143.2 sec
 [-----------------99%----------------- ] 99984 of 100000 complete in 143.7 sec
 [-----------------100%-----------------] 100000 of 100000 complete in 143.8 sec

Posterior Plots

In [36]:
fix, ax = plt.subplots(figsize=(16,16))

ax = plt.subplot(411)
plt.hist(slope_and_intercept_trace[:,0], histtype='stepfilled', bins=30, alpha=.7,
                                        color="#A60628", normed=True);
plt.legend(loc="upper left");
plt.title("Posterior distributions");
plt.ylim([0, 50]);
plt.ylabel("Slope");

ax = plt.subplot(412);
plt.hist(slope_and_intercept_trace[:,1], histtype='stepfilled', bins=30, 
         alpha=.7, color="#7A68A6", normed=True);
plt.legend(loc="upper left");
plt.ylim([0, 2]);
plt.ylabel("Intercept");

plt.subplot(413);
plt.hist(outlier_points_trace, histtype='stepfilled', bins=100, alpha=.7,
         color="#348ABD",  normed=True);
plt.ylabel("Outlier Point Probability in Signal");

plt.subplot(414);
plt.hist(spread_outliers_trace, histtype='stepfilled', bins=100, alpha=.7,
         color="#188487",  normed=True);
plt.ylabel("Spread of Outliers");
plt.xlim([0, np.max(spread_outliers_trace)]);

The posterior is good in terms of, I remove most of our uncertainties(they do not look like uniform). Slope is max around 2, intercept gets maximized around 10, which are all good and very close to the main signal values. What else, I could observe spread of the outliers and what could be the ratio of outliers in the total signal, which are also good. I could also get estimate probabilities of spread in outliers and ratio of outliers in the signal which are very important measures in general even if they are not directly used for outlier detection.

But what about the outliers?

Outlier Plots

In [46]:
fig = plt.figure(figsize=(12, 12));
plt.boxplot([y_with_outlier, y]);
plt.xticks([1, 2], ['Outlier Signal', 'Original Signal']);

As expected, the outlier signal spread is larger than the original signal due to outliers, which could be quick sanity check for our methods before plotting the original signal along with outliers. Note that the mean change for different signals, especially the signals that vary a lot in time does not really show anything. I am trying to minimize the spread(variance) of the signal rather than mean. It just happens to be the outlier signal values have higher signal amplitude than the original signal.

In [37]:
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(x, y_with_outlier, c='#7A68A6', lw=1, label='Original Signal')
ax.scatter(outlier_x, outlier_y,  facecolors='#A60628', edgecolors='#7A68A6', label='Outliers', lw=1, s=100, alpha=0.2)

ax.set_xlim(np.min(x)-10, np.max(x) + 30);
ax.set_ylim(np.min(y_with_outlier)-10, np.max(y_with_outlier) + 20);
ax.set_title('Original Signal and Outliers');
plt.legend();

Outliers are nicely captured in the signal. Note that since the scale is somehow large, the signal seems to not having any noise, but it is not perfectly linear.

I do not want to go into the discussion of Bayesian is better than Frequentist, but here are my observations for this specific case of advantages and disadvantages of Bayesian approach cin comparison with Frequentist.

Advantages

  • Unlike Median Filtering and FFT, which requires a lot of parameters, MCMC requires only probability threshold.
  • As long as the model is nicely captured, the priors do not have to be known beforehand. The data leads the way.
  • If priors are known, they could be incorporated into the model whereas other models cannot use prior information.
  • Probabilistic, you could reason about probabilities whereas frequentist approaches do not provide a similar functionality like Bayesain methods.
  • We could build complex models, consider the outliers in this example. In Frequentist approaches, there is no similar mechanism to build such a system.

Disadvantages

  • You need to design priors, models and inference schema beforehand whereas you do not need to do these things in frequentist methods.
  • Generally both computation and algorithm-wise, frequentist approaches tend to be faster and simpler.

Next Steps

  • After removing outliers, we could do the linear regression on top of clean signal, which becomes robust as it will not get affected by the presence of outliers.
  • Real data experiments. (The toy datasets are good in terms of explaining some concept in detail as you get to play God, knowing the real values. However, real world data is a different animal, it would be harder and the algorithms applicability may differ drastically for dataset. For example, if you have a very irregular signal, median filtering or FFT may perform better than MCMC as they do not have to model the signal in the first place.
comments powered by Disqus