Outlier Detection in Time Series Signals

Bugra Akyildiz
@bugraa
bugra.github.io

Outlier

Data problems

  • Missing Observations
  • Noise
  • Electrical Hum
  • Outliers

Outliers

  • Logistical Problems
  • Data transmission or transcription.
  • Experimental Errors
  • Erroneous Procedures
  • Measurement Errors
  • Wrong Data Entry
  • Component Failures

What is it?

The problem, as always, is what the heck does one mean by "outlier" in these contexts. Seems to be like pornography -- "I know it when I see it." - Berton Gunter, R-help, December 2009
Out-liar

Outlier Definition

An observation in a data set which appears to be inconsistent with the remainder of that set of data

## Median Filter $$ y[m] = median\\{ x[m-\frac{w}{2}, m+\frac{w}{2}] \\} $$ where $w$ is the window size whose median will replace the original data point. - Not linear - Does not smooth the signal. (Does not blur the images) - Robust to outliers!

Toy Example

Median Filter
  • Window size: $w$ = 3
  • Extend left-most and right-most values to compute boundaries

Signal with Outliers

Original Signal Signal with Outlier
Original Signal

Median Filtered Signal

Original Signal

Know Your Tools

Baby Elephant

Fast Fourier Transform (FFT)

Noise Removal

  • Not necessarily distant than main signal
  • Not necessarily random
  • May have dependency to the original signal
  • Outlier Detection

  • Distant observations than main signal
  • ## DFT(Discrete Fourier Transform) Definition $N$ point FFT of $x[n]$ sequence $$X[k] = \sum_{n=0}^{N-1} x[n] e^{\frac{-j 2 \pi n k}{N}}$$ where $w$(frequency) is sampled by $\frac{2\pi k}{N}$ for $k = 0,1, \ldots, N-1$. ## IDFT(Inverse DFT) Definition $$ x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{\frac{j 2\pi n k}{N}} $$ for $n=0, 1, \ldots, N-1$

    FFT Block

    Original Signal
    # Euler's Formula ## $$ e^{jx} = cos x + j sin x $$

    Fourier Expansion

    FFT Visualization

    Fourier Expansion Square Wave

    FFT Visualization

    Fourier Transform Visualization

    FFT Visualization

    FFT for Noise Removal

    Original Signal Noisy Signal

    Frequency Response

    Original Signal

    Frequency Response of Filtered Signal

    Original Signal
    Original Signal

    Outliers detected by FFT

    Original Signal

    Outlier Detection via Monte Carlo Markov Chain

    • Monte Carlo Sampling: To estimate characteristics of a distribution
      • Mean
      • Variance
      • Kurtosis
    • Markov Chain: Stochastic process to sample different _states_ from a stationay distribution
    • MCMC: Design a Markov Chain in such a way that the stationary distribution is the distribution that we want to estimate (target distribution)
    ## Metropolis-Hastings Sampler We want to sample from $p(x)$, creating the following Markov Chain: $$x^{(1)} \rightarrow x^{(2)} \rightarrow \ldots \rightarrow x^{(t)} \rightarrow \ldots $$ where $t$ is the iteration number in Markov Chain - Initialize $x^{(1)}$ - Use proposal distribution $q(x_s|x_t)$ to sample candidate value $x_s$ given the current value of $x_t$ - Assuming the transition matrix is symmetric, compute $a_{prob}(x_t, x_s) = min [1, \frac{p(x_s)}{p(x_t)}]$ - Based on acceptance criterion($a_{prob}$), accept or reject the observation.

    Basic Metropolis Sampler

    
    n_iterations = 100000
    burn_in = 20000
    t = np.zeros(n_iterations)
    ii = 0
    sigma = 1.
    
    t_previous = np.random.randn()
    posterior_distribution = lambda x: np.exp(-x * x/2.)
    proposal_distribution = lambda mu, sigma: sigma * np.random.randn() + mu
    
    while ii < n_iterations:
        t_proposed = proposal_distribution(t_previous, sigma)
        proposed_posterior = posterior_distribution(t_proposed)
        previous_posterior = posterior_distribution(t_previous)
        acceptance_criterion = np.min([proposed_posterior / previous_posterior, 1.0])
        temp = np.random.rand()
        if temp < acceptance_criterion : # acceptance criteria
            t[ii]=t_proposed
            t_previous=t_proposed
        else:
            t[ii]=t_previous                
        ii += 1
        

    This is what it looks like in PyMC

    PyMC Code

    Original Signal and Signal with Outliers

    Original Signal Signal with outliers
    Posterior Plots

    Box Plot of Outlier and Original Signal

    Original Signal

    Why this is important?

    Noisy Signal

    Outliers are captured via MCMC

    Outliers are captured
    Out-liar

    Advantages

    • As long as the model is nicely captured, the priors do not have to be known beforehand. The data leads the way.
    • If priors are known, they could be incorporated into the model whereas other models cannot use prior information.
    • Probabilistic, you could reason about probabilities whereas frequentist approaches do not provide a similar functionality like Bayesain methods.
    • One can build complex models in Bayesian setting whereas this is not possible for Frequentist methods as you depend on the method rather than models.

    Disadvantages

    • You need to design priors, models and inference schema beforehand whereas you do not need to do these things in frequentist methods.
    • Generally both computation and algorithm-wise, frequentist approaches tend to be faster and simpler.
    Baby-Elephant

    Credits

    • http://xkcd.com/
    • http://1ucasvb.tumblr.com/
    • Source of the presentation: https://github.com/bugra/pydata-sv-2014