## Data problems

- Missing Observations
- Noise
- Electrical Hum
- Outliers

# Outliers

- Logistical Problems
- Data transmission or transcription.
- Experimental Errors
- Erroneous Procedures
- Measurement Errors
- Wrong Data Entry
- Component Failures

## What is it?

The problem, as always, is what the heck does one mean by "outlier" in these contexts. Seems to be like pornography -- "I know it when I see it." - Berton Gunter, R-help, December 2009

# Outlier Definition

An observation in a data set which appears to be inconsistent with the remainder of that set of data

## Median Filter
$$ y[m] = median\\{ x[m-\frac{w}{2}, m+\frac{w}{2}] \\} $$
where $w$ is the window size whose median will replace the original data point.
- Not linear
- Does not smooth the signal. (Does not blur the images)
- Robust to outliers!
## Toy Example

- Window size: $w$ = 3
- Extend left-most and right-most values to compute boundaries

## Signal with Outliers

## Median Filtered Signal

## Know Your Tools

# Fast Fourier Transform (FFT)

## Noise Removal

Not necessarily distant than main signal
Not necessarily random
May have dependency to the original signal
## Outlier Detection

Distant observations than main signal
## DFT(Discrete Fourier Transform) Definition
$N$ point FFT of $x[n]$ sequence
$$X[k] = \sum_{n=0}^{N-1} x[n] e^{\frac{-j 2 \pi n k}{N}}$$
where $w$(frequency) is sampled by $\frac{2\pi k}{N}$ for $k = 0,1, \ldots, N-1$.
## IDFT(Inverse DFT) Definition
$$ x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{\frac{j 2\pi n k}{N}} $$
for $n=0, 1, \ldots, N-1$
# FFT Block

# Euler's Formula
## $$ e^{jx} = cos x + j sin x $$
## Fourier Expansion

## Fourier Expansion Square Wave

## Fourier Transform Visualization

## FFT for Noise Removal

## Frequency Response

## Frequency Response of Filtered Signal

## Outliers detected by FFT

## Outlier Detection via Monte Carlo Markov Chain

- Monte Carlo Sampling: To estimate characteristics of a distribution
- Markov Chain: Stochastic process to sample different _states_ from a stationay distribution
- MCMC: Design a Markov Chain in such a way that the stationary distribution is the distribution that we want to estimate
*(target distribution)*

## Metropolis-Hastings Sampler
We want to sample from $p(x)$, creating the following Markov Chain:
$$x^{(1)} \rightarrow x^{(2)} \rightarrow \ldots \rightarrow x^{(t)} \rightarrow \ldots $$
where $t$ is the iteration number in Markov Chain
- Initialize $x^{(1)}$
- Use proposal distribution $q(x_s|x_t)$ to sample candidate value $x_s$ given the current value of $x_t$
- Assuming the transition matrix is symmetric, compute $a_{prob}(x_t, x_s) = min [1, \frac{p(x_s)}{p(x_t)}]$
- Based on acceptance criterion($a_{prob}$), accept or reject the observation.
## Basic Metropolis Sampler

```
n_iterations = 100000
burn_in = 20000
t = np.zeros(n_iterations)
ii = 0
sigma = 1.
t_previous = np.random.randn()
posterior_distribution = lambda x: np.exp(-x * x/2.)
proposal_distribution = lambda mu, sigma: sigma * np.random.randn() + mu
while ii < n_iterations:
t_proposed = proposal_distribution(t_previous, sigma)
proposed_posterior = posterior_distribution(t_proposed)
previous_posterior = posterior_distribution(t_previous)
acceptance_criterion = np.min([proposed_posterior / previous_posterior, 1.0])
temp = np.random.rand()
if temp < acceptance_criterion : # acceptance criteria
t[ii]=t_proposed
t_previous=t_proposed
else:
t[ii]=t_previous
ii += 1
```

## This is what it looks like in PyMC

## Original Signal and Signal with Outliers

## Box Plot of Outlier and Original Signal

## Why this is important?

## Outliers are captured via MCMC

### Advantages

- As long as the model is nicely captured, the priors do not have to be known beforehand. The data leads the way.
- If priors are known, they could be incorporated into the model whereas other models cannot use prior information.
- Probabilistic, you could reason about probabilities whereas frequentist approaches do not provide a similar functionality like Bayesain methods.
- One can build complex models in Bayesian setting whereas this is not possible for Frequentist methods as you depend on the method rather than models.

### Disadvantages

- You need to design priors, models and inference schema beforehand whereas you do not need to do these things in frequentist methods.
- Generally both computation and algorithm-wise, frequentist approaches tend to be faster and simpler.

## Credits

- http://xkcd.com/
- http://1ucasvb.tumblr.com/
### Source of the presentation: https://github.com/bugra/pydata-sv-2014