COLOR_PALETTE = [ "#348ABD", "#A60628", "#7A68A6", "#467821", "#CF4457", "#188487", "#E24A33" ]
import numpy as np import scipy import matplotlib.pyplot as plt import random
Before outlier detection, let's first look at the filtering in frequency domain using FFT. I gave a brief introduction to FFT in here. By doing so, if we have a fairly regular signal(maybe similar to periodic), we could easily remove the noise(or in a sense outliers as they are not similar to the original signal). In order to do so, I will use FFT(Fast Fourier Transform) to reject specific frequency band(known as band-pass filtering) to get the original signal. The difference between using a Moving Average or FIR filter with this approach is that the original signal is preserved whereas Moving Average Filter smooths the signal. So, we are cherry-picking the signal which is coherent and reject the others(noise and outliers) in the frequency domain. I will use a simple sinusoid signal and put some gaussian noise. Gaussian and white noise are not the same thing, though.
# Signal Parameters number_of_samples = 1000 frequency_of_signal = 5 sample_time = 0.001 amplitude = 1 # Noise Parameters mu = 0 sigma = 1 signal = [amplitude * np.sin((2 * np.pi) * frequency_of_signal * ii * sample_time) for ii in range(number_of_samples)] s_time = [ii * sample_time for ii in range(number_of_samples)] noise = [random.gauss(mu, sigma) for _ in range(number_of_samples)] signal_with_noise = [ii + jj for ii, jj in zip(signal, noise)]
plt.figure(figsize=(12, 6)); plt.plot(signal); plt.title('Original Signal');
Clean, nice a sinusoidal signal. Let's look at the signal with noise.
plt.figure(figsize=(12, 6)); plt.plot(signal_with_noise); plt.title('Signal with Noise');
Looks so messy! One can also increase the sigma or mu of Gaussian noise as FFT does not necessarily care about the amplitude or the variance of the signal but the frequency of the signal. So, as long as the signal has a specific frequency range which is somehow different than the original signal. Then, it can be filtered out quite effectively using FFT.
fft_of_signal_with_noise = np.fft.fft(signal_with_noise) f = np.fft.fftfreq(len(fft_of_signal_with_noise),sample_time) def bandpass_filter(x, freq, frequency_of_signal=frequency_of_signal, band = 0.05): if (frequency_of_signal - band) < abs(freq) < (frequency_of_signal + band): return x else: return 0 F_filtered = np.asanyarray([bandpass_filter(x,freq) for x,freq in zip(fft_of_signal_with_noise, f)]); filtered_signal = np.fft.ifft(F_filtered);
We create our bandpass filter in order to accept a specific frequency range and reject others. Then, after rejecting frequencies, we reconsruct our signal using Inverse Fast Fourier Transform(IFFT). That is pretty much it.
plt.figure(figsize=(12, 6)); plt.semilogy(f, abs(fft_of_signal_with_noise), 'o', c=COLOR_PALETTE); plt.title('Frequency response of Signal with noise');
This plot shows two important things. First, FFT of the signal is symmetric. Why is that? Because any real signal has symmetric frequency response. Converse is also true. Any symmetric signal should have real frequency response. Second, frequency response of the original signal at 5 Hz, which are quite high(amplitude-wise). What about other frequency responses? They are all frequency responses of noise. Generally speaking, noise behaves like this in frequency domain just like that. All around scattered frequency components, not coherent, not consistent, just noise.
figure = plt.figure(figsize=(16, 16)); plt.subplot(4,1,1); plt.plot(s_time, signal, COLOR_PALETTE); plt.ylabel('Original Signal'); plt.subplot(4,1,2); plt.plot(s_time, signal_with_noise, COLOR_PALETTE); plt.ylabel('Signal with Noise'); plt.subplot(4,1,3); plt.plot(s_time, filtered_signal, COLOR_PALETTE); plt.ylabel('Filtered Signal'); plt.ylim([-amplitude, amplitude]); plt.subplot(4,1,4); plt.semilogy(f, abs(F_filtered), 'o'); # Frequency response will be symmetric, no need to get the negative component plt.title('Frequency Spectrum of Filtered Signal') plt.xlim([0, 50]); plt.xlabel('frequency(Hz)');
Filtered signal and original signal looks alike to me. If we also investigate the frequency response of the filtered signal, we could actually observe that it has one frequency response which happens to be at 5 Hz which is also the original signal. Since we are only bandpass filtering the signal, we preserved the frequency of the original signal in the sense that we did not change or smooth while removing the noise.
So, let's use the bandpass filters to remove all types of outliers. This is not possible due to three reasons:
- Original signal may not be periodic, may have varying frequencies across time. Therefore, the frequency range that we try to preserve may not be very clear or signal may not be band-limited signal.
- Outliers may have similar frequency responses with the original signal.
- We may want to detect the position of the outlier. (Actually, this is not quite valid reason, we could go back to time domain from frequency domain, but assume we are only in frequency domain)
However, if our signal is band-limited signal and has different frequency response from the outliers and noise, band-pass filtering of a signal is an important tool to eliminate outliers or noise.
FFT is a powerful tool for signal processing but also has a disadvantage. It removes the time information of the signal completely when we take FFT of the signal. Therefore, if the outlier or noise has similar frequency responses but can be discriminated from their behavior across time. Then, FFT approach would not work.
a = 1 x = np.arange(1,50,.5) y = np.sin(-1/x) * np.sin(x) y_with_outlier = np.copy(y) for ii in np.arange(len(x)/10, len(x), len(x)/10.): y_with_outlier[ii]= 4*(random.random()-.5) + y[ii]
Let's construct our original signal in the form of a damped signal. It varies first, then stabilizes. This signal may not be your average signal but it shows change over time. Definitely not periodic. The change itself also changes over time. First, the function increases and desreases drastically and then the change slows down.
plt.figure(figsize=(12, 6)); plt.scatter(x, y, c=COLOR_PALETTE[-1]); plt.xlim([0, 50]); plt.title('Original (Smooth) Signal');
plt.figure(figsize=(12, 6)); plt.scatter(x, y_with_outlier, c=COLOR_PALETTE[-1]); plt.xlim([0, 50]); plt.title('Signal with Outliers');
Nice, we get a signal with outliers. Very good start. Note that, we add the noise to the signal's original value as it is likely to have the outlier values close to the original signal value.
Median Filtering Approach¶
Using a moving average(read as mean filtering) does not work for outliers because outlier introduces a drastic change in the mean of the signal. If you try to smooth the signal, then you end up with a signal that you wish you did not filter the signal in the first place. Then, median filtering comes to rescue. It is robust to noise and outliers. What more do we want?
def get_median_filtered(signal, threshold = 3): """ signal: is numpy array-like returns: signal, numpy array """ difference = np.abs(signal - np.median(signal)) median_difference = np.median(difference) s = 0 if median_difference == 0 else difference / float(median_difference) mask = s > threshold signal[mask] = np.median(signal) return signal
plt.figure(figsize=(12, 6)) window_size = 20 outlier_s = y_with_outlier.tolist() median_filtered_signal =  for ii in range(0, y_with_outlier.size, window_size): median_filtered_signal += get_median_filtered(np.asanyarray(outlier_s[ii: ii+20])).tolist() plt.subplot(2,1,1); plt.scatter(range(len(median_filtered_signal)), median_filtered_signal, c=COLOR_PALETTE[-1]) plt.ylim([-1.5, 1.5]) plt.xlim([0, 100]) plt.title('Median Filtered Signal') plt.subplot(2,1,2); plt.scatter(range(len(y)), y, c=COLOR_PALETTE[-1]) plt.ylim([-1, 1]) plt.xlim([0, 100]) plt.title('Original Signal')
<matplotlib.text.Text at 0x11a6ece10>
Without affecting the original signal, we removed all of the outliers, yet introduce new ones. That is the shortcoming of the method, since change is a lot in a small interval(around peaks), this method introduces medians even if there is no outliers. The unwanted signal is closer to original signal, so there is still improvement. Further, this is easy to prevent with another heuristics. We will only look at the signal that behaves regularly or using a sliding window to have a more robust approach. But this example is important to show how a non-linear filter is able to reject all of the outliers without doing anything very complicated. This has due to mainly two reasons. First one is that $l_1$ norm. Instead of using $l_2$, we use $l_1$ as this norm does not change too much when an outlier gets introduced to the signal whereas $l_2$ norm(since it takes the square of the distance) outliers have much more profound effect. We are preventing using $l_1$ norm. Second one is that we are filtering adaptively. Instead of using a mean filter, and average all of the numbers, we only look at the candidates that could be outliers. First, we get convinced those points are outlier, then we attempt to remove them. This makes it hard to reject the regular signal. What about disadvantages? First and foremost, it is nonlinear filtering. There is no going back. You cannot get the original signal that you begin with. Further, we are also applying a threshold which makes it even harder. There are two important parameters, threshold and sliding window which requires some parameter tuning. But generally outliers are quite either signal specific or application specific. Therefore, this may not be a disadvantage.
FFT(Fast Fourier Transform) Approach¶
So, we filter the noise with FFT. What about outliers? We could do it that as well. FFT to rule them all.
The procedure is coarsely like this: we look at the frequency response of the signal and observe if there is any high frequency component beyond some threshold. If there is, we return the outlier position.
def detect_outlier_position_by_fft(signal, threshold_freq=.1, frequency_amplitude=.01): fft_of_signal = np.fft.fft(signal) outlier = np.max(signal) if abs(np.max(signal)) > abs(np.min(signal)) else np.min(signal) if np.any(np.abs(fft_of_signal[threshold_freq:]) > frequency_amplitude): index_of_outlier = np.where(signal == outlier) return index_of_outlier else: return None outlier_positions =  for ii in range(10, y_with_outlier.size, 5): outlier_position = detect_outlier_position_by_fft(y_with_outlier[ii-5:ii+5]) if outlier_position is not None: outlier_positions.append(ii + outlier_position - 5) outlier_positions = list(set(outlier_positions))
plt.figure(figsize=(12, 6)); plt.scatter(range(y_with_outlier.size), y_with_outlier, c=COLOR_PALETTE, label='Original Signal'); plt.scatter(outlier_positions, y_with_outlier[np.asanyarray(outlier_positions)], c=COLOR_PALETTE[-1], label='Outliers'); plt.legend();
Not bad huh? Actually, we captured all of the outliers! Yet, we have again some parameters to tune(frequency amplitude, frequency threshold, and sliding window size). Well, outlier is ill-defined. Therefore, I think one should expect tweaking parameters to get their outliers.
Do you know any other methods?¶
When I was presented this particular problem last week, these two methods that came to my mind immmediately. What about you? Do you know any other methods that are more effective, scale better, or Bayesian? Please, leave a comment.