Machine Learning Newsletter

Law of Large Numbers and Central Limit Theorem

If one had to list the most important and useful theorems in probability theory, the Law of Large Numbers and the Central Limit Theorem would have to be on it. Yet if you asked me which two theorems are the most under-appreciated, I would count these same two as well. Most textbooks present the probability axioms in their first chapters to provide a common ground for building probabilistic intuition, yet they wait until much later chapters to introduce these two very useful theorems.

The Central Limit Theorem (CLT) is one of the most astonishing of these theorems, with applications in statistics, information theory, and signal processing, even if it is a little counterintuitive. That does not diminish its value and contribution, though. Understanding the CLT is crucial for seeing why the Gaussian distribution is so ubiquitous in signal processing and a number of other fields.

The Law of Large Numbers, by contrast, is easier to understand and reason about.
In this post, I will introduce the LLN and the CLT, and provide experimental results showing that the CLT actually holds for a number of different distributions, using numpy's built-in samplers.

Law of Large Numbers

Law of Large Numbers states that:

the sample average $$ \overline{X_n} = \frac{1}{n}(X_1 + X_2 + \ldots + X_n) $$ converges to the expected value as the sample size goes to infinity:

$ \overline{X_n} \rightarrow \mu $ as $ n \rightarrow \infty $

More formally, assume that $ X_1, X_2, \ldots, X_n $ are independent and identically distributed random variables with mean $\mu$. Let $\overline{X_n}$ denote the average of the $n$ variables. Then, for any $\epsilon > 0$, the following holds: $$ \lim_{n \rightarrow \infty} P(\lvert \overline{X_n} - \mu \rvert \le \epsilon) = 1 $$

Intuitively, this makes perfect sense. When I flip a fair coin, the mean of the flips (1's and 0's) gets closer to 0.5 as the number of trials increases. For a small number of trials, the observations may not match the expected value at all (in the edge case of a single observation of 1, the difference is actually the largest). The whole frequentist approach, and even the way we define and reason about the probability of an event, is connected to the Law of Large Numbers. When someone takes a coin and represents the physical events of getting heads or tails as random variables, she does not actually know whether 10 heads will occur in 20 trials. What she does know is that, for a large number of trials of a fair coin, the counts of heads and tails will be equal to each other, or at least very close.
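As a tiny illustration, here is a minimal sketch assuming only numpy (the number of flips and the checkpoints below are arbitrary choices of mine):

import numpy as np

flips = np.random.randint(0, 2, size=100000)   # fair coin: 1 = heads, 0 = tails

# The running average drifts toward 0.5 as more flips accumulate
for n in (10, 100, 1000, 10000, 100000):
    print('{:>6d} flips -> mean = {:.4f}'.format(n, flips[:n].mean()))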

In a nutshell, average of independent (many) samples will converge to the mean of the underlying distribution that the observations are sampled from.

In order to show that the LLN holds, one can plot the average of the samples against the sample size. As the sample size increases, the average converges to the mean of the distribution. For a Poisson variable, the mean of the distribution versus the average of the samples can be seen in the next graph.

In [17]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Pretty colors
COLOR_PALETTE = [
    "#348ABD",
    "#A60628",
    "#7A68A6",
    "#467821",
    "#CF4457",
    "#188487",
    "#E24A33",
]
In [18]:
def plot_average_of_observations_vs_mean(func=np.random.poisson, sample_size=1000000, mean_=10):
    plt.figure(figsize=(12, 12))
    trial_template = 'Trial: {}'.format
    checkpoints = range(1, sample_size, 1000)
    for ii in range(len(COLOR_PALETTE)//2 + 1):
        samples = func(mean_, size=sample_size)
        # Running average of the first jj samples, evaluated every 1000 samples
        average = [samples[:jj].mean() for jj in checkpoints]
        plt.plot(checkpoints, average, lw=1.5, label=trial_template(ii + 1), c=COLOR_PALETTE[ii])

    plt.title('Mean of Samples vs Sample Size');
    plt.ylabel('Mean of Samples');
    plt.xlabel('Sample Size');
    plt.ylim([mean_ - 0.5, mean_ + 0.5]);
    plt.legend();
plot_average_of_observations_vs_mean()

In the above graph, we can see that the mean of the samples is quite different from the true mean for a small number of samples. As the sample size grows, the average of the samples converges to the mean of the underlying distribution, which is 10. This holds true for other probability distributions as well.

In [19]:
plot_average_of_observations_vs_mean(func=np.random.chisquare)

That looks great; our intuition seems to be correct for both Poisson and chi-square random variables. Let's look at the CLT.

Central Limit Theorem

If I use the same notation that I used earlier for LLN, then the equation for CLT can be written as follows:

$$ \frac{\sqrt{n}}{\sigma}(\overline{X_n} - \mu ) \rightarrow \mathcal{N}(0, 1) $$

as $$ n \rightarrow \infty $$

As the sample size goes to infinity, the distribution of the sample mean converges to a normal distribution. This notion of convergence in distribution is important: unlike the convergence of the mean, which is a single number, here an entire distribution converges to another distribution. In other words:

The distribution of means of many trials will be Gaussian, even if distributions of each trial are not Gaussian.

In a nutshell, the central limit theorem says the following about the sampling distribution of the mean, for any population with well-defined mean and variance (a quick numerical check of points 1 and 2 follows this list):

  1. The mean of the sampling distribution of means equals the mean of the population from which the samples were drawn. Note that the population itself may not look anything like a Gaussian distribution.
  2. The variance of the sampling distribution of means equals the variance of the population divided by the number of samples, $\sigma^2 / n$.
  3. Even if the population does not look remotely Gaussian, the sampling distribution of means will approximate a Gaussian. (Hard to believe, but the sampling distribution of the mean will be Gaussian for a large enough sample size.)
  4. As you average over more observations, the sampling distribution of the mean only grows more similar to a Gaussian. As the sample size goes to infinity, it becomes a Gaussian distribution (the error term goes to zero).
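Here is a minimal numerical check of points 1 and 2, reusing the numpy import from the cells above (the Poisson parameter, the sample size, and the trial count are arbitrary choices of mine):

lam, n, n_trials = 5.0, 30, 100000            # Poisson(lam) has mean = variance = lam
samples = np.random.poisson(lam, size=(n_trials, n))
means = samples.mean(axis=1)                  # one sample mean per trial

print('mean of sample means:', means.mean())      # close to lam     (point 1)
print('variance of sample means:', means.var())   # close to lam / n (point 2)
print('population variance / n:', lam / n)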

Noise vs. Gaussian Noise

The CLT is also the main reason why noise is generally assumed to be Gaussian. If the process we are dealing with involves random variables sampled from different distributions, each carrying its own noise on top of its signal, then as the number of observations increases, the total error (the sum of all the random variables' noise) starts to look Gaussian. Even though we have no idea how many random variables make up the observed signal or which distributions they come from, the Gaussian noise assumption is quite good thanks to the CLT. This is one of the main reasons why the Gaussian distribution is so ubiquitous in signal processing for modeling noise.
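A small sketch of this idea, reusing the imports and color palette defined above (the choice of noise sources and their parameters is mine and arbitrary): summing a few decidedly non-Gaussian, zero-mean noise terms already yields something that looks close to Gaussian.

n = 100000
# Three non-Gaussian, zero-mean noise sources added together
noise = (np.random.uniform(-1, 1, n)
         + (np.random.exponential(1.0, n) - 1.0)   # exponential has mean 1.0, so center it
         + np.random.laplace(0.0, 0.5, n))

plt.figure(figsize=(12, 12));
plt.hist(noise, bins=100, color=COLOR_PALETTE[2]);
plt.title('Sum of uniform + exponential + Laplace noise');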

When you subtract the mean from your data points and divide by their standard deviation, and your data happens to be Gaussian, what you end up with are samples from the right-hand term of the CLT equation above, $\mathcal{N}(0, 1)$.
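The left-hand side of the CLT equation can be checked the same way. Again a sketch with parameters of my own choosing, reusing the imports above: standardizing the sample means with the population $\mu$ and $\sigma$ should give something close to a standard normal.

lam, n, n_trials = 5.0, 50, 100000
mu, sigma = lam, np.sqrt(lam)                  # Poisson(lam): mean lam, std sqrt(lam)

means = np.random.poisson(lam, size=(n_trials, n)).mean(axis=1)
z = np.sqrt(n) / sigma * (means - mu)          # left-hand side of the CLT equation

print('z mean:', z.mean())                     # close to 0
print('z std:', z.std())                       # close to 1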

If I look at a few different distributions and fit a Gaussian to each sampling distribution of the mean, I should see that as the number of samples per mean grows, the distributions start to look like a Gaussian. However astonishing, the theorem holds for a large enough number of samples.

In [20]:
# Each lambda draws `k` samples per trial and returns one sample mean per trial
gumbel = lambda k: np.random.gumbel(1.5, 3.0, [100000, k]).mean(1)
weibull = lambda k: np.random.weibull(5, [100000, k]).mean(1)
poisson = lambda k: np.random.poisson(5, [100000, k]).mean(1)
pareto = lambda k: np.random.pareto(5, [10000, k]).mean(1)

def plot_and_fit_gaussian_distribution(func, sample_size, bins=np.linspace(-5, 20, 100)):
    """ Plot the sampling distribution and fit a gaussian on top
    of the distribution to see if CLT holds true
    """
    samples = func(sample_size)
    mu = samples.mean()
    sigma = samples.std()
    a = samples.size * (bins[1] - bins[0])  # scale the Gaussian pdf to the histogram counts
    plt.hist(samples, bins, color=COLOR_PALETTE[2]);
    plt.plot(bins, a / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(bins - mu) ** 2 / (2 * sigma ** 2)), c=COLOR_PALETTE[0], ls='--')
    plt.grid(False);
    plt.yticks([]);

Pareto Distribution Experiments

In [22]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(pareto, 1, bins=np.linspace(-5, 20, 1000));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
plt.xlim([-2.5, 2.5]);
In [23]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(pareto, 5, bins=np.linspace(-5, 20, 1000));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
plt.xlim([-2.5, 2.5]);
In [24]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(pareto, 10, bins=np.linspace(-5, 20, 1000));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
plt.xlim([-2.5, 2.5]);
In [25]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(pareto, 100, bins=np.linspace(-5, 20, 1000));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
plt.xlim([-2.5, 2.5]);

Poisson Distribution Experiments

In [26]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(poisson, 1, bins=np.linspace(-5, 20, 100));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
In [27]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(poisson, 5, bins=np.linspace(-5, 20, 100));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
In [28]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(poisson, 10, bins=np.linspace(-5, 20, 100));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
In [29]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(poisson, 20, bins=np.linspace(-5, 20, 100));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
In [30]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(poisson, 100, bins=np.linspace(-5, 20, 100));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');

Gumbel Distribution

In [31]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(gumbel, 1, bins=np.linspace(-5, 20, 100));
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');

With only 1 sample per mean, we do not get much luck in terms of how much the distribution looks like a Gaussian, but the graph is useful to show where we started from.

In [32]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(gumbel, 2, bins=np.linspace(-5, 15, 100));
plt.yticks([]);
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
In [33]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(gumbel, 5, bins=np.linspace(-5, 15, 100));
plt.yticks([]);
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');

By the time we get to 20 samples, the distribution is almost impossible to distinguish from a Gaussian.

In [34]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(gumbel, 10, bins=np.linspace(-5, 15, 100));
plt.yticks([]);
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
In [35]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(gumbel, 20, bins=np.linspace(-5, 15, 100));
plt.yticks([]);
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');
In [36]:
plt.figure(figsize=(12, 12));
plot_and_fit_gaussian_distribution(gumbel, 50, bins=np.linspace(-5, 15, 100));
plt.yticks([]);
plt.xlabel(r'$x$');
plt.ylabel(r'$p(x)$');

As the sample size increases, the distribution of the sample mean eventually looks like a Gaussian distribution.

Takeaways

  • The Law of Large Numbers and the Central Limit Theorem are under-appreciated in statistics; both are elegant, and the CLT in particular is quite powerful.
  • The histogram of the number of heads after a large number of coin flips looks like a Gaussian distribution due to the CLT. Note that the coin does not have to be fair (see the sketch after this list).
  • The Gaussian distribution is popular and ubiquitous across different fields, such as statistics, machine learning, information theory, and signal processing, partially due to the CLT.
  • The LLN says that with a large sample size, the sample mean approaches the population mean; the CLT says that the sampling distribution of the mean approaches a Gaussian distribution.
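For the coin-flip bullet, here is a quick sketch reusing the imports and palette from the cells above (the bias of 0.7, the 100 flips per trial, and the trial count are arbitrary choices of mine):

heads = np.random.binomial(n=100, p=0.7, size=100000)   # biased coin, 100 flips per trial
plt.figure(figsize=(12, 12));
plt.hist(heads, bins=range(40, 101), color=COLOR_PALETTE[0]);
plt.title('Number of heads in 100 flips of a biased coin (p = 0.7)');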

Let's talk about LLN and CLT

Do you know any other applications of the LLN and CLT? Or do you just want to talk about how awesome the CLT is? Please leave a comment.
