Machine Learning Newsletter

Simple Bayesian Network via Markov Chain Monte Carlo

After I put some material on the blog about Markov Chain Monte Carlo, I got some emails asking how to apply MCMC to Bayesian Networks. Keeping the DRY and KISS principles in mind, here is my attempt to explain one of the simplest Bayesian Networks, Sprinkler, via MCMC using PyMC.

Bayesian Networks

Bayesian Networks, also known as Belief Networks, are a particular subset of probabilistic graphical models (directed acyclic graphs). Graphical models are useful in a number of ways as they combine graph theory, machine learning, and statistics, and they provide a good representation for modeling the probabilistic structure between random variables. Nodes represent random variables and edges represent probabilistic dependency, namely the conditional probability between two variables (given the parent variable, the probability of the child variable). The edges matter because they are directed: they show a direct relation between two random variables, which makes it easy to reason about the network structure and also provides a reasonably simple way to compute the joint probability of random variables if we know the prior probability of the parent variable, as the sketch below shows. The structure also implicitly encodes that two variables which are nondescendants of each other are independent; the influence of a variable is therefore limited to its descendants. Similarly, if a node (random variable) has no parent nodes, its probability is called unconditional as it is not influenced by any other variable, whereas the probabilities of nodes that do have parents are, as you might guess, called conditional.
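Here is a minimal sketch of that computation in plain Python (no PyMC yet), using the conditional probability tables of the sprinkler example introduced later in this post; P_rain, P_sprinkler, P_grass_wet, and joint are hypothetical names used for illustration only:

P_rain = 0.2                            # unconditional: rain has no parents
P_sprinkler = {True: 0.01, False: 0.4}  # P(sprinkler | rain)
P_grass_wet = {(True, True): 0.99,      # P(grass wet | sprinkler, rain)
               (True, False): 0.9,
               (False, True): 0.8,
               (False, False): 0.0}

def joint(rain, sprinkler, grass_wet):
    """P(rain, sprinkler, grass wet) via the chain rule over the graph."""
    p = P_rain if rain else 1 - P_rain
    p_s = P_sprinkler[rain]
    p *= p_s if sprinkler else 1 - p_s
    p_w = P_grass_wet[(sprinkler, rain)]
    return p * (p_w if grass_wet else 1 - p_w)

print(joint(True, False, True))  # 0.2 * 0.99 * 0.8 = 0.1584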

Note that conditionality does not necessarily imply a causal relationship between two random variables. For a Bayesian Network to be a causal network, the relations between the random variables need to be explicitly causal in order to infer any causal structure from the network.

Formal Definition

A Bayesian Network $N$ is a Directed Acyclic Graph together with a set of parameters, which defines a Joint Probability Distribution over random variables:

$$N = \langle G, \Theta \rangle$$

where $G$ is the graph, defined over the random variables $X_1, X_2, \ldots, X_K$, and $\Theta$ is the set of conditional probability parameters. The Joint Probability Distribution can then be written as follows:

$$ P_N(X_1, X_2, \ldots, X_K) = \prod_{i=1}^{K} \theta_{X_i|\pi_i}$$

where $\theta_{X_i|\pi_i} = P_N(X_i|\pi_i)$ and $\pi_i$ denotes the parents of $X_i$ in the graph.
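As a concrete instance, the sprinkler network used in the rest of this post (rain → sprinkler, rain → grass wet, sprinkler → grass wet) factorizes as:

$$P(R, S, W) = P(R) P(S|R) P(W|R, S)$$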

Let's look at what this looks like when implemented in PyMC.

In [1]:
%matplotlib inline
In [2]:
import daft
import pymc
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import ggplot
import pandas as pd

COLORS = ["#348ABD", "#A60628", "#7A68A6"]
In [4]:
pgm = daft.PGM([9, 4], origin=[1, 0.5])
pgm.add_node(daft.Node('r', 'rain', 3, 4, aspect=2))
pgm.add_node(daft.Node('s', 'sprinkler', 9, 4, aspect=3))
pgm.add_node(daft.Node('w', 'grass wet', 6, 2, aspect=4, observed=True))
pgm.add_edge('r', 's')
pgm.add_edge('r', 'w')
pgm.add_edge('s', 'w')
pgm.render()
#pgm.figure.savefig("sprinkler.png", dpi=500)
Out[4]:
<matplotlib.axes.Axes at 0x10b6f8c90>

In this graph, Rain is the unconditional variable as it does not have any parent variable. Sprinkler is a conditional variable as it is influenced by Rain. Grass Wet is the observed variable. If you are doing anything with graphical models, draw the model first, either on a piece of paper or with tikz/daft as above. It makes it much easier to reason about the network and to see the relations at a glance; there is a reason why people prefer the graphical-model representation over raw equations.

Models

In [5]:
from IPython.display import Image
Image(url='http://upload.wikimedia.org/wikipedia/commons/0/0e/SimpleBayesNet.svg')
Out[5]:

Default values are taken from the example provided above (taken from Wikipedia). For reference, the figure has $P(R) = 0.2$; $P(S|R) = 0.01$ and $P(S|\neg R) = 0.4$; and $P(W|S,R) = 0.99$, $P(W|S,\neg R) = 0.9$, $P(W|\neg S,R) = 0.8$, $P(W|\neg S,\neg R) = 0.0$. Note that the code below uses $0.2$ rather than $0.0$ for the last entry, which matters for the discussion at the end.

In [21]:
# Initialization
observed_values = [1.]  # we observe the grass to be wet

# Prior: P(rain) = .2
rain = pymc.Bernoulli('rain', .2, value=np.ones(len(observed_values)))

# P(sprinkler | rain): .01 if it rains, .4 otherwise
p_sprinkler = pymc.Lambda('p_sprinkler', lambda rain=rain: np.where(rain, .01, .4))

# "Real" sprinkler variable
sprinkler = pymc.Bernoulli('sprinkler', p_sprinkler, value=np.ones(len(observed_values)))

# P(grass wet | sprinkler, rain): .99/.9 when the sprinkler is on, .8/.2 when it is off
# (note: .2 here where the Wikipedia figure has 0; see the discussion below)
p_grass_wet = pymc.Lambda('p_grass_wet', lambda sprinkler=sprinkler, rain=rain: np.where(sprinkler, np.where(rain, .99, .9),
                                                                                         np.where(rain, .8, .2)))
# Observed node, fixed to the evidence
grass_wet = pymc.Bernoulli('grass_wet', p_grass_wet, value=observed_values, observed=True)

model = pymc.Model([grass_wet, p_grass_wet, sprinkler, p_sprinkler, rain])

In p_sprinkler, np.where checks rain and, if it is True, assigns the small probability .01; otherwise it assigns .4, both taken from the graphical model retrieved from Wikipedia. The reason we wrap the function into a Lambda is that we want to trace this variable during inference, and PyMC does not allow a non-PyMC variable inside the model; we therefore need to wrap it into a PyMC variable using Lambda. The same holds for p_grass_wet.
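For what it is worth, PyMC's deterministic decorator offers an equivalent way to wrap such a function. A minimal sketch (the Lambda form above is what the model actually uses; p_sprinkler_alt is a hypothetical name):

# Equivalent to the pymc.Lambda above, written with the decorator instead
@pymc.deterministic
def p_sprinkler_alt(rain=rain):
    # .01 if it rained, .4 otherwise, elementwise over the value array
    return np.where(rain, .01, .4)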

Inference

In [52]:
mcmc = pymc.MCMC(model)
mcmc.sample(10000, 2000)  # 10,000 iterations, first 2,000 discarded as burn-in
 [-----------------100%-----------------] 10000 of 10000 complete in 4.9 sec
In [53]:
trace_r = mcmc.trace('rain')[:]
trace_p_sprinkler = mcmc.trace('p_sprinkler')[:]
trace_sprinkler = mcmc.trace('sprinkler')[:]
trace_p_grass_wet = mcmc.trace('p_grass_wet')[:]
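Since grass_wet is observed, every sample in these traces is already drawn from the posterior given wet grass. A quick sanity check (a sketch; the exact numbers will vary from run to run) is to look at the raw trace means, which estimate the posterior marginals under the model as specified above:

# Each trace has shape (n_samples, 1): one column per observed value
print('P(rain | grass wet) ~', trace_r.mean())
print('P(sprinkler | grass wet) ~', trace_sprinkler.mean())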

Convergence Diagnostics

In [54]:
pymc.Matplot.plot(mcmc)
Plotting p_grass_wet_0
Plotting sprinkler_0
Plotting p_sprinkler_0
Plotting rain_0
In [58]:
geweke = pymc.geweke(mcmc)
pymc.Matplot.geweke_plot(geweke)

For an MCMC procedure to have converged, the Geweke z-scores should fall within 2 standard deviations of zero; here all of the scores do, which suggests the chains have converged.
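The same check can be done numerically rather than visually. A minimal sketch, assuming the default Geweke parameters and converting the boolean rain trace to floats:

# Count the z-scores of the rain trace that fall outside +/- 2 standard deviations
scores = np.array(pymc.geweke(trace_r[:, 0].astype(float)))
print((np.abs(scores[:, 1]) > 2).sum(), 'of', len(scores), 'z-scores outside 2 sd')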

In [59]:
dictionary = {
              'Rain': [1 if ii[0] else 0 for ii in trace_r.tolist()],
              'Sprinkler': [1 if ii[0] else 0 for ii in trace_sprinkler.tolist()],
              'Sprinkler Probability': [ii[0] for ii in trace_p_sprinkler.tolist()],
              # Note: this column holds p_grass_wet, the probability, not a 0/1 draw
              'Grass Wet': [ii[0] for ii in trace_p_grass_wet.tolist()],
              }
df = pd.DataFrame(dictionary)
df.head()
Out[59]:
   Grass Wet  Rain  Sprinkler  Sprinkler Probability
0        0.8     1          0                   0.01
1        0.2     0          0                   0.40
2        0.9     0          1                   0.40
3        0.9     0          1                   0.40
4        0.9     0          1                   0.40

5 rows × 4 columns

Given that the grass is wet, what is the probability that it rained?

In [82]:
# Among the samples where the grass is likely wet, count those where it rained
p_rain_wet = float(df[(df['Rain'] == 1) & (df['Grass Wet'] > 0.5)].shape[0]) / df[df['Grass Wet'] > 0.5].shape[0]
print(p_rain_wet)
0.345809971208

Compare this number with the one on the Wikipedia page, where $P(R|W) = \frac{P(R,W)}{P(W)}$ is computed from the joint probabilities; here we estimate it from the simulations instead.
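For reference, a small sketch that computes the exact value by enumerating the joint distribution with the Wikipedia tables (where $P(W|\neg S, \neg R) = 0$); it lands near the simulated estimate above:

# Enumerate P(R, S, W=1) over all (rain, sprinkler) states and normalize
p_w = {(True, True): .99, (True, False): .9, (False, True): .8, (False, False): .0}
p_s = {True: .01, False: .4}
num = den = 0.0
for r in (True, False):
    for s in (True, False):
        p = (.2 if r else .8) * (p_s[r] if s else 1 - p_s[r]) * p_w[(s, r)]
        den += p
        if r:
            num += p
print(num / den)  # ~0.3577, the number on the Wikipedia page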

Given that the grass is wet, what is the probability that the sprinkler was on?

In [71]:
# Sprinkler == 1 implies p_grass_wet > .5, so no extra filter is needed in the numerator
p_sprinkler_wet = float(df[df['Sprinkler'] == 1].shape[0]) / df[df['Grass Wet'] > 0.5].shape[0]
print(p_sprinkler_wet)
0.658887710259

So, is everything perfect? Not really. If the sprinkler did not work and it did not rain, the grass cannot be wet. However, when we look at the probability of that event, it is not zero. This is because the model above assigns $P(W|\neg S, \neg R) = 0.2$ rather than the $0.0$ of the Wikipedia figure, so such samples still show up in the posterior.

In [80]:
# Samples with neither sprinkler nor rain, relative to samples where the grass is likely wet
p_not_sprinkler_rain_wet = float(df[(df['Sprinkler'] == 0) & (df['Rain'] == 0)].shape[0]) / df[df['Grass Wet'] > 0.5].shape[0]
print(p_not_sprinkler_rain_wet)
0.212304894681

Takeaways

  • Graphical models make it much easier to reason about these kinds of networks; you do not have to work the conditional dependencies out of equations, as they are immediate from the graph.
  • You can marginalize out nodes once you account for their "influence" on other variables, and you can get much smaller networks by marginalizing out the variables whose dependencies (read: conditional probabilities) you already know; see the sketch after this list.
  • If you have new evidence, you can create a new branch and update only that part of the network. Inference does not have to update all of the variables: the random variables that are independent of the evidence do not change, only the evidence's descendants do.
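A minimal sketch of the second point, marginalizing the sprinkler out of the model above to obtain $P(W|R)$ directly (p_wet_given_rain is a hypothetical helper):

# P(W | R) = sum over sprinkler states of P(S | R) * P(W | S, R)
def p_wet_given_rain(rain):
    p_s = .01 if rain else .4       # P(S | R)
    p_w_on = .99 if rain else .9    # P(W | S, R)
    p_w_off = .8 if rain else .2    # P(W | ~S, R)
    return p_s * p_w_on + (1 - p_s) * p_w_off

print(p_wet_given_rain(True), p_wet_given_rain(False))  # ~0.8019, 0.48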