Machine Learning Newsletter

Entropy and Perplexity on Image and Text

Entropy

Today I was looking at confidence intervals (the last thing that I remember, anyway), and one thing led to another on Wikipedia (which never happens!) until I landed on the Diversity Index page. If you are familiar with information theory or entropy, you will get the feeling that you know this from somewhere:

A diversity index is a quantitative measure that reflects how many different types (such as species) there are in a dataset, and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed among those types. The value of a diversity index increases both when the number of types increases and when evenness increases.

The last sentence seems to be somehow related to entropy, no? Wait, it gets better.

For diversity of order one, an alternative equation is $$ D^1 = \exp\left(-\displaystyle\sum_{i=1}^R p_i \ln p_i\right) $$

This is nothing but exponentiated entropy, where the entropy is defined as:

$$ H = -\displaystyle\sum_{i=1}^R p_i\ln p_i $$

The rest of the article refers to this as the Shannon entropy index. Apparently, information theory works its way into ecology quite successfully.
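To see why the exponential form is called an effective number of types, here is a toy check (my own made-up species distributions, not from the article): with four perfectly even types, $D^1$ comes out to exactly 4, and it shrinks as the distribution becomes less even.

In [ ]:
import numpy as np

# Toy species-abundance distributions (made up for illustration)
p_even = np.array([0.25, 0.25, 0.25, 0.25])  # 4 types, perfectly even
p_skew = np.array([0.85, 0.05, 0.05, 0.05])  # 4 types, very uneven

for p in (p_even, p_skew):
    H = -np.sum(p * np.log(p))  # Shannon entropy, natural log
    D1 = np.exp(H)              # diversity of order one
    print("H = %.3f, D^1 = %.3f" % (H, D1))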

Entropy measures the uncertainty in the outcome of a stochastic process: the more uncertainty the data has, the more entropy it has, and the reverse also holds. As a very simple example, a fair coin has the maximum entropy it could possibly have, 1 in base 2 (maximum uncertainty), whereas a coin with two identical faces has the minimum entropy (0), since we already know the outcome of the flip.
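A tiny check of those two coin cases (my own snippet, using base-2 logs so the units are bits):

In [ ]:
import numpy as np

def coin_entropy(p_heads):
    """Entropy, in bits, of a coin that lands heads with probability p_heads."""
    probs = np.array([p_heads, 1.0 - p_heads])
    probs = probs[probs > 0]  # treat 0 * log(0) as 0
    return -np.sum(probs * np.log2(probs))

print(coin_entropy(0.5))  # fair coin: 1.0 bit, maximum uncertainty
print(coin_entropy(1.0))  # two-headed coin: 0.0 bits, outcome already known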

In this post, I will look at the concepts of entropy and perplexity on text and images: what they are useful for and how we can use them, all in Python, as you can tell from the IPython notebook.

In [104]:
import io
import string
import urllib2

# 3rd Party
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
from scipy import ndimage
In [97]:
text = "When diversity indices are used in ecology, the types of interest are usually species, but they can also be other categories, such as genera, families, functional types or haplotypes. The entities of interest are usually individual plants or animals, and the measure of abundance can be, for example, number of individuals, biomass or coverage. In demography, the entities of interest can be people, and the types of interest various demographic groups. In information science, the entities can be characters and the types the different letters of the alphabet. The most commonly used diversity indices are simple transformations of the effective number of types (also known as 'true diversity'), but each diversity index can also be interpreted in its own right as a measure corresponding to some real phenomenon (but a different one for each diversity index)"
text = text.translate(string.maketrans("",""), string.punctuation)
words = text.split()
wordset = set(words)
In [100]:
freq={word: words.count(word) for word in wordset}

print "Word \t\t Count \t Self Information"
word_count_information = []
entropy = 0
for word in wordset:
    probability = freq[word] / float(len(words))
    self_information = np.log2(1.0/probability) 
    entropy += (probability * self_information)
    word_count_information.append([word, freq[word], self_information])

sorted_word_count_information = list(sorted(word_count_information, key=lambda k:k[2], reverse=True))

for ii in sorted_word_count_information:
    # Very inelegant way of formatting
    separation = '\t\t' if len(ii[0]) < 7 else '\t'
    if len(ii[0]) >= 15: separation = '' 
    print("%s %s %s \t %s"%(ii[0], separation, str(ii[1]), str(ii[2])))
print "\n\nEntropy of complete text: {}".format(entropy)
Word 		 Count 	 Self Information
ecology 	 1 	 7.08746284125
letters 	 1 	 7.08746284125
coverage 	 1 	 7.08746284125
people 		 1 	 7.08746284125
simple 		 1 	 7.08746284125
When 		 1 	 7.08746284125
functional 	 1 	 7.08746284125
interpreted 	 1 	 7.08746284125
individual 	 1 	 7.08746284125
right 		 1 	 7.08746284125
families 	 1 	 7.08746284125
groups 		 1 	 7.08746284125
species 	 1 	 7.08746284125
phenomenon 	 1 	 7.08746284125
information 	 1 	 7.08746284125
transformations  1 	 7.08746284125
alphabet 	 1 	 7.08746284125
demographic 	 1 	 7.08746284125
abundance 	 1 	 7.08746284125
its 		 1 	 7.08746284125
other 		 1 	 7.08746284125
various 	 1 	 7.08746284125
real 		 1 	 7.08746284125
to 		 1 	 7.08746284125
characters 	 1 	 7.08746284125
some 		 1 	 7.08746284125
genera 		 1 	 7.08746284125
biomass 	 1 	 7.08746284125
they 		 1 	 7.08746284125
known 		 1 	 7.08746284125
such 		 1 	 7.08746284125
one 		 1 	 7.08746284125
true 		 1 	 7.08746284125
categories 	 1 	 7.08746284125
science 	 1 	 7.08746284125
plants 		 1 	 7.08746284125
animals 	 1 	 7.08746284125
effective 	 1 	 7.08746284125
most 		 1 	 7.08746284125
corresponding 	 1 	 7.08746284125
commonly 	 1 	 7.08746284125
example 	 1 	 7.08746284125
own 		 1 	 7.08746284125
individuals 	 1 	 7.08746284125
demography 	 1 	 7.08746284125
haplotypes 	 1 	 7.08746284125
measure 	 2 	 6.08746284125
in 		 2 	 6.08746284125
different 	 2 	 6.08746284125
for 		 2 	 6.08746284125
used 		 2 	 6.08746284125
index 		 2 	 6.08746284125
each 		 2 	 6.08746284125
number 		 2 	 6.08746284125
indices 	 2 	 6.08746284125
a 		 2 	 6.08746284125
The 		 2 	 6.08746284125
In 		 2 	 6.08746284125
usually 	 2 	 6.08746284125
and 		 3 	 5.50250034053
as 		 3 	 5.50250034053
entities 	 3 	 5.50250034053
also 		 3 	 5.50250034053
but 		 3 	 5.50250034053
or 		 3 	 5.50250034053
are 		 4 	 5.08746284125
interest 	 4 	 5.08746284125
types 		 5 	 4.76553474636
be 		 5 	 4.76553474636
diversity 	 5 	 4.76553474636
can 		 5 	 4.76553474636
of 		 9 	 3.91753783981
the 		 9 	 3.91753783981


Entropy of complete text: 5.80785595201

The frequent words contribute the least information, while the least frequent ones are the largest contributors. This is expected: the rare words are the most uncertain and therefore carry the most self-information, whereas frequent words like "the" and "of" are the least uncertain.
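Since the title promises perplexity too: perplexity is just exponentiated entropy, the same trick as the ecology article's diversity of order one, except in base 2 here because the entropy above was computed with log2. It can be read as the effective number of equally likely word types. A quick sketch, reusing the `entropy` variable from the cell above:

In [ ]:
# Perplexity = 2**H when H is in bits: the text behaves like a vocabulary of
# roughly 2**5.81, about 56 equally likely word types, under this unigram model.
perplexity = 2 ** entropy
print("Perplexity of complete text: {}".format(perplexity))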

In [144]:
def get_entropy(signal):
    """Shannon entropy (log base 2) of a 1-D signal."""
    probability_distribution = [np.size(signal[signal == i]) / (1.0 * signal.size) for i in set(signal)]
    entropy = np.sum([pp * np.log2(1.0 / pp) for pp in probability_distribution])
    return entropy
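A quick sanity check of `get_entropy` on synthetic signals (my own toy inputs): a constant signal should give 0 bits and a balanced binary signal should give 1 bit.

In [ ]:
print(get_entropy(np.array([7, 7, 7, 7])))        # constant signal -> 0.0
print(get_entropy(np.array([0, 1, 0, 1, 0, 1])))  # two equally likely values -> 1.0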
In [19]:
image_url = 'http://i.imgur.com/8vuLtqi.png' # Lena Image
fd = urllib2.urlopen(image_url)
image_file = io.BytesIO(fd.read())
img_color = Image.open(image_file)
img_grey = img_color.convert('L')
img_color = np.array(img_color)
img_grey = np.array(img_grey)
In [14]:
def get_entropy_of_image(img_grey, neighborhood=5):
    dim0, dim1 = img_grey.shape
    entropy = np.zeros(img_grey.shape)  # float output, so entropy values are not truncated to integers
    for row in range(dim0):
        for col in range(dim1):
            lower_x = np.max([0, col - neighborhood])
            upper_x = np.min([dim1, col + neighborhood])
            lower_y = np.max([0, row - neighborhood])
            upper_y = np.min([dim0, row + neighborhood])
            area = img_grey[lower_y: upper_y, lower_x: upper_x].flatten()
            entropy[row, col] = get_entropy(area)
    return entropy
# Get the entropy of the image
img_entropy = get_entropy_of_image(img_grey) 

On Image

If we want to look at entropy on an image, we can compute the entropy over local windows to measure the evenness or uncertainty of each region. Looking at the definition, one would predict that areas with a lot of variation result in higher entropy, and areas with little variation result in lower entropy.
For the image example, we will use the Lena image and look at its entropy.
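As a side note, the double loop in `get_entropy_of_image` is easy to follow but slow; the same windowed entropy can also be computed with `scipy.ndimage.generic_filter`, which applies `get_entropy` to every local neighborhood for us (edge handling differs slightly from the manual version). A sketch, not benchmarked here:

In [ ]:
# The 10x10 window roughly matches neighborhood=5 in get_entropy_of_image.
# Cast the image to float so the entropy values are kept as floats.
img_entropy_alt = ndimage.generic_filter(img_grey.astype(float), get_entropy, size=10)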

In [23]:
plt.figure(figsize=(12, 12));
plt.imshow(img_color);
plt.grid(False);
plt.xticks([]);
plt.yticks([]);