# Entropy and Perplexity on Image and Text


## Entropy

Today, I was looking at confidence intervals (the last thing that I remember, anyway), and one thing led to another on Wikipedia (which never happens!) until I landed on the Diversity Index page. If you are familiar with information theory or entropy, you will get the feeling that you know this from somewhere:

> A diversity index is a quantitative measure that reflects how many different types (such as species) there are in a dataset, and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed among those types. The value of a diversity index increases both when the number of types increases and when evenness increases.

The last sentence seems to be somehow related to entropy, no? Wait, it gets better.

For diversity of order one, an alternative equation is $$ D^1 = \exp\left(-\displaystyle\sum_{i=1}^R p_i\ln p_i\right) $$

This is nothing but exponentiated entropy, where the entropy is defined as:

$$ H = -\displaystyle\sum_{i=1}^R p_i\ln p_i $$

The rest of the article refers to this as the Shannon entropy index. Apparently, information theory finds its way into ecology quite successfully.

Entropy measures the uncertainty in the outcome of a stochastic process. The more uncertainty the data has, the higher its entropy, and vice versa. As a very simple example, a fair coin has the maximum entropy it could possibly have, which is 1 in `log2` base (maximum uncertainty). A two-headed coin has the minimum entropy (0), as we already know the outcome of flipping the coin.
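The coin example is easy to check numerically. A minimal sketch (written in Python 3, unlike the Python 2 code in the rest of the notebook) of the base-2 entropy of a coin with a given bias:

```python
import numpy as np

def coin_entropy(p_heads):
    """Shannon entropy (base 2) of a coin with P(heads) = p_heads."""
    probs = np.array([p_heads, 1.0 - p_heads])
    probs = probs[probs > 0]  # 0 * log(0) is taken as 0 by convention
    return float(np.sum(probs * np.log2(1.0 / probs)))

print(coin_entropy(0.5))  # fair coin, maximum uncertainty -> 1.0
print(coin_entropy(1.0))  # two-headed coin, outcome known -> 0.0
```

Any bias between these extremes gives an entropy strictly between 0 and 1 bit.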

In this post, I will look at the concepts of entropy and perplexity on text and images: what they are useful for and how we can use them, all in Python, as you could tell from the IPython notebook.

```
import io
import string
import urllib2
# 3rd Party
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
from scipy import ndimage
```

```
text = "When diversity indices are used in ecology, the types of interest are usually species, but they can also be other categories, such as genera, families, functional types or haplotypes. The entities of interest are usually individual plants or animals, and the measure of abundance can be, for example, number of individuals, biomass or coverage. In demography, the entities of interest can be people, and the types of interest various demographic groups. In information science, the entities can be characters and the types the different letters of the alphabet. The most commonly used diversity indices are simple transformations of the effective number of types (also known as 'true diversity'), but each diversity index can also be interpreted in its own right as a measure corresponding to some real phenomenon (but a different one for each diversity index)"
text = text.translate(string.maketrans("",""), string.punctuation)
words = text.split()
wordset = set(words)
```

```
freq = {word: words.count(word) for word in wordset}
print "Word \t\t Count \t Self Information"
word_count_information = []
entropy = 0
for word in wordset:
    probability = freq[word] / float(len(words))
    self_information = np.log2(1.0 / probability)
    entropy += probability * self_information
    word_count_information.append([word, freq[word], self_information])
sorted_word_count_information = list(sorted(word_count_information, key=lambda k: k[2], reverse=True))
for ii in sorted_word_count_information:
    # Very inelegant way of formatting
    separation = '\t\t' if len(ii[0]) < 7 else '\t'
    if len(ii[0]) >= 15: separation = ''
    print("%s %s %s \t %s" % (ii[0], separation, str(ii[1]), str(ii[2])))
print "\n\nEntropy of complete text: {}".format(entropy)
```

The words that are frequent contribute the least information, while the least frequent ones are the largest contributors. This is expected: the rare words are the most uncertain and carry the highest self-information, whereas frequent words like "the", "a", "an" are the least uncertain.
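Perplexity, mentioned in the title, is simply the exponentiated entropy (with the base matching the logarithm): the effective number of equally likely words, the same construction as the diversity index $D^1$ above. A small Python 3 sketch on a made-up toy sentence:

```python
import numpy as np
from collections import Counter

words = "the cat sat on the mat the cat".split()
counts = Counter(words)
probs = np.array([c / float(len(words)) for c in counts.values()])

entropy = float(np.sum(probs * np.log2(1.0 / probs)))  # in bits
perplexity = 2 ** entropy  # effective number of equally likely words

print(entropy, perplexity)
```

Because the distribution is skewed towards "the" and "cat", the perplexity comes out below the vocabulary size of 5; a perfectly uniform distribution over the 5 word types would give a perplexity of exactly 5.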

```
def get_entropy(signal):
    """ Uses log2 as base
    """
    probability_distribution = [np.size(signal[signal == i]) / (1.0 * signal.size)
                                for i in list(set(signal))]
    entropy = np.sum([pp * np.log2(1.0 / pp) for pp in probability_distribution])
    return entropy
```
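A quick sanity check on `get_entropy` on hand-picked signals (the function is repeated here in Python 3 form so the snippet stands alone):

```python
import numpy as np

def get_entropy(signal):
    """Base-2 entropy of the empirical distribution of values in signal."""
    probability_distribution = [np.size(signal[signal == i]) / (1.0 * signal.size)
                                for i in set(signal)]
    return np.sum([pp * np.log2(1.0 / pp) for pp in probability_distribution])

print(get_entropy(np.array([0, 0, 1, 1])))  # two equally likely values -> 1.0
print(get_entropy(np.array([7, 7, 7, 7])))  # constant signal -> 0.0
print(get_entropy(np.array([0, 1, 2, 3])))  # four equally likely values -> 2.0
```

As with the coin, the entropy is maximal when all values are equally likely and zero when the signal is constant.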

```
image_url = 'http://i.imgur.com/8vuLtqi.png' # Lena Image
fd = urllib2.urlopen(image_url)
image_file = io.BytesIO(fd.read())
img_color = Image.open(image_file)
img_grey = img_color.convert('L')
img_color = np.array(img_color)
img_grey = np.array(img_grey)
```

```
def get_entropy_of_image(img_grey, neighborhood=5):
    dim0, dim1 = img_grey.shape
    # Use a float array: copying the uint8 image would truncate the entropy values
    entropy = np.zeros(img_grey.shape)
    for row in range(dim0):
        for col in range(dim1):
            lower_x = np.max([0, col - neighborhood])
            upper_x = np.min([dim1, col + neighborhood])
            lower_y = np.max([0, row - neighborhood])
            upper_y = np.min([dim0, row + neighborhood])
            area = img_grey[lower_y: upper_y, lower_x: upper_x].flatten()
            entropy[row, col] = get_entropy(area)
    return entropy

# Get the entropy of the image
img_entropy = get_entropy_of_image(img_grey)
```

### On Image

To look at entropy on an image, we can compute the entropy over sliding windows to measure local evenness or uncertainty. From the definition, one would predict that areas with a lot of variance result in higher entropy, and areas with lower variance result in lower entropy.
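This prediction can be checked numerically on synthetic patches (a Python 3 sketch, not part of the original notebook): a constant patch has zero entropy, while a patch of uniformly random 8-bit pixels gets close to the maximum of 8 bits.

```python
import numpy as np

def patch_entropy(patch):
    """Base-2 entropy of the pixel-value histogram of a patch."""
    _, counts = np.unique(patch, return_counts=True)
    probs = counts / float(patch.size)
    return float(np.sum(probs * np.log2(1.0 / probs)))

rng = np.random.default_rng(0)
flat = np.full((32, 32), 128, dtype=np.uint8)                  # no variance
noisy = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)   # high variance

print(patch_entropy(flat))   # -> 0.0
print(patch_entropy(noisy))  # close to 8 bits for 8-bit pixels
```

This is exactly what the per-window entropy map picks up: edges and textured regions light up, smooth regions stay dark.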

For the image part, we will use the *Lena* image to look at its entropy.

```
plt.figure(figsize=(12, 12));
plt.imshow(img_color);
plt.grid(False);
plt.xticks([]);
plt.yticks([]);
```