Similarity via Jaccard Index

February 07, 2017 · ∞ -

At Axial, we had a taxonomy tree for industries and wanted to know if one particular industry is more similar to another industry. The similarity of some of the industries are straightforward if they share a parent, but this similarity is not quantitative and does not produce a metric on how similar the two industries are.

In the search page and other parts of the website, it would be useful for us to be able to compare two different industries whether they belong to the same parent node or not. For example, if user chooses an industry in the on boarding process, we should be able to recommend another industry based on the selected industry. This not only makes sure that user would choose consistent industries taxonomies but also exposes the similar industries she may not know. This is something we planned to do, but let’s look at how this feature is put into the production right now.

Industry Ordering In-App Search

In the app-in search page, when user selects a particular industry in the industry facet search, we want to order the industries of the documents(campaigns, projects and companies) based on the selected industry. Since we do not limit the number of industries for projects, the ordering of the industries for the project would become quite handy to compare against the industry facet.

Similarity in General

For other parts of the website, we could measure the similarity between industries when we want to compare how similar the two documents are in terms of industries they are assigned to or how good the match relationship between campaign and a project. Producing a similarity metric for industries gives a proxy on how similar two documents are.

Industry Similarity via Jaccard Index

In order to do so, we used Jaccard Index to measure similarities between industries based on campaign keywords that are associated to each industry. Let’s review what a Jaccard Index is and then I will explain how it is used to measure similarity between two industries.

Jaccard Index

Jaccard Index is a statistic to compare and measure how similar two different sets to each other. It is a ratio of intersection of two sets over union of them.

$jaccard(A, B) = \frac{A \cap B}{A \cup B}$

If you have representative finite number of elements for a particular observation and you want to compare this observation with another observation, you could count the number of items that are common to both of these two sets. It is a natural fit for comparing posts if you know the representative tags for the posts to measure how similar two articles are in terms of tags.

Its Python implementation is pretty trivial.

def jaccard_index(first_set, second_set):
    """ Computes jaccard index of two sets
        Arguments:
          first_set(set):
          second_set(set):
        Returns:
          index(float): Jaccard index between two sets; it is 
            between 0.0 and 1.0
    """
    # If both sets are empty, jaccard index is defined to be 1
    index = 1.0
    if first_set or second_set:
    index = (float(len(first_set.intersection(second_set))) 
             / len(first_set.union(second_set)))

    return index

first_set = set(range(10))
second_set = set(range(5, 20))
index = jaccard_index(first_set, second_set)
print(index) # 0.25 as 5/20

Industry Similarity

I talked about a little bit on our industry taxonomy earlier; not only we wanted to expose in different ways but also we want to measure how similar they are to each other. The problem is that we did not have a good way to compare two different industries. Since our taxonomy structure is a tree, we could group them by their parent nodes but not necessarily compare and measure how similar they are in a robust and reliable way.

Instead of coming up a heuristic based approach to measure similarity, we decided to use the user data. We have keywords for campaigns that our members created. When members create a Campaign, they could enter a set of keywords along with industries that they chose. We decided to use this information to compare and measure a similarity between industries using campaign keywords.

The idea is simple; if two industries have a lot of common keywords given a campaign a profile, then the chances are they are closely related. As our members choose similar keywords for those industries to represent their campaigns, the likelihood of similarity only increases. By using campaign keywords, we not only reduce the dimensionality of the text(generally the descriptions are much longer than campaign keywords) but also, we do already a feature selection as the campaign keywords should be much more descriptive, dense and rich in information than the descriptions.

Industry Similarity by Jaccard Index

In order to build an industry similarity measure, we first assigned the campaign keywords to each industries. Then, for a given industry, we could compute the jaccard index in a very straightforward manner. But what if we want to compare multiple industries against all of the industries that we have in the database? We could still could use the jaccard index for multiple industry comparison even if it is not formally defined for multiple sets.

However, one can generalize Jaccard index which is very easy to do since what we do is only intersection and union operations across different sets, we could do this among multiple industries in our examples in the following:

$jaccard(A_1, \ldots, A_n) = \frac{A_1 \cap A_2, \ldots, A_{n_1} \cap A_n}{A_1 \cup A_2, \ldots, A_{n-1} \cup A_n}$

This is pretty neat. Note that set order does not matter(icing on the cake).

What was available?

When we already have the intent of the user(industry facet), it is relatively easy to put that industry in the first place and the remainder industries would follow the first industry.

When user chooses Aerospace & Defense before industry ordering, we were displaying the industries of the documents in no particular order:

With industry ordering, we now sort the industries by similarity:

Before “wine” search in Distillers & Vintners:

After industry ordering is in place:

As mentioned before, this ordering is easy to extend to multiple industries as well: