Similarity in the Wild
This post has been published in Axial’s blog post where I used to work in search engine and recommendations. However, the original post has been removed and now I am republishing as I wrote the original post.
Finding similarity across observations is one of the most common tasks/projects something I regulary am tasked for. Collaborative Filtering purely depends on finding similar items(videos for Netflix, products for Amazon) for users. If you are doing a classification task with KNN(K Nearest Neighbor), you are classifying the new observations purely on the distance that you have in the training set. Most of the instance based learning algorithms in one way or another is built on the similarity distances of observations. Clustering algorithms (k-means, manifold learning) depends on the distance between observations. It is very crucial to get the instance representation right as well as to compute the distances between those feature representation to be able to compute the similarity between two observations right.
This is more of an art than an exact science as the feature representaiton as well as the distance metric is highly dependent on the domain and the problem. Also, it depends on what you want to do after estimating similarity between observations. For example, if you are clustering a number of observations, you would be more flexible around errors and differences in computing between observations where if you want to classify the observations into two different classes, you may want to choose another approach.
Merriam Webster defines similarity as following:
a quality that makes one person or thing like another
So we want to find items that are similar to each other. But we need to first answer what an item is (Document representation) and how we will measure an item with other items(Distance Metric).
In order to measure the similarity between two observations, all of the observations should be defined in the same way(using a feature extraction method) to build a feature vector and a distance function which measures the distance between these feature vectors.
Document Representation or Feature Extraction
We have three different types in our observations(documents); tp, opp and companies. ‘tp’ stands for transaction profiles, ‘opp’ stands for opportunity profiles and ‘company’ stands for company(surprise!).
We are using a modified version of Topic-Sensitive Page-Rank to represent our documents regardless of their types. Not considering the types of the documents allow us to have same representation vectors that we could compare the documents regardless of their types.
Recently, we introduced Company Introduction feature for the tp owners to recommend companies that are registered to Axial. In order to do so, we need to find “similar companies” that are close to a given tp id. We have also boolean filters that we could use(we are filtering based on company type and industries in future), but after filtering, it pretty much depends on how similar a tp and a company.
If feature extraction is an important step in any part of machine learning, the distance metric would be the second one. You could have the best feature vectors in the world, but if the distance metric you choose does not make a lot of sense for your feature set or the dimensions in the feature vectors, then the similarity would not make much sense.
For probability distributions, there are many ways to measure the distance(or similarity); $ l_p $ distances ($ l_1 $, $ l_2 $, Chebyshev), cosine, correlation, span-norm, Bhattacharyya, Hellinger and Jensen Shannon Divergence. Based on some experimentation, we decided to use Jensen Shannon Divergence(JSD) to measure the distance between documents.
Let’s talk about a little what JSD actually is.
Jensen Shannon Divergence is nothing but an average of two KL Divergence of two probability distributions with an average of the probability distributions. Its formula is in the following:
This is a nice way to measure the difference between a probability distribution comparing to Y which is a reference distribution. One way to reason about this distance metric is to assume two probability distributions are exactly the same. Then, would be zero. They are exactly same, so the distance is 0. Why , you may ask and that is related to information theory. KL Divergence is also called relative entropy, so one could think the KL divergence as how much information is gained from X assuming that Y is known. If they are same, information gain is zero.
KL Divergence is very nice in terms of what it measures, but it is not a metric that we could depend on. Why is that? It is hidden inside its asymmetric nature. $ KL(X || Y) KL(Y || X) $ and that is a big problem as we cannot create a proper measure of between two observations without considering which is the reference and which one is the one that we measure the distance between the reference vector. In order to prevent this issue, there is a Symmetrised version(well sort of) which just sums up two different KL divergence between each other. $ KL(X || Y) + KL(Y || X) $ in order to reach a version of KL Divergence(which is symmetric), but we have another way to measure the distance as well, which is most probably obvious at this point.
Instead of looking at the distance between probability distributions to each other, what if we measure an average of them with every single of the probability distribution in order to have a symmetric distance metric.
and it does not matter the order anymore:
For a single TP, we first filter out the ones that do not obey the “criteria”(boolean filtering) and then compute the JSD Divergence of a TP for the target documents. The companies that are closest to the TP are our candidates that we should introduce them to the TP owner.