Topic Modeling for Keyword Extraction
This post was originally published at Axial’s blog, but the blog’s domain has been expired so I wanted to revive it as I wrote it originally.
I added couple of sections to the blog post since then.
Who is not frustrated on taxonomy more? Is it the user what does not know anything about the taxonomy and classification schemas in the marketplace or the people who try to classify and organize the products into various categories based on some rule?
Or is it the merchant who tries to sell the product does not know anything about the intermediary distribution’s classification rules?
Why do we need the classification anyways?
The classification is needed because we want to group similar products to make the browsing easy, but what if the classification or the taxonomy does not make a lot of sense.
We were frustrated about visibility of our taxonomy for industries to the user. If one of our member wanted to do a search in Axial for a particular field, they needed to know the exact taxonomy name that we use for that field. For example, if one wants to search wood and wood products, they need to know that those would fall in our “Forest Products” taxonomy, which is not an obvious thing when a user wants to do a search in our website.
Not only this limits the query capabilities of the user but also it degrades our search results as we do not know which industry they are interested in from their search query.
In order to tackle this problem and get the following benefits (listed below), we use topic modeling for a number of documents to extract topics and mine phrases to provide a better typeahead functionality to the user.
- How do you extract phrases and keywords from a large number of documents?
- How do you find recurring themes and topics from a corpus without using any metadata information(labels, annotation)?
- How do you cluster a number of documents efficiently and make sure that clusters would be coherent themes?
Topic modeling is a method which is an unsupervised learning and clustering method which enables us to do things listed above.
If you want to deconstruct a document based on various themes it has, as shown above in the image, it is a great tool to explore topics and themes. In the image, every color corresponds to a particular theme and all of the themes have various words. But what does a topic look like?
Topics as Word Distributions
When you see the following words, what do you think:
wood pellet pellets energy biomass production tons renewable plant million fuel forest management heating development carbon facilities
if you think forest, wood or paper, you would be right. These are the subset of words that are extracted from Forest Products industry in opportunities that our members created.
Previously, if our members wanted to search a particular industry, they needed to know the exact name of the industry in order to see the typeahead match in the search bar. We do matching by Named Entity Recognition in Query (NERQ) but it is limited to the exact keyword match in industries.
For example, if they want to do a search related to the “wine” industry, they need to know our taxonomy which corresponds to that industry which is “Distillers and Vintners”. Or, if they want to do a general search related to “shoes”, they need to know that we have a “Footwear” industry.
In order to remedy this problem, we expanded our industry matching to a larger number of words so that we could match “related” and “relevant” keywords to our taxonomies. When a user types in “wine”, we are able to match that keyword to our related taxonomy of “Distillers and Vintners”.
Topic Modeling for Keyword Extraction
We used topic modeling for keyword and phrase extraction using user generated documents that are classified by industry. This provides three main benefits. First, all of the keywords are data-driven and human generated. Second, since every document is associated with various industries, we do not need to associate the documents one by one to each topic, we could mine the keywords and phrases per industry. Last but not least, we could use the industry information as input to our topic sensitive ranking algorithm to improve on search precision.
We created a set of keywords/phrases (around 4000) to expand the matching between what a user types and which industry it will match. Since most of the keywords and phrases are descriptive of the industry itself, they would be intuitive to a user.
The grouping of relevant words is highly suggestive of an abstract theme which is called a topic. Based on the assumption that words that are in the same topic are more likely to occur together, it is possible to attribute phrases or keywords to a particular topic. This allows us to alias a particular topic with a number of phrases and words.
Not all words are created equal
As we are more interested in the more thematic and somehow specific topics, we are not necessarily interested in words that do not contribute a lot to various topics. Usual suspects are the articles (a, an, the) pronouns (I, you, she, he, we, …), prepositions(in, under, of, ..) and also common adverbs and more often than not verbs.
Oh, also the adjectives:
When you catch an adjective, kill it. No, I don’t mean utterly, but kill most of them–then the rest will be valuable. They weaken when they are close together. They give strength when they are far apart. — Mark Twain
Not only they do not contribute to the topics/themes at all, but also they disrupt the word distributions in each topic. Due to these reasons, common words should be removed prior to topic modeling. This is the first rule of thumb. We also removed rare word occurrences, smaller than 3 in the corpus, with the understanding that rare words do not materially contribute to topic distinction. This provides two additional benefits; first we do not have to deal with a large corpus as word distributions in corpora usually have long tails, second we do not unnecessarily do computations on the words classified as unimportant.
Unsupervised Nature of Topic Models
The topic models are unsupervised, i.e. they do not require any prior information on the documents of the corpus, e.g. descriptive labels or other classifications. It infers various topics and themes operating purely on the documents. It is a very powerful and useful tool for quickly exploring various themes in a large corpus. For example, if you are searching for a number of documents that are about one particular theme, e.g. “internet of things”, you want to get the documents that are about that theme (by increasing the recall) rather than documents including the exact phrase of “internet of things”.
By doing so, we created a set of keywords/phrases(around 4000), compare with our industries(around ~200) to map and when you type “wine” in the search bar, you would get “Distillers and Vintners” industry.(yeah, it is hard to guess)
Or, when you type “search engine” in search(so meta):
Adjectives are not so bad
Remember the adjectives, how useless they are for topic modeling. They could come handy in the conversations:
A man’s character may be learned from the adjectives which he habitually uses in conversation. — Mark Twain