Machine Learning Newsletter

Building a Language Detector with Scikit-Learn

The share of English across all languages on the web is decreasing and will likely continue to do so in the coming years. In 2006, nearly all tweets were written in English, whereas by 2013 only about half were, as Japanese, Spanish, Portuguese and other languages increased their share of tweets. This trend will likely continue if we look at population figures (excluding China, which has its own networks): India, Russia, Brazil (Portuguese) and Germany do not yet contribute in proportion to their populations (based on the assumption that smartphones will become ubiquitous and that most people with internet access will have access to a smartphone as well).

In [15]:
%matplotlib inline
from IPython.display import Image
Image(url='http://i.imgur.com/Kr5sfJ8.png')
Out[15]:
In [16]:
Image(url='http://i.imgur.com/ZQ5yD96.png')
Out[16]:

The data is from Gnip.

So, as developing economies contribute more and more to the web (social media is just one example of this trend), languages, localization and machine translation will play an increasingly important role for consumer-oriented companies, especially for startups that are trying to "grow" and protect their market share. This has already happened in the "sharing economy", most notably with Airbnb and Uber.

To provide support and a variety of other services in a multilingual consumer environment, companies first need to figure out which language a user is complaining about their product or service in. They also need to translate features from English into the local language.

Language identification and detection is also fundamental for nearly every natural language processing task. You need to know the language even for something as basic as stemming the text.

Building language detection or identification is hard if you approach it as a keyword search against per-language dictionaries. There are numerous reasons for that: languages borrow words from each other, and the same word may carry two different meanings in two different languages. Spelling mistakes, jargon and abbreviations (especially on Twitter) that do not exist in any dictionary make it even harder.

Yet, with data and machine learning, I could build a relatively good language detector. As long as the data I apply the classifier to is more or less along the lines of the Wikipedia dataset, it should generalize well and do a good job.

In [53]:
#!/usr/bin/env python
# -*- coding: latin-1 -*-

import matplotlib.pyplot as plt
import numpy as np
import scipy
import seaborn as sns

from sklearn import ensemble
from sklearn import feature_extraction
from sklearn import linear_model
from sklearn import pipeline
from sklearn import cross_validation
from sklearn import metrics

# Custom load module that reads the text instances and their labels from disk
import load
In [25]:
X, y, label_names = load.get_instances_from_directory('data/text')

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
                                                                     y,
                                                                     test_size=0.2,
                                                                     random_state=0)

For the dataset, I compiled Arabic (ar), Chinese (cn), German (de), English (en), Spanish (es), French (fr), Russian (ru) and Turkish (tr) text, mostly from historical articles on Wikipedia. I also used some very common nouns (water and milk) alongside the historical texts. I chose 10-12 concepts and fetched the web pages for those concepts across all the languages.

I chose Wikipedia because wiki articles in different languages share a similar page format, so it is relatively easy to scrape data from the site no matter what the language is. The URL construction is very simple: as long as you know the prefix symbol for the language, you append the article title to the URL and that is about it.
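
Here is a minimal sketch of that URL scheme. The build_wiki_url helper is a hypothetical illustration of my own, not part of the load module used above.

In [ ]:
# A minimal sketch of the Wikipedia URL scheme described above.
# build_wiki_url is a hypothetical helper for illustration only.
def build_wiki_url(lang_prefix, title):
    return u'https://{}.wikipedia.org/wiki/{}'.format(lang_prefix, title)

# e.g. https://de.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart
print(build_wiki_url('de', 'Wolfgang_Amadeus_Mozart'))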

One generally problematic part is that you need to use an encoding other than ASCII. If you are doing this task in Python 3, I think you are better off. But for this IPython notebook, I am using Python 2.7.9 with latin-1 encoding.
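
As a small illustration of the encoding issue (my own addition, not from the original scraping code): in Python 2, a raw UTF-8 byte string has to be decoded into a unicode object before further processing.

In [ ]:
# Python 2: raw UTF-8 bytes must be decoded into unicode before use.
raw_bytes = 'Gr\xc3\xbc\xc3\x9fe aus M\xc3\xbcnchen'  # UTF-8 encoded str
text = raw_bytes.decode('utf-8')                      # unicode object
print(text)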

Let's look at some examples in the dataset:

Ar

In [3]:
print(X[0])
ويكيبيديا (تلفظ [wiːkiːbiːdijaː] وتلحن [wikipiːdia] ؛ تلفظ بالإنجليزية /ˌwɪkiˈpiːdi.ə/ ) هي مشروع موسوعة متعددة اللغات، مبنية على الويب ، ذات محتوى حر، تشغلها مؤسسة ويكيميديا ، التي هي منظمة غير ربحية . ويكيبيديا هي موسوعة يمكن لأي مستخدم تعديل وتحرير وإنشاء مقالات جديدة فيها.

De

In [4]:
print(X[2000])
Am 28. Jänner 1756 – einen Tag nach seiner Geburt – wurde Mozart auf die Namen Joannes Chrysostomus Wolfgangus Theophilus getauft. Der erste und letzte der genannten Vornamen verweisen auf den Taufpaten Joannes Theophilus Pergmayr, Senator et Mercator Civicus , der mittlere Vorname Wolfgang auf Mozarts Großvater Wolfgang Nicolaus Pertl. Das griechische Theophilus („ Gottlieb “) hat Mozart später in seine französische Entsprechung Amadé bzw. (selten) latinisierend Amadeus übersetzt.

Cn

In [6]:
print(X[1060])
2011年,发射了首个试验型 空间站 天宫一号 ,并成功完成与后续发射的 神舟八号 飞船的对接,成为继 前苏联 及 美国 之后,第三个有能力独自发射 空间站 的国家。2012年,成功发射了 神舟九号 载人飞船,并首次将女航天员送上太空,完成了与 天宫一号 的自动,手动空间交会对接。又成功建立了 北斗卫星定位系统 ,其在民用及军用上均发挥了重要作用。

En (surprise!)

In [7]:
print(X[3000])
195,046 foreign nationals became British citizens in 2010, [ 348 ] compared to 54,902 in 1999. [ 348 ] [ 349 ] A record 241,192 people were granted permanent settlement rights in 2010, of whom 51 per cent were from Asia and 27 per cent from Africa. [ 350 ] 25.5 per cent of babies born in England and Wales in 2011 were born to mothers born outside the UK, according to official statistics released in 2012. [ 351 ]

Es

In [8]:
print(X[4000])
La intervención romana se produjo en la Segunda Guerra Púnica (218 a. C.), que inició una paulatina conquista romana de Hispania , no completada hasta casi doscientos años más tarde. La derrota cartaginesa permitió una relativamente rápida incorporación de las zonas este y sur, que eran las más ricas y con un nivel de desarrollo económico, social y cultural más compatible con la propia civilización romana. Mucho más dificultoso se demostró el sometimiento de los pueblos de la Meseta, más pobres ( guerras lusitanas y guerras celtíberas ), que exigió enfrentarse a planteamientos bélicos totalmente diferentes a la guerra clásica (la guerrilla liderada por Viriato —asesinado el 139 a. C.—, resistencias extremas como la de Numancia —vencida el 133 a. C.—). En el siglo siguiente, las provincias romanas de Hispania , convertidas en fuente de enriquecimiento de funcionarios y comerciantes romanos y de materias primas y mercenarios, estuvieron entre los principales escenarios de las guerras civiles romanas , con la presencia de Sertorio , Pompeyo y Julio César . La pacificación ( pax romana ) fue el propósito declarado de Augusto , que pretendió dejarla definitivamente asentada con el sometimiento de cántabros y astures (29—19 a. C.), aunque no se produjo su efectiva romanización. En el resto del territorio, la romanización de Hispania fue tan profunda como para que algunas familias hispanorromanas alcanzaran la dignidad imperial ( Trajano , Adriano y Teodosio ) y hubiera hispanos entre los más importantes intelectuales romanos (el filósofo Lucio Anneo Séneca , los poetas Lucano , Quintiliano o Marcial , el geógrafo Pomponio Mela o el agrónomo Columela ), si bien, como escribió Tito Livio en tiempos de Augusto, "aunque fue la primera provincia importante invadida por los romanos fue la última en ser dominada completamente y ha resistido hasta nuestra época", atribuyéndolo a la naturaleza del territorio y al carácter recalcitrante de sus habitantes. La asimilación del modo de vida romano, larga y costosa, ofreció una gran diversidad desde los grados avanzados en la Bética a la incompleta y superficial romanización del norte peninsular.

Fr

In [9]:
print(X[6000])
Il est, par la suite, déchu par le Sénat le 3 avril et exilé à l’ île d’Elbe , selon le traité de Fontainebleau signé le 11 avril, conservant le titre d’Empereur [ 45 ] mais ne régnant que sur cette petite île. Son convoi de Fontainebleau jusqu'à la Méditerranée avant son embarquement pour l'île d'Elbe passe par des villages provençaux royalistes qui le conspuent, il risque d'être lynché à Orgon , ce qui l'oblige à se déguiser [ 46 ] .

Ru

In [10]:
print(X[8000])
Также насчитывается 7 изолированных и 9 неклассифицированных языков . К наиболее популярным исконно африканским языкам относятся языки банту ( суахили , конго ), фула .

Tr

In [11]:
print(X[-10])
Yerel çeşitlere ve bunların arasında karşılıklı alışveriş ve zenginleştirmeye dayalı olması dünyanın herhangi bir büyük mutfağı için olağandır. Ama aynı zamanda büyükşehir geleneğinin zarif tadı ile homojenize ve uyumludur. [129]

In [23]:
vectorizer = feature_extraction.text.TfidfVectorizer(ngram_range=(1, 6),
                                                     analyzer='char',)
#                                                    use_idf=False)

pipe = pipeline.Pipeline([
    ('vectorizer', vectorizer),
    ('clf', linear_model.LogisticRegression())
])

pipe.fit(X_train, y_train)

y_predicted = pipe.predict(X_test)

cm = metrics.confusion_matrix(y_test, y_predicted)

I am using the char analyzer, which extracts features at the "character" level rather than the word level. This is important for many reasons: languages generally have various prefixes, suffixes and other small word parts that are specific to them. For suffixes, "-able", "-ness", "-sion/-tion" and "-ist" are very common; for prefixes, "a-", "co-", "counter-", "ex-", "dis-", "mal-", "pre-", "under-", "up-", "re-" and "pro-". If I had used word-based tokenization followed by bag of words or tf-idf, I would not be able to capture these small language-specific parts, which are quite important for language detection.
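
To make that concrete, here is a small sketch (my own addition, not from the original notebook) that inspects the character n-grams such a vectorizer extracts from a couple of English words:

In [ ]:
# Inspect the character n-grams a char analyzer extracts (illustration only).
demo_vectorizer = feature_extraction.text.CountVectorizer(analyzer='char',
                                                          ngram_range=(1, 3))
demo_vectorizer.fit([u'readable', u'kindness'])
# vocabulary_ maps each extracted n-gram to its column index
print(sorted(demo_vectorizer.vocabulary_.keys())[:15])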

If I step back for a second: it would be overly optimistic to expect the training set to cover all of the words and their various forms in a language detection problem. More often than not, the useful features depend not on the words themselves but on the parts that are specific to each language. Those parts create sufficiently discriminative features and make the classifier much better. Character n-grams are perfect for picking up those parts.

Of course, you could try a word-level bag of words on this dataset and fail miserably. It is free to try, give it a shot!
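
If you want to try, here is a sketch of that word-level baseline (my own addition, not the original experiment): swap the char analyzer for the default word analyzer and compare the scores yourself.

In [ ]:
# Word-level bag-of-words baseline for comparison (illustration only).
word_pipe = pipeline.Pipeline([
    ('vectorizer', feature_extraction.text.TfidfVectorizer(analyzer='word')),
    ('clf', linear_model.LogisticRegression())
])
word_pipe.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, word_pipe.predict(X_test)))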

Seaborn has a heatmap function which I am fond of for visualizing confusion matrices. Very neat.

In [26]:
plt.figure(figsize=(16, 16))
sns.heatmap(cm, annot=True,  fmt='', xticklabels=label_names, yticklabels=label_names);
plt.title('Confusion Matrix for Languages');

Pretty good. There is some confusion between Spanish and English, but other than that the classification is quite successful.

The colorbar and the heat mapping do not make a lot of sense for the confusion matrix, as the number of documents varies for each language, but it is a nice way to see how many documents were confused and with which language. So far, it is quite successful.

In [27]:
print(metrics.classification_report(y_test, y_predicted,
                                    target_names=label_names))
             precision    recall  f1-score   support

         ar       1.00      0.99      0.99       152
         cn       0.91      1.00      0.95       148
         de       1.00      0.98      0.99       217
         en       0.93      0.97      0.95       232
         es       0.98      0.92      0.95       265
         fr       0.99      1.00      0.99       291
         ru       1.00      0.99      1.00       312
         tr       1.00      1.00      1.00       117

avg / total       0.98      0.98      0.98      1734

I could of course look at the percentages of documents as well, to make the heatmap more meaningful, as the counts would then be normalized per language.

In [33]:
# Normalize each row (true language) by its document count so that rows sum to 100
percentage_matrix = 100 * cm / cm.sum(axis=1)[:, np.newaxis].astype(float)
In [35]:
plt.figure(figsize=(16, 16))
sns.heatmap(percentage_matrix, annot=True, fmt='.2f', xticklabels=label_names, yticklabels=label_names);
plt.title('Confusion Matrix for Languages (Percentages)');

Since the pipeline contains both the vectorizer and the classifier, I can feed raw text to it and get predictions directly.

In [51]:
text_blob = """ Je ne dis pas ce que je faisais
                Ich habe nich erzahlt was ich gemacht habe
                Ne yaptığımı söylemedim
                Yo no dije lo que hice
                I did not say what I have done
            """

# Predict the result on some short new sentences:
# Ascii cannot encode the string, convert explicitly to unicode
# Ignore the empty string
sentences = [unicode(ii.strip(), 'utf-8') for ii in text_blob.split('\n') if not ii.strip() == '']
# I could pass raw data, as pipe has all of the parts, pretty neat!
predicted_languages = pipe.predict(sentences)

for sentence, lang in zip(sentences, predicted_languages):
    print(u'{} ----> {}'.format(sentence, label_names[lang]))
Je ne dis pas ce que je faisais ----> fr
Ich habe nich erzahlt was ich gemacht habe ----> de
Ne yaptığımı söylemedim ----> tr
Yo no dije lo que hice ----> es
I did not say what I have done ----> en

Not bad, huh? By the way, for the curious: all of the sentences mean the same thing as the English one (not sure about the French, though).

Some questions to tinker on

  • Which languages are more likely to be confused, and why?
  • Are two languages more likely to be confused if they share the same alphabet?
  • Some languages use a variant of the Latin alphabet; could you think of an extra feature that exploits language-specific letters? (See the sketch after this list for one way to start.)
  • How do character n-grams contribute to language detection for highly agglutinative languages like Turkish?
  • How could personal pronouns be used to detect the language? For languages where personal pronouns are optional and the information is carried by the verb (like Spanish and Turkish), could this be a discriminative feature?
  • Why don't editors recommend verb forms that agree with the pronoun in languages where the verb either agglutinates according to the subject (like Turkish) or changes its form (like Spanish)? I think this could be a killer feature for people who do not know the language very well. When I write "Yo decir", there should be a dropdown that suggests different forms of "decir" in a variety of tenses, like "digo", "dijo" and so on.
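
For the question about language-specific letters, here is a rough starting point. This is my own illustration, not part of the original notebook; the letter sets below are just a few tell-tale characters and are neither complete nor fully exclusive to one language.

In [ ]:
# Count a handful of "tell-tale" characters per language (illustration only;
# the letter sets are incomplete and partly overlap across languages).
SPECIAL_LETTERS = {u'es': u'ñ¿¡', u'de': u'ßäö', u'tr': u'ğşı', u'fr': u'çêœ'}

def special_letter_counts(text):
    # For each language, count how many of its tell-tale characters occur.
    return {lang: sum(text.count(ch) for ch in letters)
            for lang, letters in SPECIAL_LETTERS.items()}

print(special_letter_counts(u'Yo no dije lo que hice, ¿y tú?'))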