Topic Modeling
for Amnesty International Data

@DataKind

Hi!

Bugra Akyildiz

Data Scientist at Axial

@bugraa


Machine Learning Newsletter | mln.io


bugra@nyu.edu
Slides: http://bit.ly/next-ml-boston-2015

Axial

A network that brings private companies and investors together

Gives business owners access to private capital markets

We are hiring! | axial.net

DataKind

Data science in the service of humanity

Nonprofit: NGOs, governments and such


http://www.datakind.org/

Amnesty International

Global movement of people fighting injustice and promoting human rights


http://www.amnestyusa.org/

Data science for humanity and profit

Data

Amnesty International Data


Complaints

  • Fear of Safety
  • Getting threats
  • ...

News

  • Conviction
  • Execution
  • Disappearance
  • ...

Example News

Another One

Machine Learning Model

Topic Modeling


The unsupervised learning method you apply to a bunch of text when you have no idea what to do with it

"Topic model" is an umbrella name for a suite of graphical models that discover topics or themes in a collection of documents.

Why Topic Model?

  • Useful to explore the documents
  • Easy to apply to unstructured documents
  • Unsupervised, no need for labels
  • Applicable to any medium-to-long-form text

Latent Dirichlet Allocation

### LDA

- $M$: Number of documents
- $N$: Number of words in a document
- $\alpha$: Dirichlet parameter on the document-topic distributions
- $\beta$: Dirichlet parameter on the topic-word distributions
- $\theta_i$: Topic distribution of document $i$
- $\phi_k$: Word distribution of topic $k$
- $w_{ij}$: $j$th word of the $i$th document
- $z_{ij}$: Topic assignment of word $w_{ij}$

### Observed

- $w_{ij}$

### How?

- Generate $\theta_i \sim \mathrm{Dir}(\alpha)$ for each document $i$
- Generate $\phi_k \sim \mathrm{Dir}(\beta)$ for each topic $k$
- For each document $i$ and word position $j$:
  - Choose a topic $z_{ij} \sim \mathrm{Multinomial}(\theta_i)$
  - Choose a word $w_{ij} \sim \mathrm{Multinomial}(\phi_{z_{ij}})$
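The generative story above can be sketched in a few lines of standard-library Python. This is only an illustration of how LDA assumes documents are produced, not the inference code used in the project. The vocabulary, the helper names `sample_dirichlet` and `generate_corpus`, and all parameter values are assumptions made up for this example.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`."""
    # Independent Gamma(alpha, 1) draws, normalized to sum to 1, follow a Dirichlet.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, num_topics, num_docs, words_per_doc, alpha, beta, seed=0):
    """Generate documents by following the LDA generative process."""
    rng = random.Random(seed)
    # phi_k ~ Dir(beta): one word distribution per topic.
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    docs, thetas = [], []
    for _ in range(num_docs):
        # theta_i ~ Dir(alpha): the topic distribution of this document.
        theta = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(words_per_doc):
            # z_ij ~ Multinomial(theta_i): pick a topic for this word position.
            z = rng.choices(range(num_topics), weights=theta)[0]
            # w_ij ~ Multinomial(phi_{z_ij}): pick a word from that topic.
            words.append(rng.choices(vocab, weights=phi[z])[0])
        docs.append(words)
        thetas.append(theta)
    return docs, thetas, phi


# Hypothetical vocabulary and parameter values, for illustration only.
vocab = ["threat", "safety", "conviction", "execution", "disappearance"]
docs, thetas, phi = generate_corpus(
    vocab, num_topics=2, num_docs=3, words_per_doc=8, alpha=0.5, beta=0.5
)
for doc in docs:
    print(" ".join(doc))
```

Fitting LDA runs this story in reverse: only the words are observed, and $\theta$, $\phi$ and $z$ are inferred from them.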

Simple LDA Illustration

Topic Keywords

How Similar Are the Topics?
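The slides do not say which similarity measure was used. One common option for comparing topics is the Hellinger distance between their word distributions, sketched here under that assumption. The three topic-word distributions are made-up numbers, not output from the Amnesty model.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same vocabulary.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no overlapping support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def pairwise_distances(topics):
    """Symmetric matrix of Hellinger distances between all pairs of topics."""
    return [[hellinger(p, q) for q in topics] for p in topics]


# Hypothetical topic-word distributions over a four-word vocabulary.
topics = [
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
]
distances = pairwise_distances(topics)
for row in distances:
    print(" ".join(f"{d:.2f}" for d in row))
```

Topics 0 and 1 put their mass on the same words, so they come out much closer to each other than either is to topic 2.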

### Dirichlet Distribution

Alpha=0.1

Alpha=1

Alpha=10

Alpha=100
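To get a feel for what the alpha values above do, one can sample from a symmetric Dirichlet and look at how concentrated each sample is. The sketch below uses only the standard library. The helper names, the dimension of 3 and the sample count are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension `size`."""
    # Normalized independent Gamma(alpha, 1) draws follow a Dirichlet distribution.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_component(alpha, size=3, num_samples=2000, seed=0):
    """Average of the largest coordinate over many Dirichlet samples.

    Values near 1 mean samples sit in a corner of the simplex (sparse);
    values near 1/size mean samples sit close to the uniform distribution.
    """
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(alpha, size, rng)) for _ in range(num_samples))
    return sum(largest) / num_samples


concentration = {alpha: mean_largest_component(alpha) for alpha in (0.1, 1, 10, 100)}
for alpha, value in concentration.items():
    print(f"alpha={alpha}: mean largest component = {value:.2f}")
```

Small alpha pushes each document toward a single dominant topic, while large alpha spreads the mass almost evenly across topics.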

Topics

https://datadive.herokuapp.com/

More in the Amnesty Repository

https://github.com/datakind/amnesty

Topics Represented as Graph

https://datadive.herokuapp.com/graph_clusters

Every color represents a topic

## Total Taxonomy

```python
event_types = {
    'event_1': "Threat",
    'event_2': "Action",
    'event_3': "Country",
    'event_4': "Person/Group",
    'event_5': "Issue",
    'event_6': "Human Rights Protest Actions"
}

# Classification Types, Only apply to events 1 and 2
class_types = {
    'class_1': "Death",
    'class_2': "Incarceration with Impending Death",
    'class_3': "Violence",
    'class_4': "Incarceration",
    'class_5': "Ill-treatment",
    'class_6': "Forced Movement",
    'class_7': "Threat of Death",
    'class_8': "Threat of Violence ",
    'class_9': "Threat of Incarceration",
    'class_10': "Risk",
    'class_11': "legal issues",
    'class_12': "Health Concern",
    'class_13': "Abduction",
    'class_14': "Corporate Abuse",
    'class_15': "Discrimination",
    'class_16': "Torture",
    'class_17': "Threat of Torture"
}
```

Fear of Safety

Execution

Disappearance

If all you have is a hammer, everything looks like a nail

Not everything has to be a clustering problem.

Know its limitations

Variants

Correlated Topic Models

Dynamic Topic Models

Supervised Topic Models

Turbo Topics

...

Tools

  • Factorie (Scala)
  • Mallet (Java)
  • Gensim (Python)
  • Stanford Topic Modeling Toolbox (Scala)


Blei's Topic Modeling Page

Questions?
